]>
Identifying a document's logical sections and organizing them into a standard structure reveals the document's semantic structure. Such an understanding benefits a variety of applications, such as information extraction, document summarization, and question answering, and it also enables users to navigate quickly to sections of interest.
We aim to segment large and complex PDF documents into sections automatically and to annotate each section with a semantic, human-understandable label. These labels are intended to capture both the general-purpose and the domain-specific semantics of a large document, and to be consistent across documents.
We developed powerful yet simple approaches to build our framework, using layout information and text content extracted from documents such as scholarly articles and RFP documents. The framework has four units: a Pre-processing Unit, an Annotation Unit, a Classification Unit, and a Semantic Annotation Unit. We developed state-of-the-art machine learning and deep learning architectures. We also experimented with Latent Dirichlet Allocation (LDA) for semantic concept identification and with the TextRank and TensorFlow Textsum models for document summarization. We mapped each section to a semantic name using a document ontology.
We aimed to develop a generic, domain-independent framework. We used scholarly articles from the arXiv repository and RFP documents from RedShred. We evaluated the performance of our framework using several evaluation metrics, such as precision, recall, and F1 score. We also analyzed and visualized the results in the embedding space. We have made available a dataset describing a collection of scholarly articles from the arXiv e-prints, which includes a wide range of metadata for each article, including a table of contents (TOC), section labels, section summaries, and more.
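As an illustration of the evaluation metrics named above, the following minimal sketch computes per-label precision, recall, and F1 score for a section classifier. The function name `per_label_scores`, the label names, and the toy gold/predicted lists are hypothetical and invented for this example; they are not taken from the framework or its dataset.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct prediction for label g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was the true label but was missed

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data (assumed, for illustration only): one "method" section is
# misclassified as "results".
gold = ["introduction", "method", "method", "results", "conclusion"]
predicted = ["introduction", "method", "results", "results", "conclusion"]

scores = per_label_scores(gold, predicted)
for label, (p, r, f1) in scores.items():
    print(f"{label:12s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case the single confusion lowers recall for "method" and precision for "results", while leaving the other labels unaffected.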
]]>