Understanding the Logical and Semantic Structure of Large Documents

Current language understanding approaches focus mostly on small documents, such as newswire articles, blog posts, and product reviews. Understanding and extracting information from large documents such as legal documents, reports, proposals, technical manuals, and research articles remains a challenging task, because these documents may be multi-themed, complex, and cover diverse topics. The content may be split across multiple files or aggregated into one large file, so different parts of the same document can have different structures and formats. Furthermore, the information is expressed in many forms, such as paragraphs, headers, tables, images, mathematical equations, or nested combinations of these structures.

Identifying a document's logical sections and organizing them into a standard structure not only reveals the document's semantic structure but also lets users quickly navigate to sections of interest. Such an understanding of document structure significantly benefits a variety of applications, including information extraction, document summarization, and question answering.

We intend to automatically segment large and complex PDF documents and annotate each section with a semantic, human-understandable label. Our semantic labels are intended to capture both the general purpose and the domain-specific semantics of each section. In a nutshell, we aim to automatically identify and classify the semantic sections of documents and assign them consistent, human-understandable labels.

We developed powerful, yet simple, approaches to build our framework using layout information and text content extracted from documents, such as scholarly articles and RFP documents. The framework has four units: a Pre-processing Unit, an Annotation Unit, a Classification Unit, and a Semantic Annotation Unit. We developed state-of-the-art machine learning and deep learning architectures for section classification. We also explored and experimented with Latent Dirichlet Allocation (LDA) for semantic concept identification and with TextRank and TensorFlow Textsum models for document summarization. We mapped each section to a semantic name using a document ontology.
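
As a rough illustration of the LDA-based concept identification step, the sketch below fits a small topic model over extracted section texts and reports the top words of each section's dominant topic. It is not the thesis code; the section texts, topic count, and other parameters are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the thesis implementation) of surfacing
# the dominant concepts of each extracted section with gensim's LDA model.
from gensim import corpora, models
from gensim.utils import simple_preprocess

# Hypothetical section texts, e.g. produced by a pre-processing step.
sections = [
    "We describe the network architecture and the training procedure in detail.",
    "Related work on document segmentation and layout analysis is reviewed here.",
]

tokens = [simple_preprocess(text) for text in sections]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]

# Fit a small LDA model; num_topics and passes are assumptions for this sketch.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Report the top words of the dominant topic for each section.
for i, bow in enumerate(corpus):
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda x: x[1])
    top_words = [word for word, _ in lda.show_topic(topic_id, topn=5)]
    print(f"Section {i}: {top_words}")
```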

We aimed to develop a generic and domain-independent framework, using scholarly articles from the arXiv repository and RFP documents from RedShred. We evaluated the performance of our framework using evaluation metrics such as precision, recall, and F1-score, and we also analyzed and visualized the results in the embedding space. We released a dataset of information about a collection of scholarly articles from the arXiv e-prints that includes a wide range of metadata for each article, including a TOC, section labels, section summaries, and more.
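
For the kind of section-label evaluation described above, precision, recall, and F1-score can be computed as in the following sketch; the gold and predicted labels shown are made up for illustration and are not from the thesis dataset.

```python
# Minimal sketch of evaluating predicted section labels with scikit-learn;
# the label lists below are hypothetical examples.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["abstract", "introduction", "methods", "results", "conclusion"]
y_pred = ["abstract", "introduction", "results", "results", "conclusion"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```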


language understanding, learning, natural language processing

PhdThesis

University of Maryland, Baltimore County

