UMBC ebiquity

Understanding the Logical and Semantic Structure of Large Documents

Authors: Muhammad Mahbubur Rahman and Tim Finin

Book Title: SIAM International Conference on Data Mining (SDM17)

Date: April 27, 2017

Abstract: Up-to-the-minute language understanding approaches are mostly focused on small documents such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents such as legal documents, reports, proposals, technical manuals and research articles is still a challenging task. The reason behind this challenge is that the documents may be multi-themed, complex and cover diverse topics. For example, business opportunities may contain information on the background of the business, product or service of the business, plan, team management, financial or budget related data, competitors, logistics, compliance, legal information and boilerplate content that is repeated across documents. The content can be split into multiple files or aggregated into one large file. As a result, the content in the whole document may have different structures and formats. Furthermore, the information is expressed in different forms such as paragraphs of text, headers, data forms, tables, images, mathematical equations, lists or a nested combination of these structures.

Type: Proceedings

Publisher: SIAM

Tags: learning, natural language processing
