Structural Metadata from ArXiv Articles

Muhammad Mahbubur Rahman

September 1, 2017

ZIP Compressed File

document understanding, learning, natural language processing

The data set contains metadata extracted from more than one million arXiv articles that were put online before the end of 2016.

Data Set Characteristics: Text
Number of Instances: 1107138 arXiv articles
Size: 566 megabytes, compressed
Area: NLP and Machine Learning
Attribute Characteristics: String/Text
Associated Tasks: Classification and Clustering
Date Released: 2017-09-01
Source: arXiv repository
File format: JSON

The JSON file contains information on 1,107,138 arXiv articles put online during or before 2016. Each top-level key in the JSON file is an arXiv article id. For each article, the following information is given.


key             type    description
category        list    Category names for the article
title           string  Title of the article
authors         list    Names of all of the authors
summary         string  A short summary of the article
link            string  Link to the original arXiv article for full contents
published_date  string  Date of publication
toc             list    The table of contents for the article. This is a nested list, up to a few levels of subsections. Some articles may not have a toc. Each title in the toc represents a section/subsection/sub-subsection header, and children represents the next level of section headers
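
The layout described above can be read with a few lines of Python. This is a minimal sketch, not code from the dataset's authors. It assumes that each toc entry is an object with a "title" string and an optional "children" list, which is inferred from the description and should be checked against the actual file. The sample record and its id are hand-made for illustration, and any path passed to load_articles is a placeholder for wherever the extracted JSON file is stored.

```python
import json
from collections import Counter


def load_articles(path):
    """Load the dataset: one JSON object mapping arXiv article id -> record."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def walk_toc(toc, depth=0):
    """Yield (depth, title) pairs from a nested table of contents.

    Assumption: each entry is a dict with a "title" string and an optional
    "children" list. Bare strings are also accepted, as a precaution.
    """
    for entry in toc or []:  # some articles have no toc at all
        if isinstance(entry, dict):
            title = entry.get("title")
            if title is not None:
                yield depth, title
            yield from walk_toc(entry.get("children"), depth + 1)
        elif isinstance(entry, str):
            yield depth, entry


def category_counts(articles):
    """Count how many articles carry each category name."""
    counts = Counter()
    for record in articles.values():
        counts.update(record.get("category") or [])
    return counts


# A hand-made record in the assumed shape (not taken from the dataset).
sample = {
    "hypothetical-id": {
        "category": ["cs.CL"],
        "title": "An Example Title",
        "authors": ["A. Author"],
        "toc": [
            {"title": "Introduction", "children": []},
            {"title": "Method", "children": [{"title": "Model", "children": []}]},
        ],
    }
}

for article_id, record in sample.items():
    # Print the section headers, indented by nesting depth.
    for depth, title in walk_toc(record.get("toc")):
        print("  " * depth + title)
```

Because the whole file is a single JSON object, json.load reads it into memory at once. For a file of this size, a streaming parser may be preferable on machines with little RAM.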

You can view some examples of the json objects here

If you write about or use this dataset, please cite:

Muhammad Mahbubur Rahman and Tim Finin. 2017. Deep Understanding of a Document's Structure. In Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT '17). ACM, New York, NY, USA, 63-73. DOI:"

For more information, please contact


