UMBC ebiquity

Structural Metadata from ArXiv Articles


The data set contains metadata extracted from more than one million arXiv articles that were put online before the end of 2016.

Data Set Characteristics: Text
Number of Instances: 1107138 arXiv articles
Size: 566 megabytes, compressed
Area: NLP and Machine Learning
Attribute Characteristics: String/Text
Associated Tasks: Classification and Clustering
Date Released: 2017-09-01
Source: arXiv repository
File format: JSON

The JSON file contains information on 1,107,138 arXiv articles put online during or before 2016. Each top-level key in the JSON file is an arXiv article ID. For each article, the following information is given.


key             type    description
category        list    Category names for the article
title           string  Title of the article
authors         list    Names of all of the authors
summary         string  A short summary of the article
link            string  Link to the original arXiv article for full contents
published_date  string  Date of publication
toc             list    The table of contents for the article. This is a nested list, up to a few levels of subsections. Some articles may not have a toc. Each title in the toc represents a section, subsection, or sub-subsection header, and its children represent the next level of section headers
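
As a rough illustration of how a record with the keys described above might be read, here is a minimal Python sketch. It rests on assumptions that the description does not confirm: that each toc entry is an object with "title" and "children" fields, and that a missing toc appears as an absent or empty value. The article ID, link, date format, and all other values in the sample are made-up placeholders, not data from the file.

```python
import json

# Made-up record mirroring the documented keys. The ID and every value
# are placeholders; the "title"/"children" layout of toc is an assumption.
SAMPLE_JSON = """
{
  "0000.00000": {
    "category": ["cs.CL"],
    "title": "An Example Article",
    "authors": ["A. Author", "B. Author"],
    "summary": "A short summary.",
    "link": "http://example.org/placeholder",
    "published_date": "2016-01-01",
    "toc": [
      {"title": "Introduction", "children": []},
      {"title": "Method", "children": [
        {"title": "Model", "children": []}
      ]}
    ]
  }
}
"""


def flatten_toc(toc, depth=0):
    """Yield (depth, title) pairs from a nested toc in reading order.

    Accepts None or an empty list, since some articles may have no toc.
    """
    for entry in toc or []:
        yield depth, entry.get("title", "")
        # Recurse into the next level of section headers, if any.
        yield from flatten_toc(entry.get("children"), depth + 1)


# Top-level keys are article IDs; each value holds that article's metadata.
articles = json.loads(SAMPLE_JSON)
for article_id, record in articles.items():
    print(article_id, record["title"], record["category"])
    for depth, heading in flatten_toc(record.get("toc")):
        print("  " * (depth + 1) + heading)
```

For the real file, the same loop would apply after replacing json.loads(SAMPLE_JSON) with json.load on the extracted file. Because the full file holds more than a million records, loading it in one call may need a substantial amount of memory.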

You can view some examples of the JSON objects here

Relevant Paper: Muhammad Rahman and Tim Finin, "Understanding the Logical and Semantic Structure of Large Documents", University of Maryland, Baltimore County

For more information, please contact

Type: Dataset

Authors: Muhammad Mahbubur Rahman

Date: September 01, 2017

Tags: document understanding, natural language processing, learning

Format: ZIP Compressed File

Number of downloads: 49

Access Control: Publicly Available


Available for download as

(remote link)