
Abstract: The data set contains processed information for arXiv articles up to December 2016. The total number of articles is 1107138.

Data Set Characteristics: Text
Number of Instances: 1107138 arXiv articles
Area: NLP and Machine Learning
Attribute Characteristics: String/Text
Associated Tasks: Classification and Clustering
Date Released: 09-01-2017

Source: arXiv repository

File format: JSON

Data Set Description:
The JSON file contains information about 1107138 arXiv articles up to December 2016.
Each top-level key in the JSON file is an arXiv article id. For each article, the following information is given.

key              datatype        description
===              ========        ===========
category         list            Category names for the article.
title            string          Title of the article.
authors          list            Names of all the authors.
summary          string          A short summary of the article.
link             string          Link to the original arXiv article for the full contents.
published_date   string          Date of publication.
toc              list            The table of contents for the article. This is a nested list, up to a few levels of subsections. Some articles may not have a toc.
                                 Each title in the toc represents a section/subsection/sub-subsection header, and children represents the next-level section headers.
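
Example (illustrative sketch, not part of the data set): the Python snippet below shows one way the file could be loaded and a nested toc flattened into (depth, title) pairs. It assumes that each toc entry is a JSON object with a "title" key and an optional "children" list, which is one reading of the description above; the exact layout should be checked against the actual file. The names load_articles and walk_toc, the article id "0000.00000", and all values in the sample record are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, List, Optional, Tuple


def load_articles(path: str) -> Dict[str, Dict[str, Any]]:
    """Load the whole JSON file into a dict keyed by arXiv article id."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def walk_toc(toc: Optional[List[Dict[str, Any]]], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) for every header in a nested toc.

    Assumption: each entry looks like {"title": ..., "children": [...]}.
    A missing or empty toc yields nothing.
    """
    for entry in toc or []:
        yield depth, entry.get("title", "")
        # Recurse into the next level of section headers, if any.
        yield from walk_toc(entry.get("children"), depth + 1)


# Made-up record for illustration only; it is NOT taken from the data set.
sample = {
    "0000.00000": {
        "category": ["cs.CL"],
        "title": "An Example Title",
        "authors": ["A. Author", "B. Author"],
        "summary": "A placeholder summary.",
        "toc": [
            {"title": "Introduction", "children": []},
            {
                "title": "Method",
                "children": [
                    {"title": "Model", "children": []},
                    {"title": "Training", "children": []},
                ],
            },
        ],
    }
}

for article_id, article in sample.items():
    print(article_id, article["title"])
    for depth, header in walk_toc(article.get("toc")):
        print("  " * (depth + 1) + header)
```

With the real file, the sample dict would be replaced by a call such as load_articles("arxiv.json"), where the file name is a placeholder. Since the file describes more than a million articles, loading it in one call may need a considerable amount of memory.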

Relevant Paper:
Muhammad Rahman, Tim Finin, "Understanding the Logical and Semantic Structure of Large Documents"

For more information, please use the contact address below.
Contact: mrahman1@umbc.edu







