Dataset
Structural Metadata from ArXiv Articles
September 1, 2017
ZIP Compressed File
The data set contains metadata extracted from more than one million arXiv articles that were put online before the end of 2016.
Number of Instances: 1107138 arXiv articles
Size: 566 megabytes, compressed
Area: NLP and Machine Learning
Attribute Characteristics: String/Text
Associated Tasks: Classification and Clustering
Date Released: 2017-09-01
Source: arXiv repository
File format: JSON
The JSON file contains information on 1,107,138 arXiv articles put online during or before 2016. Each top-level key in the JSON file is an arXiv article id. For each article, the following information is given.
key | type | description |
---|---|---|
category | list | Category names for the article |
title | string | Title of the article |
authors | list | Names of all the authors |
summary | string | A short summary of the article |
link | string | Link to the original arXiv article for full contents |
published_date | string | Date of the publication |
toc | list | The table of contents for the article. This is a nested list, up to a few levels of subsections. Some articles may not have a toc. Each title in the toc represents a section, subsection, or sub-subsection header, and children represents the next-level section headers |
You can view some examples of the JSON objects here.
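As a rough illustration of the layout described in the table above, the sketch below loads the file and flattens one article's toc into an indented outline. Several things here are assumptions rather than facts from this page: the helper names `load_articles` and `walk_toc` are hypothetical, each toc entry is assumed to be an object with a `title` string and an optional `children` list, and the sample record is made up for demonstration only. Check a few real records before relying on it.

```python
from __future__ import annotations

import json
from typing import Any, Iterator


def load_articles(path: str) -> dict[str, dict[str, Any]]:
    """Load the extracted JSON file; top-level keys are arXiv article ids."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def walk_toc(entries: list[Any] | None, depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, title) pairs from a nested toc, depth-first."""
    for entry in entries or []:
        # Assumption: each entry is a dict with "title" and optional "children".
        if not isinstance(entry, dict):
            continue
        yield depth, str(entry.get("title", ""))
        yield from walk_toc(entry.get("children"), depth + 1)


# Made-up record for illustration only; it is not taken from the dataset.
sample_articles = {
    "0000.00000": {
        "category": ["cs.CL"],
        "title": "Placeholder title",
        "authors": ["A. Author"],
        "summary": "Placeholder summary.",
        "link": "http://arxiv.org/abs/0000.00000",
        "published_date": "2016-01-01",
        "toc": [
            {"title": "Introduction", "children": []},
            {
                "title": "Method",
                "children": [
                    {"title": "Model", "children": []},
                    {"title": "Training", "children": []},
                ],
            },
        ],
    }
}

for article_id, record in sample_articles.items():
    print(article_id, record["title"])
    # Some articles have no toc, so fall back to an empty list.
    for depth, title in walk_toc(record.get("toc")):
        print("  " * (depth + 1) + title)
```

With the real data, `load_articles` would be called on the path of the JSON file extracted from the ZIP archive. At roughly 566 megabytes compressed, the decoded file may need several gigabytes of memory when loaded in one call.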
If you write about or use this dataset, please cite: Muhammad Mahbubur Rahman and Tim Finin. 2017. Deep Understanding of a Document's Structure. In Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT '17). ACM, New York, NY, USA, 63-73. DOI: https://doi.org/10.1145/3148055.3148080
For more information, please contact mrahman1@umbc.edu
Assertions
- (Publication) Deep Understanding of a Document's Structure has the dataset (Resource) Structural Metadata from ArXiv Articles