Dataset

Structural Metadata from ArXiv Articles

September 1, 2017

ZIP Compressed File

Tags: document understanding, learning, natural language processing

The data set contains metadata extracted from more than one million arXiv articles that were put online before the end of 2016.

Data Set Characteristics: Text
Number of Instances: 1107138 arXiv articles
Size: 566 megabytes, compressed
Area: NLP and Machine Learning
Attribute Characteristics: String/Text
Associated Tasks: Classification and Clustering
Date Released: 2017-09-01
Source: arXiv repository
File format: JSON

The JSON file contains information on 1,107,138 arXiv articles put online during or before 2016. Each top-level key in the JSON file is an arXiv article id. For each article, the following information is given.

key	type	description
category	list	Category names for the article
title	string	Title of the article
authors	list	Names of all the authors
summary	string	A short summary of the article
link	string	Link to the original arXiv article for full contents
published_date	string	Date of the publication
toc	list	The table of contents of the article, as a nested list up to a few levels of subsections. Some articles may not have a toc. Each title in the toc is a section, subsection, or sub-subsection header, and its children are the section headers at the next level down
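
The layout described in the table can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: it assumes each toc entry is an object with "title" and "children" keys (the table names both, but the exact shape is not confirmed here), and the sample record, its id, and all of its values are invented for illustration rather than taken from the dataset. The path passed to load_articles is likewise whatever name the extracted JSON file has on your machine.

```python
import json
from typing import Any, Dict, Iterator, List, Optional, Tuple


def load_articles(path: str) -> Dict[str, Dict[str, Any]]:
    """Load the whole JSON file; top-level keys are arXiv article ids."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_toc(
    entries: Optional[List[Dict[str, Any]]], depth: int = 0
) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) pairs from a nested toc, depth-first.

    Assumes each entry is a dict with "title" and "children" keys.
    A missing or empty toc yields nothing.
    """
    for entry in entries or []:
        yield depth, entry.get("title", "")
        yield from iter_toc(entry.get("children"), depth + 1)


# Invented record for illustration only; not taken from the dataset.
SAMPLE = {
    "0000.00000": {
        "category": ["cs.CL"],
        "title": "An Example Title",
        "authors": ["A. Author", "B. Author"],
        "summary": "A short summary.",
        "link": "https://arxiv.org/abs/0000.00000",
        "published_date": "2016-01-01",
        "toc": [
            {"title": "Introduction", "children": []},
            {"title": "Method", "children": [
                {"title": "Model", "children": []},
            ]},
        ],
    }
}

for article_id, record in SAMPLE.items():
    print(article_id, record["title"])
    # Indent each header by its nesting depth.
    for depth, title in iter_toc(record.get("toc")):
        print("  " * (depth + 1) + title)
```

Note that json.load reads the entire file into memory; for a file of this size a streaming JSON parser may be the more practical choice.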

You can view some examples of the JSON objects here.

If you write about or use this dataset, please cite:

Muhammad Mahbubur Rahman and Tim Finin. 2017. Deep Understanding of a Document's Structure. In Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT '17). ACM, New York, NY, USA, 63-73. DOI: https://doi.org/10.1145/3148055.3148080

For more information, please contact mrahman1@umbc.edu

Access Control: Public

Assertions

(Publication) Deep Understanding of a Document's Structure has the dataset (Resource) Structural Metadata from ArXiv Articles