UMBC ebiquity

Annotations of Cybersecurity blogs and articles

Description: This data is the result of a Master's thesis project by Ravendar Lal under the supervision of Dr. Tim Finin. It can be used to train systems that process technical text, such as named entity recognizers. The dataset consists of manually annotated data for the cybersecurity domain, collected from CVEs, Adobe Security Bulletins, Microsoft Security Bulletins and various blog posts. In total, the data has over 45,000 tokens and 5,000 tagged entities. Annotation was done by graduate students of the Computer Science Department who have good domain knowledge and understanding of the concepts. We used this dataset to train the Stanford Named Entity Recognizer (NER) as a cybersecurity entity and concept spotter, which identifies entities (such as software products and operating systems) and concepts (such as denial of service and buffer overflow) in jargon-heavy technical text. The data was tagged with the BRAT tool (http://brat.nlplab.org/index.html). See the README file for more about the function of each file and the classes present in the data.
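Since the data was tagged with BRAT, the annotations are presumably stored in BRAT's standoff format, where each text-bound annotation is a tab-separated line holding an ID, a type with character offsets, and the annotated text. The following is a minimal sketch of reading such lines, assuming the archive contains standard .ann files; the README remains the authoritative description of the actual files. The function name parse_brat_entities, the Entity type, and the sample labels "Software" and "Concept" are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str                   # annotation ID, e.g. "T1"
    label: str                    # entity class (labels below are hypothetical)
    spans: List[Tuple[int, int]]  # character offsets into the matching .txt file
    text: str                     # the annotated surface string


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Return the text-bound ("T") annotations found in BRAT standoff content."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations start with "T"; relations, events,
        # attributes and notes use other prefixes and are skipped here.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, ignore it
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        # Discontinuous annotations list several "start end" pairs
        # separated by semicolons.
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Made-up sample content, not taken from the dataset.
sample = (
    "T1\tSoftware 0 12\tAdobe Reader\n"
    "T2\tConcept 30 45\tbuffer overflow\n"
)

for entity in parse_brat_entities(sample):
    print(entity.label, entity.spans, entity.text)
```

In practice the content of each .ann file would be read from the extracted archive and paired with the .txt file of the same name, so that the offsets can be checked against the source text.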

Type: Dataset

Authors: Ravendar Lal

Date: May 30, 2013

Tags: cybersecurity, ner, natural language processing

Format: ZIP Compressed File

Number of downloads: 156

Access Control: Publicly Available

 

Available for download as


size: 765467 bytes