Dataset

Annotations of Cybersecurity blogs and articles

May 30, 2013

765467 bytes

ZIP Compressed File

Tags: cybersecurity, natural language processing, ner

This data is the result of a Master's thesis project by Ravendar Lal under the supervision of Dr. Tim Finin. It can be used to train systems that process technical text. The dataset consists of manually annotated data for the cybersecurity domain, collected from CVEs, Adobe Security Bulletins, Microsoft Security Bulletins and various blog posts. In total, the data contains over 45,000 tokens and 5,000 tagged entities. Annotation was done by graduate students of the Computer Science Department who have good domain knowledge and understanding of the concepts. We used this dataset to train the Stanford Named Entity Recognizer (NER) as a cybersecurity entity and concept spotter, which identifies entities (like software products and operating systems) and concepts (like denial of service and buffer overflow) in technical jargon. The data was tagged with the BRAT tool (http://brat.nlplab.org/index.html). See the included readme file for more about the function of each file and the classes present in the dataset.
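Since the data was tagged with BRAT, the annotations are presumably stored in BRAT's standoff format, where each text-bound annotation is a tab-separated line holding an ID, a type with character offsets, and the covered text. The following is a minimal sketch of reading such lines in Python. The function name, the sample sentence, and the labels `Software` and `Concept` are hypothetical and used only for illustration; the actual file layout and class names are described in the readme file.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # entity class (hypothetical names below)
    spans: List[Tuple[int, int]]  # character offsets into the source text
    text: str                     # surface string covered by the annotation


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from BRAT standoff content."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations start with "T"; relations, events,
        # attributes and notes use other prefixes and are skipped here.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        # Discontinuous annotations separate fragments with ";".
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Hypothetical sample: a sentence and its matching standoff annotations.
sample_txt = "Adobe Reader is vulnerable to a buffer overflow."
sample_ann = (
    "T1\tSoftware 0 12\tAdobe Reader\n"
    "T2\tConcept 32 47\tbuffer overflow\n"
)

for entity in parse_brat_entities(sample_ann):
    start, end = entity.spans[0]
    print(entity.label, repr(sample_txt[start:end]))
```

The offsets index into the accompanying plain-text file, so slicing the text with each span should reproduce the annotated string. This is a quick way to check that an annotation file and its text file belong together.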

Counter: 1394 downloads

Access Control: Public
