CyberEnt: A Cybersecurity Domain Specific Dataset for Named Entity Recognition

Casey Hanks; Michael Maiden

UMBC Undergraduate Research and Creative Achievement Day

April 18, 2022

Named Entity Recognition (NER) is a critical component of automated knowledge extraction. It allows Natural Language Processing (NLP) models to label mentions of real-world entities that are important in the context of the text. To do this, an NLP model must be trained on large corpora of human-annotated text. General, domain-agnostic text corpora are available, but they are not suited to fields such as cybersecurity, which require domain-specific text for downstream tasks such as malware analysis. NLP for cybersecurity is an emerging field, and there is a pressing need for community-accessible datasets that can be used to train AI-based cybersecurity pipelines to extract meaningful insights from Cyber Threat Intelligence (CTI). Terabytes of CTI data are disclosed daily, making it nearly impossible for human analysts to sift through manually. Compared with domains such as medicine or law, the cybersecurity domain has few training datasets available. We have created a large CTI corpus and are actively using it to train and test supervised and semi-supervised cybersecurity NER models with the spaCy NLP framework. We also aim to develop methods that allow continuous integration of incoming, up-to-date CTI information.
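To make the idea of human-annotated NER text concrete, the following is a minimal sketch of how one annotated sentence might be stored as character-offset spans and converted to per-token BIO tags. The sentence, the label names (MALWARE, VULNERABILITY, SOFTWARE), the tokenizer regex, and the helper `to_bio` are all illustrative assumptions. They are not taken from the CyberEnt corpus or its annotation scheme.

```python
import re

# Hypothetical example sentence; not taken from the CyberEnt corpus.
TEXT = "Emotet exploits CVE-2017-11882 in Microsoft Office."

# Annotations as (start, end, label) character spans.
# The label names are assumptions chosen for illustration only.
ENTITIES = [
    (0, 6, "MALWARE"),
    (16, 30, "VULNERABILITY"),
    (34, 50, "SOFTWARE"),
]


def to_bio(text, entities):
    """Convert character-offset spans into (token, BIO tag) pairs.

    Tokens are word-like runs (hyphens allowed inside a token) or single
    punctuation marks. A token is tagged only if it lies fully inside a span.
    """
    tagged = []
    for match in re.finditer(r"\w[\w-]*|[^\w\s]", text):
        tag = "O"
        for start, end, label in entities:
            if start <= match.start() and match.end() <= end:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


if __name__ == "__main__":
    for token, tag in to_bio(TEXT, ENTITIES):
        print(f"{token}\t{tag}")
```

Character offsets keep the annotation independent of any particular tokenizer, while BIO tags are the per-token form that many sequence-labeling models consume.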

A presentation with VoiceThread audio is available.

This work was funded, in part, by a grant from the NSA through the On-Ramp program and by the National Science Foundation under Grant Number 2114892.

Tags: cybersecurity, entity extraction, natural language processing

Type: Misc
