Information Extraction of Security-Related Entities and Concepts from Unstructured Text

Cybersecurity has been a major concern, especially over the past decade, in which targets ranging from large numbers of internet users to government agencies have been attacked through vulnerabilities present in their systems. Although these vulnerabilities are identified and published publicly, the response in patching them has been slow, because there is no automatic mechanism for understanding and processing the unstructured text published on the internet. Our system tackles this problem by identifying security-related terms, including entities and concepts, in various unstructured data sources. This information extraction task will help speed up the process of understanding and recognizing vulnerabilities, and thus allow systems to be secured more quickly. This work describes a system that automatically extracts these terms from cybersecurity blogs and security bulletins using Natural Language Processing (NLP) and text mining methods. Our NLP model is trained on manually annotated data drawn from open-ended blogs and from more structured text such as companies' official security bulletins. To our knowledge, this manually annotated data is the first of its kind, since no previous work has used this methodology, and it can be a significant contribution for people working in this domain. Our named entity recognition model is trained using Stanford NER, which is based on conditional random fields (CRFs). This automated system can help administrators in organizations and governments prioritize the task of strengthening their security and, moreover, track unofficial data sources such as chat rooms and Twitter for zero-day attacks.
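
As an illustration of how manually annotated text can be prepared for a CRF-based tagger such as Stanford NER, the following minimal sketch writes token-level annotations in the two-column, tab-separated layout (one token and its label per line, with a blank line between sentences) that Stanford NER accepts as training input. The function name, the example sentence, and the label names SOFTWARE and VULNERABILITY are assumptions made for this sketch; they are not taken from the thesis, whose actual annotation scheme is not given here.

```python
from typing import List, Tuple

# One annotated sentence is a list of (token, label) pairs.
# "O" marks tokens outside any entity; the other label names are hypothetical.
Sentence = List[Tuple[str, str]]


def to_stanford_ner_tsv(sentences: List[Sentence]) -> str:
    """Serialize annotated sentences as 'token<TAB>label' lines.

    Each sentence is followed by a blank line so that sentence
    boundaries are preserved in the training file.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # blank line closes the sentence
    return "\n".join(lines)


# Hypothetical annotated sentence, for illustration only.
EXAMPLE: List[Sentence] = [
    [
        ("Adobe", "SOFTWARE"),
        ("Reader", "SOFTWARE"),
        ("allows", "O"),
        ("a", "O"),
        ("buffer", "VULNERABILITY"),
        ("overflow", "VULNERABILITY"),
        (".", "O"),
    ]
]

if __name__ == "__main__":
    print(to_stanford_ner_tsv(EXAMPLE))
```

A file produced this way could then be passed to the tagger's training step; the feature settings and label inventory used in the thesis itself would need to be taken from the full text.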



Master's Thesis

University of Maryland Baltimore County

Ebiquity Lab

