April 1, 2018
Format: TAR.GZIP compressed archive
This is a word embedding model produced by Dr. Youngja Park of IBM Research by applying word2vec to a collection of one million cybersecurity-related documents found on the Web. The text was tokenized on whitespace, yielding 917,213,530 tokens, of which 6,417,554 were unique. The word2vec model has 100 dimensions and a vocabulary of 1,013,092 terms. If you use this model in your research, please cite the following paper.
Ankur Padia, Arpita Roy, Taneeya Satyapanich, Francis Ferraro, Shimei Pan, Youngja Park, Anupam Joshi and Tim Finin, UMBC at SemEval-2018 Task 8: Understanding Text about Malware, 12th International Workshop on Semantic Evaluation, co-located with NAACL HLT 2018, June 2018, New Orleans, LA, USA.
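For readers who want to try the embeddings, here is a minimal sketch of loading vectors and finding a term's nearest neighbors by cosine similarity. It assumes the vectors have been exported to the plain-text word2vec format (a header line with the vocabulary size and dimension, then one term per line followed by its values); this page does not state the format of the files inside the archive, so that is an assumption. The function names and the toy 3-dimensional data are hypothetical and stand in for the real 100-dimensional model.

```python
import io
import math


def load_word2vec_text(stream):
    """Parse plain-text word2vec data (assumed format, see lead-in):
    a header line 'vocab_size dim', then 'term v1 v2 ... v_dim' per line."""
    header = stream.readline().split()
    _vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, term, topn=3):
    """Return the topn (term, similarity) pairs closest to `term`."""
    query = vectors[term]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy data: 4 terms, 3 dimensions (the real model has 100).
TOY = (
    "4 3\n"
    "malware 1.0 0.0 0.0\n"
    "trojan 0.9 0.1 0.0\n"
    "firewall 0.0 1.0 0.0\n"
    "patch 0.0 0.0 1.0\n"
)

vectors, dim = load_word2vec_text(io.StringIO(TOY))
print(most_similar(vectors, "malware", topn=1))
```

With the real file, the `io.StringIO` stream would be replaced by an opened file handle. A pure-Python dictionary of lists is likely to be slow and memory-hungry for a vocabulary of about one million terms, so a library loader such as gensim's `KeyedVectors.load_word2vec_format` may be a better fit if the files are in word2vec text or binary format.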