UMBC ebiquity
Machine Learning

Archive for the 'Machine Learning' Category

Clare Grasso: Information Extraction from Dirty Notes for Clinical Decision Support

May 11th, 2015, by Tim Finin, posted in Machine Learning, NLP, Ontologies, Semantic Web

Information Extraction from Dirty Notes
for Clinical Decision Support

Clare Grasso

10:00am Tuesday, 12 May 2015, ITE346

The term clinical decision support (CDS) refers broadly to providing clinicians or patients with computer-generated clinical knowledge and patient-related information, intelligently filtered or presented at appropriate times, to enhance patient care. It is estimated that at least 50% of the clinical information describing a patient’s current condition and stage of therapy resides in the free-form text portions of the Electronic Health Record (EHR). Both linguistic and statistical natural language processing (NLP) models assume the presence of a formal underlying grammar in the text. Yet clinical notes are often filled with overloaded and nonstandard abbreviations, sentence fragments, and creative punctuation that make it difficult for grammar-based NLP systems to work effectively. This research investigates scalable machine learning and semantic techniques that do not rely on an underlying grammar to extract medical concepts from the text, so that they can be applied in CDS on commodity hardware and software systems. Additionally, by packaging the extracted data within a semantic knowledge representation, the facts can be combined with other semantically encoded facts and reasoned over to help inform clinicians in their decision making.

Mid-Atlantic Student Colloquium on Speech, Language & Learning, Fri. 1/30

January 25th, 2015, by Tim Finin, posted in Machine Learning, NLP

The fourth Mid-Atlantic Student Colloquium on Speech, Language and Learning (MASC-SLL) will be held at JHU this coming Friday, January 30. It’s a good opportunity to sample current research on language technology and machine learning, including the work of a number of UMBC students. The program for the one-day colloquium includes oral presentations, poster sessions, a panel and three breakout sessions.

The event is free and open to all, but registration is requested by Tuesday, January 27. Note that the location has been moved to the Glass Pavilion on the JHU Homewood Campus.

Facebook releases GPU-optimized deep learning tools

January 17th, 2015, by Tim Finin, posted in AI, High performance computing, Machine Learning

Facebook’s AI Research (FAIR) group has released open-source, optimized deep-learning modules for Torch, an open-source development environment for numerics, machine learning, and computer vision, with a particular emphasis on deep learning and convolutional nets.

The release includes GPU-optimized modules for large convolutional nets and networks with sparse activations that are commonly used in NLP applications.

See fbcunn for installation instructions, documentation and examples to train classifiers and iTorch for an IPython Kernel for Torch.

PhD defense: Varish Mulwad — Inferring the Semantics of Tables

December 29th, 2014, by Tim Finin, posted in KR, Machine Learning, NLP, Ontologies, Semantic Web


Dissertation Defense

TABEL — A Domain Independent and Extensible Framework
for Inferring the Semantics of Tables

Varish Vyankatesh Mulwad

8:00am Thursday, 8 January 2015, ITE325b

Tables are an integral part of documents, reports and Web pages in many scientific and technical domains, compactly encoding important information that can be difficult to express in text. Table-like structures outside documents, such as spreadsheets, CSV files, log files and databases, are widely used to represent and share information. However, tables remain beyond the scope of regular text processing systems which often treat them like free text.

This dissertation presents TABEL, a domain independent and extensible framework to infer the semantics of tables and represent them as RDF Linked Data. TABEL captures the intended meaning of a table by mapping header cells to classes, data cell values to existing entities and pairs of columns to relations from a given ontology and knowledge base. The core of the framework consists of a module that represents a table as a graphical model to jointly infer the semantics of headers, data cells and relations between headers. We also introduce a novel Semantic Message Passing scheme, which incorporates semantics into message passing, to perform joint inference over the probabilistic graphical model. We also develop and explore a “human-in-the-loop” paradigm, presenting plausible models of user interaction with our framework and its impact on the quality of inferred semantics.
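To make the target representation concrete, here is a minimal sketch of the final step only: turning an already-inferred interpretation of a table into RDF-style triples. It does not reproduce TABEL's graphical model or its Semantic Message Passing. The function name, the argument layout and the `ex:` identifiers are all hypothetical and chosen for illustration; only the `rdf:type` URI is a real one.

```python
# The standard rdf:type property URI.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def table_to_triples(rows, column_classes, cell_entities, column_relations):
    """Emit (subject, predicate, object) triples for an interpreted table.

    All three mappings are assumed to come from some earlier inference step:
      column_classes   : column index -> class for that column's header
      cell_entities    : (row index, column index) -> entity for that cell
      column_relations : (column index, column index) -> relation between them
    """
    triples = []
    for r, row in enumerate(rows):
        # Each linked cell is typed with the class inferred for its column.
        for c in range(len(row)):
            entity = cell_entities.get((r, c))
            cls = column_classes.get(c)
            if entity and cls:
                triples.append((entity, RDF_TYPE, cls))
        # Each related column pair yields one triple per row, when both
        # cells in that row were linked to entities.
        for (c1, c2), relation in column_relations.items():
            subj = cell_entities.get((r, c1))
            obj = cell_entities.get((r, c2))
            if subj and obj:
                triples.append((subj, relation, obj))
    return triples


# Hypothetical one-row table with made-up identifiers.
example = table_to_triples(
    rows=[["Baltimore", "Maryland"]],
    column_classes={0: "ex:City", 1: "ex:State"},
    cell_entities={(0, 0): "ex:Baltimore", (0, 1): "ex:Maryland"},
    column_relations={(0, 1): "ex:locatedIn"},
)
```

Cells that could not be linked to an entity are simply skipped here; a fuller system might emit them as literal values instead.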

We present techniques that are both extensible and domain agnostic. Our framework supports the addition of preprocessing modules without affecting existing ones, making TABEL extensible. It also allows background knowledge bases to be adapted and changed based on the domains of the tables, thus making it domain independent. We demonstrate the extensibility and domain independence of our techniques by developing an application of TABEL in the healthcare domain. We develop a proof of concept for an application to generate meta-analysis reports automatically, which is built on top of the semantics inferred from tables found in medical literature.

A thorough evaluation with experiments over datasets of tables from the Web and medical research reports shows promising results.

Committee: Drs. Tim Finin (chair), Tim Oates, Anupam Joshi, Yun Peng, Indrajit Bhattacharya (IBM Research) and L. V. Subramaniam (IBM Research)

Taming Wild Big Data

September 17th, 2014, by Tim Finin, posted in Database, Datamining, Machine Learning, RDF, Semantic Web

Jennifer Sleeman and Tim Finin, Taming Wild Big Data, AAAI Fall Symposium on Natural Language Access to Big Data, Nov. 2014.

Wild Big Data is data that is hard to extract, understand, and use due to its heterogeneous nature and volume. It typically comes without a schema, is obtained from multiple sources and provides a challenge for information extraction and integration. We describe a way of subduing Wild Big Data that uses techniques and resources that are popular for processing natural language text. The approach is applicable to data that is presented as a graph of objects and relations between them and to tabular data that can be transformed into such a graph. We start by applying topic models to contextualize the data and then use the results to identify the potential types of the graph’s nodes by mapping them to known types found in large open ontologies such as Freebase and DBpedia. The results allow us to assemble coarse clusters of objects that can then be used to interpret the links and perform entity disambiguation and record linking.

Rapalytics! Where Rap Meets Data Science

September 14th, 2014, by Tim Finin, posted in Machine Learning, NLP

UMBC Ebiquity Research Meeting

Rapalytics! Where Rap Meets Data Science

Abhay Kashyap

10:00am Wednesday, Sept. 17, 2014, ITE 346

For the Hip-Hop Fans: Remember the times when you had those long arguments with your friends about who the better rapper is? Remember how it always ended up in a stalemate because there was no evidence to back your argument? Well, look no further! Rapalytics is a one-stop site dedicated to extracting and presenting all the important analytics from Rap lyrics that separate a good rapper from a great one!

For the Data Science Nerds: Remember how indestructible your trained NLP tools were? Want to see how they act under pressure from text they have never seen before? Come take a look at how traditional NLP tools fare against text as complex as Rap and explore opportunities to design and build systems that handle much more than well-formed English text.

Free copy of Mining Massive Datasets

January 18th, 2014, by Tim Finin, posted in Big data, Datamining, Machine Learning, Semantic Web

A free PDF version of the new second edition of Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey Ullman is available. New chapters on mining large graphs, dimensionality reduction, and machine learning have been added. Related material from Professor Leskovec’s recent Stanford course on Mining Massive Data Sets is also available.

Google knowledge data releases

December 4th, 2013, by Tim Finin, posted in Google, Machine Learning, NLP

A post on Google’s research blog lists the major datasets for NLP and KB processing that Google has released in the past year. They include datasets to help in entity linking, relation extraction, concept spotting and syntactic analysis. Subscribe to the Knowledge Data Releases mailing list for updates.

Mid-Atlantic Student Colloquium on Speech, Language and Learning, 2013-10-11

August 2nd, 2013, by Tim Finin, posted in AI, Machine Learning, NLP

The third Mid-Atlantic Student Colloquium on Speech, Language and Learning will be held at UMBC on Fri. 11 Oct 2013, bringing together students, postdocs, faculty and researchers from universities in the Mid-Atlantic area doing research on speech, language or machine learning. It is an opportunity for students and postdocs to present preliminary, ongoing or completed work and to network with other researchers working in related fields.

The first MASC-SLL was held in 2011 at Johns Hopkins University and the second in 2012 at the University of Maryland, College Park. This year the event will be held at the University of Maryland, Baltimore County (UMBC) in Baltimore, MD from 9:30 to 5:00 on Friday, 11 October 2013. There will be no registration charge and lunch and refreshments will be provided.

Students and postdocs are encouraged to submit abstracts describing ongoing, planned, or completed research projects, including previously published results and negative results. Research in any field applying computational methods to any aspect of human language, including speech and learning, from all areas of computer science, linguistics, engineering, neuroscience, information science, and related fields, is welcome. All accepted submissions will be presented as posters and some will also be invited for short oral presentations. Student-led breakout sessions will also be held to discuss papers or topics of interest and stimulate interaction and discussion. Suggest breakout session topics via easychair.

UMBC WebBase corpus of 3B English words

May 1st, 2013, by Tim Finin, posted in Machine Learning, NLP, Semantic Web

The UMBC WebBase corpus is a dataset of high quality English paragraphs containing over three billion words derived from the Stanford WebBase project’s February 2007 Web crawl. Compressed, its size is about 13GB. We have found it useful for building statistical language models that characterize English text found on the Web.

The February 2007 Stanford WebBase crawl is one of their largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job in extracting textual content from HTML tags but there are still many instances of text duplications, truncated texts, non-English texts and strange characters.

We processed the collection to remove undesired sections and produce high quality English paragraphs. We detected paragraphs using heuristic rules and only retained those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good quality English.
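The filtering steps described above can be sketched roughly as follows. This is an illustration, not the code we actually ran: the 200-character minimum and the twenty-word window come from the description, while the two ratio thresholds, the function names and the vocabulary argument are assumptions made for the example.

```python
import string

MIN_LENGTH = 200         # from the description: at least 200 characters
WORDS_TO_CHECK = 20      # from the description: inspect the first twenty words
MIN_ENGLISH_RATIO = 0.8  # assumption: the actual cutoff is not stated
MAX_PUNCT_RATIO = 0.15   # assumption: the actual cutoff is not stated


def looks_english(paragraph, vocabulary):
    """Check whether most of the first few words are in a known word list."""
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:WORDS_TO_CHECK]]
    words = [w for w in words if w]
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph):
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(1 for ch in paragraph if ch in string.punctuation) / len(paragraph)


def clean_paragraphs(paragraphs, vocabulary):
    """Apply the length, language, punctuation and duplicate filters in order."""
    seen = set()  # Python sets are hash tables, so this is the dedup step
    kept = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if len(text) < MIN_LENGTH:
            continue
        if not looks_english(text, vocabulary):
            continue
        if punctuation_ratio(text) > MAX_PUNCT_RATIO:
            continue
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

For a collection of this size, storing a fixed-size digest of each paragraph in the set rather than the full text would keep memory use down.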

The corpus is available as a 13G compressed tar file which is about 48G when uncompressed. It contains 408 files with paragraphs extracted from web pages, one to a line with blank lines between them. A second set of 408 files has the same paragraphs, but with the words tagged with their part of speech (e.g., The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT).
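A line in the tagged files can be split back into word and tag pairs with a few lines of code. The sketch below assumes only the word_TAG format shown in the example above; the function name is made up for illustration. It splits on the last underscore so that a word that itself contains an underscore keeps it.

```python
def parse_tagged_line(line):
    """Split a line of word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST underscore in the token.
        word, separator, tag = token.rpartition("_")
        if separator:
            pairs.append((word, tag))
        else:
            # No underscore at all: keep the token and leave the tag empty.
            pairs.append((token, None))
    return pairs


sample = "The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT"
tagged = parse_tagged_line(sample)
nouns = [word for word, tag in tagged if tag is not None and tag.startswith("NN")]
```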

The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.

Download the corpus from here.

Results of the 2013 Semantic Textual Similarity task

March 25th, 2013, by Tim Finin, posted in Machine Learning, NLP, Semantic Web

The results of the 2013 Semantic Textual Similarity task (STS) are out. We were happy to find that our system did very well on the core task, placing first out of the 35 participating teams. The three runs we submitted were ranked first, second and third in the overall summary score.

Congratulations are in order for Lushan Han and Abhay Kashyap, the two UMBC doctoral students whose research and hard work produced a very effective system.

The STS task

The STS core task is to take two sentences and to return a score between 0 and 5 representing how similar the sentences are, with a larger number meaning a higher similarity. Compared with word similarity, the definition of sentence similarity tends to be more difficult and different people may have different views.

The STS task provides a reasonable and interesting definition. More importantly, the Pearson correlation scores are about 0.90 [1] for human raters using Amazon Mechanical Turk on the 2012 STS gold standard datasets, almost the same as the inter-rater agreement level, 0.9026 [2], on the well-known Miller-Charles word similarity dataset. This shows that human raters largely agree on the definitions used in the scale.

  • 5: The sentences are completely equivalent, as they mean the same thing, e.g., “The bird is bathing in the sink” and “Birdie is washing itself in the water basin”.
  • 4: The sentences are mostly equivalent, but some unimportant details differ, e.g., “In May 2010, the troops attempted to invade Kabul” and “The US army invaded Kabul on May 7th last year, 2010”.
  • 3: The sentences are roughly equivalent, but some important information differs or is missing, e.g., “John said he is considered a witness but not a suspect.” and “‘He is not a suspect anymore.’ John said.”
  • 2: The sentences are not equivalent, but share some details, e.g., “They flew out of the nest in groups” and “They flew into the nest together”.
  • 1: The sentences are not equivalent, but are on the same topic, e.g., “The woman is playing the violin” and “The young lady enjoys listening to the guitar”.
  • 0: The sentences are on different topics, e.g., “John went horse back riding at dawn with a whole group of friends” and “Sunrise at dawn is a magnificent view to take in if you wake up early enough for it”.
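For readers who have not worked with it, the Pearson correlation used above to compare raters measures how closely two lists of scores rise and fall together, from -1 to 1. The following is a plain-Python sketch of the standard formula; it is not the official STS evaluation script, and the function name is ours.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / math.sqrt(spread_x * spread_y)
```

Because the measure ignores shifts and positive rescaling, a system whose scores sit on a different numeric range than the 0 to 5 gold scale can still correlate strongly as long as it orders and spaces the pairs in the same way.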

The STS datasets

There were 86 runs submitted from more than 35 teams. Each team could submit up to three runs over sentence pairs drawn from four datasets, which included the following.

  • Headlines (750 pairs): a collection of pairs of headlines mined from several news sources by European Media Monitor using the RSS feed, e.g., “Syrian rebels move command from Turkey to Syria” and “Free Syrian Army moves headquarters from Turkey to Syria”.
  • SMT (750 pairs): a collection of sentence pairs from the DARPA GALE program, where one sentence is the output of a machine translation system and the other is a reference translation provided by a human, for example, “The statement, which appeared on a website used by Islamists, said that Al-Qaeda fighters in Islamic Maghreb had attacked three army centers in the town of Yakouren in Tizi-Ouzo” and the sentence “the pronouncement released that the mujaheddin of al qaeda in islamic maghreb countries attacked 3 stations of the apostates in city of aekorn in tizi ouzou , which was posted upon the web page used by islamists”.
  • OnWN (561 pairs): a collection of sentence pairs describing word senses, one from OntoNotes and another from WordNet, e.g., “the act of advocating or promoting something” and “the act of choosing or selecting”.
  • FNWN (189 pairs): a collection of pairs of sentences describing word senses, one from FrameNet and another from WordNet, for example: “there exist a number of different possible events that may happen in the future. in most cases, there is an agent involved who has to consider which of the possible events will or should occur. a salient_entity which is deeply involved in the event may also be mentioned” and “doing as one pleases or chooses;”.

Our three systems

We used a different system for each of our allowed runs, PairingWords, Galactus and Saiyan. While they shared a lot of the same infrastructure, each used a different mix of ideas and features.

  • PairingWords was built using hybrid word similarity features derived from LSA and WordNet. It used a simple algorithm to pair words/phrases in two sentences and compute the average word similarity of the resulting pairs. It imposed penalties on words that were not matched, with the words weighted by their PoS and log frequency. No training data was used. An online demonstration system is available to experiment with the underlying word similarity model used by this approach.
  • Galactus used unigrams, bigrams, trigrams and skip bigrams derived from the two sentences and paired them with the highest similarity based on exact string match, corpus and WordNet based similarity metrics. These, along with contrast scores derived from antonym pairs, were used as features to train a support vector regression model to predict the similarity scores.
  • Saiyan was a fine-tuned version of Galactus which used domain specific features and training data to train a support vector regression model to predict the similarity scores. (Scores for FNWN were taken directly from the PairingWords run.)
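The word-pairing idea behind the first system can be illustrated with a much simplified sketch. It is not the PairingWords implementation: the similarity function is a toy lookup table standing in for the LSA and WordNet model, the penalty is a flat constant rather than one weighted by PoS and log frequency, and the threshold, names and 0 to 5 rescaling are all assumptions made for the example.

```python
import string

# Hypothetical similarity values; the real system used an LSA + WordNet model.
TOY_SIMILARITY = {
    frozenset(("bird", "birdie")): 0.9,
    frozenset(("bathing", "washing")): 0.7,
}


def toy_similarity(a, b):
    """Stand-in word similarity: 1.0 for identical words, else a table lookup."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def tokenize(sentence):
    tokens = (t.strip(string.punctuation).lower() for t in sentence.split())
    return [t for t in tokens if t]


def directed_score(source, target, sim, threshold=0.3, penalty=0.5):
    """Pair each source word with its most similar target word and average.

    Words whose best match falls below the threshold count as unmatched and
    subtract a flat penalty (an assumption made for this sketch).
    """
    if not source:
        return 0.0
    total = 0.0
    for word in source:
        best = max((sim(word, other) for other in target), default=0.0)
        total += best if best >= threshold else -penalty
    return total / len(source)


def sentence_similarity(first, second, sim=toy_similarity):
    """Average the two directed scores and rescale to the 0..5 STS range."""
    a, b = tokenize(first), tokenize(second)
    score = (directed_score(a, b, sim) + directed_score(b, a, sim)) / 2
    return 5.0 * max(0.0, min(1.0, score))
```

Scoring in both directions keeps a short sentence from looking identical to a longer one that merely contains it, since the extra words in the longer sentence go unmatched and are penalized.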

The results

Here’s how our three runs ranked (out of 86) on each of the four different data sets and on the overall task (mean).

                our three systems
dataset         PairingWords   Galactus   Saiyan
Headlines       3              7          1
OnWN glosses    4              11         35
FNWN glosses    1              3          2
SMT             8              11         16
mean            1              2          3

Over the next two weeks we will write a short system paper for *SEM 2013, the Second Joint Conference on Lexical and Computational Semantics.


[1] Eneko Agirre, Daniel Cer, Mona Diab and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proc. 6th Int. Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conf. on Lexical and Computational Semantics (*SEM 2012), Montreal, Canada.

[2] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in Proc. 14th Int. Joint Conf. on Artificial Intelligence, 1995.

Google Wikilinks corpus

March 8th, 2013, by Tim Finin, posted in Google, Machine Learning, NLP, Semantic Web

Google released the Wikilinks Corpus, a collection of 40M disambiguated mentions from 10M web pages to 3M Wikipedia pages. This data can be used to train systems that do entity linking and cross-document co-reference, problems that Google researchers attacked with an earlier version of this data (see Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models).

You can download the data as ten 175MB files, along with some additional tools, from UMASS.

This is yet another example of the important role that Wikipedia continues to play in building a common, machine-usable semantic substrate for human conceptualizations.
