MS defense: Open Information Extraction for Code-Mixed Hindi-English Social Media Data

July 1st, 2018

MS Thesis Defense

Open Information Extraction for Code-Mixed Hindi-English Social Media Data

Mayur Pate

1:00pm Monday, 2 July 2018, ITE 325b, UMBC

Open-domain relation extraction (Angeli, Premkumar, & Manning 2015) is the process of finding relation triples in text. While a number of open information extraction (Open IE) systems are available for single-language text, traditional Open IE systems are not well suited to content that mixes multiple languages in a single utterance. In this thesis, we extended an existing code-mixed corpus (Das, Jamatia, & Gambäck 2015) by finding and annotating relation triples in Open IE fashion. Using this newly annotated corpus, we experimented with a seq2seq neural network (Zhang, Duh, & Van Durme 2017) for finding relation triples. As prerequisites for the relation extraction pipeline, we developed a part-of-speech tagger and a named entity and predicate recognizer for code-mixed content, experimenting with various approaches such as Conditional Random Fields (CRF), the averaged perceptron, and deep neural networks. To our knowledge, this is the first relation extraction system for any code-mixed natural language. We achieved promising results for all of the components, and they could be improved in the future with more code-mixed data.

Committee: Drs. Frank Ferraro (Chair), Tim Finin, Hamed Pirsiavash, Bryan Wilkinson