Dynamic Data Assimilation for Topic Modeling (DDATM)

Understanding how a particular discipline such as climate science evolves over time has received renewed interest. By understanding this evolution, predicting the future direction of the discipline becomes more achievable. Dynamic Topic Modeling (DTM) has been applied to a number of disciplines to model topic evolution as a means to learn how a particular scientific discipline and its underlying concepts are changing. Understanding how a discipline evolves, and its internal and external influences, can be complicated by how the information retrieved over time is integrated. There are different techniques used to integrate sources of information, however, less research has been dedicated to understanding how to integrate these sources over time. Data assimilation is commonly used in a number of scientific disciplines to both understand and make predictions of various phenomena, using numerical models and assimilated observational data over time.

This dissertation introduces a novel algorithm for scientific data assimilation, called Dynamic Data Assimilation for Topic Modeling (DDATM), which uses a new crossdomain divergence method and DTM. By using DDATM, observational data, in the form of full-text research papers, can be assimilated over time starting from an initial model. DDATM can be used as a way to integrate data from multiple sources and, due to its robustness, can exploit the assimilating observational information to better tolerate missing model information. When compared with a model built using DTM, the DDATM model produces topics with better characteristics, particularly for scientific data. The DDATM method is suitable for prediction and results in higher likelihood for subsequent documents. DDATM is able to overcome missing information during the assimilation process when compared with a DTM model. The cross-domain divergence method generalizes as a method that can also bring together multiple disciplines into one cohesive model enabling the identification of related concepts and documents across disciplines and time periods. Finally, grounding the topic modeling process with an ontology improves the quality of the topics for scientific data and enables a more refined understanding of concept relatedness and cross-domain influence.

The results of this dissertation are demonstrated and evaluated by applying DDATM to 30 years of reports from the Intergovernmental Panel on Climate Change (IPCC) along with more than 150,000 documents that they cite to show the evolution of the physical basis of climate change.

climate change, learning, natural language processing, topic modelling

PhdThesis

University of Maryland, Baltimore County

UMBC ebiquity