]>
Recently, NIH has conducted a number of Genome-Wide Association Studies (GWAS) that produced massive datasets containing subjects’ genetic makeup, labeled with clinical data including the occurrence of chronic diseases. Unfortunately, given the relatively small number of patients in such studies and the vast number of genes in the human genome, these datasets cannot be analyzed with traditional statistical predictive models.
Traditional models require a large number of samples (patients) with very few features per sample. My work attempts to solve this problem by employing state-of-the-art Machine Learning techniques. Over the past year I have built a software system capable of processing multi-terabyte-scale datasets, refactoring the NIH data into a form palatable to modern Big Data systems. I have run the initial stages of feature selection. I will present the current state of the work and future plans. Another goal of this work is to ensure the repeatability of the experiments and the flexibility to run against any similar dataset from current and future studies.
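To make the feature-selection step concrete, here is a minimal sketch of univariate feature screening for a "many features, few samples" dataset: score each feature by its correlation with the binary disease label and keep only the top-k features before fitting any predictive model. All names and data here are illustrative assumptions, not the actual NIH datasets or the system's real pipeline.

```python
import random

def point_biserial(feature, labels):
    """Correlation between a numeric feature column and a 0/1 label."""
    n = len(labels)
    n1 = sum(labels)          # count of positive (disease) labels
    n0 = n - n1               # count of negative labels
    if n0 == 0 or n1 == 0:
        return 0.0
    mean1 = sum(x for x, y in zip(feature, labels) if y == 1) / n1
    mean0 = sum(x for x, y in zip(feature, labels) if y == 0) / n0
    mean = sum(feature) / n
    var = sum((x - mean) ** 2 for x in feature) / n
    if var == 0:
        return 0.0
    return (mean1 - mean0) * ((n1 * n0) / n**2) ** 0.5 / var**0.5

def select_top_k(X, y, k):
    """X: list of samples, each a list of feature values; y: 0/1 labels.
    Returns the indices of the k features most correlated with y."""
    p = len(X[0])
    scores = []
    for j in range(p):
        column = [row[j] for row in X]
        scores.append((abs(point_biserial(column, y)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Illustrative use: 20 subjects, 51 features, where feature 0 happens
# to track the label; the screen should rank it at the top.
random.seed(0)
y = [i % 2 for i in range(20)]
X = [[float(label)] + [random.random() for _ in range(50)] for label in y]
selected = select_top_k(X, y, 5)
```

In a real GWAS setting this screen would run distributed over the full multi-terabyte dataset; the point here is only the shape of the computation, not its scale.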
]]>