University of Minnesota

Initiatives in Digital Technology: 2007–08 Funded Proposals

Vipin Kumar, Fred Kulack, Michael Steinbach, Richard Mushlin

Data Mining for Connecting Genomic Data and Disease

One of the important potential benefits of the genetic revolution is the possibility of personalized medicine, i.e., using detailed genomic information about a person for the detection, treatment, or prevention of disease. The recent availability of individual genomic information, typically in the form of Single Nucleotide Polymorphisms (SNPs), offers one route for making this possibility a reality. In particular, the increasing availability of SNP data has created opportunities for discovering important connections between disease and genomic factors. Although currently available techniques have had some success in finding such connections, they have a number of limitations and are most useful for finding connections involving only one or two SNPs. This proposal describes a program to develop and apply data mining techniques to find more general patterns that capture connections between SNPs and disease, including patterns that may involve a relatively large number of SNPs and patterns that vary from patient to patient because of either missing data or natural variation.

More specifically, SNP data can be represented as a binary data matrix, and the task of finding the desired connections can be cast as a problem of finding patterns in such matrices. The area of data mining known as association analysis has extensively investigated approaches for addressing similar tasks. We will build on our own and others' work in association analysis, and on the experience of our IBM Research collaborator in SNP data analysis, to evaluate the usefulness of association analysis techniques for finding SNP patterns related to disease. Because of the potential size of the data, the computationally demanding nature of the task, and the need for quick analysis turnaround time, we will create parallel versions of the most promising algorithms for the IBM Blue Gene using facilities provided by IBM Rochester.
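
To make the binary-matrix framing concrete, the following is a minimal illustrative sketch, not the method proposed here. It assumes a tiny invented data set in which each row is a patient and each column records whether a variant allele is present at one SNP. The SNP names, the genotype values, the disease labels, and the helper functions `support_and_confidence` and `frequent_patterns` are all hypothetical. The sketch enumerates SNP sets by brute force and reports, for each set, the fraction of patients carrying all of its SNPs (its support) and the fraction of those carriers who have the disease (the confidence of the rule "SNP set implies disease"). Practical association analysis algorithms prune this search rather than enumerating every combination.

```python
from itertools import combinations

# Hypothetical toy data: rows are patients, columns are binary SNP items
# (1 = variant allele present). All names and values are invented.
snp_names = ["snp_a", "snp_b", "snp_c"]
genotypes = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
disease = [1, 1, 0, 1, 0, 0]  # 1 = case, 0 = control


def support_and_confidence(rows, labels, columns):
    """Return (support of the SNP set, confidence of SNP set -> disease)."""
    # Labels of the patients who carry every SNP in `columns`.
    matches = [label for row, label in zip(rows, labels)
               if all(row[c] for c in columns)]
    support = len(matches) / len(rows)
    confidence = sum(matches) / len(matches) if matches else 0.0
    return support, confidence


def frequent_patterns(rows, labels, min_support, max_size):
    """Brute-force search over SNP sets of up to max_size columns."""
    n_cols = len(rows[0])
    results = {}
    for size in range(1, max_size + 1):
        for columns in combinations(range(n_cols), size):
            support, confidence = support_and_confidence(rows, labels, columns)
            if support >= min_support:
                results[columns] = (support, confidence)
    return results


patterns = frequent_patterns(genotypes, disease, min_support=0.5, max_size=2)
for columns, (support, confidence) in patterns.items():
    names = [snp_names[c] for c in columns]
    print(names, round(support, 2), round(confidence, 2))
```

In this toy data the pair of the first two SNPs is carried by half of the patients, all of whom are cases, whereas each of those SNPs alone is a weaker indicator. This is the kind of multi-SNP pattern that single-SNP tests can miss. Because the number of candidate sets grows combinatorially with the number of SNPs, exhaustive enumeration of this kind is only feasible for very small inputs.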