University of Minnesota


Initiatives in Digital Technology: 2008–09 Funded Proposals

Abhishek Chandra, Zhi-Li Zhang, Greg Bronevetsky, Kevin Anderson

A Holistic Approach to Fault Inference and Prediction in Distributed Systems

Distributed systems suffer from numerous faults that degrade their performance and functionality. Such faults can occur at different levels of the system: the network, the node, or the application. Moreover, because of the complex interactions among components, many of these faults are related, and a fault in one location or component can affect other parts of the system in unanticipated ways. These interactions make identifying, predicting, and localizing faults extremely difficult. Distributed systems often collect large amounts of monitoring data at different levels, in the form of system, application, and service logs that provide detailed raw statistics about events occurring at different components of the system. However, the sheer volume and diversity of these raw data, and their apparent disconnect from each other, make them extremely difficult for human operators to analyze and use for fault diagnosis and troubleshooting. In this project, we intend to take a holistic approach to fault inference and prediction in distributed systems, one that would enable us to correlate these diverse data and obtain a more detailed and consistent view of the system. As part of our approach, we intend to develop advanced data analysis and inference techniques for fault classification and diagnosis, root cause analysis, online failure detection, and failure prediction based on the available monitoring data. The goal is to automate the monitoring and troubleshooting of large-scale networked systems so as to detect fault causes quickly and scalably, and to proactively prevent such faults from recurring. This project will involve collaboration with researchers at Lawrence Livermore National Laboratory and Red Hat.