Initiatives in Digital Technology: 2007–08 Funded Proposals

Abhishek Chandra, Zhi-Li Zhang, Sambit Sahu, Kuai Xu

A Fault Inference and Troubleshooting Framework for Large-Scale Networked Systems

Large-scale networked systems are prone to numerous faults that affect their performance and functionality in highly unanticipated ways. Such faults can occur at both the network and node levels and have many causes, including malicious code, misconfigurations, and resource overloads. Finding the cause of such failures is extremely difficult because of the complexity and scale of these systems. Although a large amount of monitoring data is available, providing detailed raw statistics about network- and node-level events, the sheer volume and diversity of the raw data make it extremely difficult for human operators to analyze and use for fault diagnosis and troubleshooting. In this project, we intend to develop a general and systematic framework for building a knowledge extraction engine for health monitoring, problem diagnosis, and troubleshooting of large-scale networked systems. As part of this framework, we intend to develop advanced data analysis and inference techniques for fault classification and diagnosis, root cause analysis, online failure detection, and failure prediction based on the available monitoring data. The goal is to automate the monitoring and troubleshooting of large-scale networked systems so that fault causes can be detected quickly and at scale, and so that such faults can be proactively prevented from recurring. This project will involve collaboration with researchers at IBM and Yahoo!.