March 24, 2006

The news article below originally appeared on UMN news.

Taking a byte out of mountains of data

An open house at the Digital Technology Center took the first step in finding ways to 'mine' databases for nuggets of usable information

By Deane Morrison

In many banks, sensors in the window glass pick up vibrations that signal a break-in. This is good--except during a thunderstorm, when vibrations cause the sensors to send a false alarm signal. The same thing may happen to security sensors in ATMs that pick up vibrations caused by street cleaners. These examples, courtesy of professor Jaideep Srivastava of the University's computer science and engineering department, show how difficult it can be to mine nuggets of meaning from the mountains of data being generated and stored in databases. On Thursday, the University's Digital Technology Center (DTC) brought together University and industry people in an open house to plan the first steps in overcoming the challenges of so-called "data mining."

Without data mining, medical scientists cannot identify genes among the countless DNA sequences in the human genome, businesses cannot identify customer buying patterns, and heart pacemakers cannot pick out warning signals among the ocean of electrical waves emanating from the heart. Data mining is a new and imperfect science, and the open house made it clear that the University is out to help companies overcome the obstacles and improve their performance.

From its base on the fourth floor of the Walter Library on the Twin Cities campus, the DTC brings together researchers studying all things digital (except fingers). The open house attracted representatives of such companies as Target, Guidant, 3M, DuPont, and General Dynamics, along with University faculty in engineering, computer science, biostatistics, and business management. In highlighting the challenges of data mining, speakers pointed to several areas that may become focal points for University-industry research consortia.

"We may set up large groupings, for example a medical consortium and a business consortium, and then let them sort themselves out into more specific groups," said Jim Licari, industrial liaison for the DTC. "We may have one to study computer storage architecture, since we have four to six faculty with expertise in that area. We hope to have long-term strategic relationships with companies, not a one-shot deal." With the open house registration nearly a 50-50 split of University and industry representatives, prospects look good.

In his welcoming remarks, DTC director Andrew Odlyzko spoke of the center's commitment to solving real-world problems through interactions between the University and industry.

"We have a critical mass of faculty expertise," said Odlyzko, pointing to the 32 faculty (along with 33 staff and more than 200 students) with expertise in all kinds of digital technologies, who are affiliated with the center. They and their future industrial partners will have plenty of opportunities, for, as keynoter Usama Fayyad pointed out, they have their work cut out for them.

"It's a myth that data mining is pervasive and that people know how to use tools to mine it. The reality is that data is a shambles," said Fayyad, the chief data officer and senior vice president for research and strategic data solutions at Yahoo! Inc. "We know how to build massive data stores--I call them 'data tombs'--but we can't access [them]. And modern-day pharaohs run around saying, 'My data warehouse is bigger than your data warehouse.'"

A few examples from Fayyad's talk illustrate the power and limitations of current data mining, along with some of the pitfalls awaiting the unwary miner.

"People switching phone companies are a big problem. Companies want models to tell who's going to 'churn' (switch) on them," said Fayyad, who worked with companies on the problem before he joined Yahoo!. "We came up with a model to predict that [a certain person] will 'churn.' But then there's the question, 'Should we let him go or spend $100 to keep him?'"

For phone companies, the answer depends on a customer's "lifetime value" as a contributor to company profit. Computers can mine customer records to determine lifetime value, and they may find patterns in the data that a person would miss. Fayyad told how one group of customers appeared--to human observers--to have low lifetime value. But a computer judged them to have high value, and it was right: The customers were actually engineering students in the last two years of school, and as such were considered excellent prospects for contributing to company profits in the long run.

"The [computer] had keyed in on their behavior, such as calls late at night and on weekends," Fayyad explained.
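The trade-off Fayyad describes can be sketched as a simple expected-value check. The snippet below is an illustration only, not a model from his talk: the function name `should_retain`, the `save_rate` parameter, and every figure other than the $100 offer are hypothetical assumptions.

```python
def should_retain(churn_probability, lifetime_value,
                  retention_cost=100.0, save_rate=1.0):
    """Return True if the expected value saved exceeds the offer's cost.

    churn_probability: model's estimate that the customer will switch (0..1)
    lifetime_value:    estimated future profit from the customer, in dollars
    retention_cost:    cost of the retention offer (the $100 in the example)
    save_rate:         assumed chance the offer actually keeps the customer
    """
    expected_saved = churn_probability * save_rate * lifetime_value
    return expected_saved > retention_cost


# Hypothetical customers: a high-value student and a low-value account.
print(should_retain(churn_probability=0.8, lifetime_value=1000.0))
print(should_retain(churn_probability=0.8, lifetime_value=100.0))
```

Under these assumptions, the same churn prediction leads to different decisions depending on lifetime value, which is why the students in Fayyad's example were worth keeping.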

An example from the world of online retailing showed why it can be dangerous to "follow the leader" in choosing ways to mine customer data to increase sales. If somebody is buying books on Amazon.com, the company will use purchase records from people similar to the customer to recommend more books.

But, said Fayyad, suppose a customer is buying a pair of pants. Using the Amazon.com model would be counterproductive from a business standpoint. Once a customer has selected a pair of pants, showing more styles of pants that may appeal to him or her could well cause the customer to have second thoughts about the purchase. Instead, the company should suggest a shirt to go with the pants, followed by an "impulse item" such as a belt.

A big goal in data mining is to find similar patterns among the mass of data. In medical data, similarities may allow one to identify patients with a certain genetic makeup who tend to react poorly to a drug. In business, one may be hunting for a group of customers who purchase one product, such as a backpack, and who would therefore be likely targets for ads about the company's line of hiking boots. But computers have to be programmed to take into account false similarities that would be immediately obvious to a live person.

"People buying Coke and people buying Pepsi may appear similar in that they both drink soft drinks," said Fayyad. "But those two groups don't overlap."
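One rough way to see the false similarity Fayyad describes is to compare the two groups at the category level and at the customer level. The sketch below uses Jaccard overlap (shared members divided by total members), a measure chosen here for illustration and not one named in the talk. The customer IDs and purchase data are made up.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: shared items / all items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical customer IDs for each brand.
coke_buyers = {"cust_1", "cust_2", "cust_3"}
pepsi_buyers = {"cust_4", "cust_5"}

# At the category level, both groups look identical.
coke_categories = {"soft drinks"}
pepsi_categories = {"soft drinks"}

category_similarity = jaccard(coke_categories, pepsi_categories)
customer_overlap = jaccard(coke_buyers, pepsi_buyers)

print("category similarity:", category_similarity)
print("customer overlap:", customer_overlap)
```

With this made-up data, the groups match perfectly by category yet share no customers, so a program that only looks at categories would treat them as the same audience.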