University of Minnesota

DTC Leading Edge Seminar Series

Named Entity Extraction in the Real World: More than Just High Accuracy


Marc Light
Thomson Reuters

Tuesday, November 16, 2010
3:30 p.m. reception
4:00 p.m. seminar

401/402 Walter Library

Named Entity Extraction is the task of finding entities, such as people, locations, products, and companies, in text. This task has been around at least since 1995, when a number of systems were formally evaluated as part of DARPA's sixth Message Understanding Conference. Named Entity Extraction systems are usually part of a larger system: for example, a system might scan a newspaper article looking for person names and then link these names to landing pages containing information relevant to the reader. The task is non-trivial because many names, such as my last name, have other uses. Luckily, human languages provide many disambiguating cues; named entity systems try to discover, encode, and weight these cues optimally. More specifically, the task can be formalized as a sequence labeling task, and statistical models can be employed to perform it. I will present one such model. Given a moderate amount of suitable training data, such models perform with high accuracy (mid-90s), and thus the task would seem to be solved. That was my opinion when I started at Thomson Reuters.

However, after building a system and fielding it as part of three applications sold to paying customers, I am of a different opinion. A number of system characteristics are as important as accuracy. A short list would contain the ability to provide a confidence rating for each entity found, to trade off precision and recall at run time, to provide a salience rating for each entity, and to eliminate certain specific errors. I will describe how we added these capabilities, among others, to our system. In addition, I will describe a number of further characteristics that we are currently working on. [This work was done jointly with Terry Heinze, Harsha Veeramachaneni, and Ramdev Wudali.]
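To make the sequence labeling formulation and the run-time precision/recall trade-off concrete, here is a minimal sketch in Python. It is not the system described in the talk. It assumes a hypothetical labeler has already assigned each token a BIO tag (B- begins an entity, I- continues it, O is outside any entity) and a probability; the tags, probabilities, and the names `Entity`, `decode_entities`, and `filter_by_confidence` are all illustrative. Scoring an entity by its least certain token is one simple choice among many.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    text: str
    label: str
    confidence: float  # illustrative: the lowest token probability in the span


def decode_entities(tokens: List[str], tags: List[str], probs: List[float]) -> List[Entity]:
    """Turn per-token BIO tags into entity spans with a confidence score."""
    entities: List[Entity] = []
    words: List[str] = []
    span_probs: List[float] = []
    label = None

    def close_span() -> None:
        # Emit the span collected so far, if any, and reset the state.
        nonlocal words, span_probs, label
        if words:
            entities.append(Entity(" ".join(words), label, min(span_probs)))
        words, span_probs, label = [], [], None

    for token, tag, prob in zip(tokens, tags, probs):
        if tag.startswith("B-"):
            close_span()
            words, span_probs, label = [token], [prob], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
            span_probs.append(prob)
        else:
            # "O" tags, and "I-" tags that do not continue the open span, end it.
            close_span()
    close_span()
    return entities


def filter_by_confidence(entities: List[Entity], threshold: float) -> List[Entity]:
    """Keep entities at or above the threshold; raising it favors precision over recall."""
    return [e for e in entities if e.confidence >= threshold]


# Made-up labeler output for a sentence where "Light" is a name and "light" is not.
tokens = ["Marc", "Light", "turned", "on", "the", "light", "in", "Minnesota"]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-LOC"]
probs = [0.97, 0.62, 0.99, 0.99, 0.99, 0.88, 0.99, 0.91]

entities = decode_entities(tokens, tags, probs)
lenient = filter_by_confidence(entities, 0.5)  # favors recall
strict = filter_by_confidence(entities, 0.8)   # favors precision
```

With these made-up numbers, the lenient threshold keeps both "Marc Light" and "Minnesota", while the strict threshold keeps only "Minnesota", because the ambiguous token "Light" lowers the confidence of the person span. Because the threshold is applied after labeling, an application can change it at run time without retraining the model.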


Marc Light is a Lead Scientist in the Thomson Reuters R&D department. Since joining Thomson Reuters in 2006, he has led a number of information extraction and search projects that have enabled a variety of products. He has also stayed active in the research community by organizing a workshop on software engineering for human language technology (SETQA-NLP 2009), publishing a number of workshop papers, and reviewing articles for NAACL, BioNLP, and BMC Bioinformatics. Prior to joining Thomson Reuters, Marc held positions at the University of Iowa, The MITRE Corporation, Stuttgart University, and the University of Tuebingen. He earned a PhD in Computer Science from the University of Rochester and did his undergraduate work in Cognitive Science at MIT.