University of Minnesota

DTC Leading Edge Seminar Series

Learning Table Extraction on the Fly

by

Frank Schilder, Ravi Kondadadi
Thomson Reuters

Wednesday, January 25, 2012
3:30 p.m. reception
4:00 p.m. seminar

401/402 Walter Library

Tables are ubiquitous in many types of text. News messages report economic statistics in tabular form, scientists present their results in tables, and court clerks list the participants in a legal case. Standard Natural Language Processing (NLP) tools, however, assume a narrative structure when extracting company and person names and the relations among them, and they produce poor results on tabular text. We at Thomson Reuters R&D have developed a novel table extraction approach. In contrast to previous approaches that apply spatial reasoning to the positional information of table cells and headers, we do not assume that tables are encoded in HTML or even that their columns or rows are perfectly aligned. Because tables are often copied from a structured environment, such as web pages and spreadsheets, into text where the formatting is not preserved, we propose a parsing technique that uses two simple parsing heuristics. Tables can be difficult to parse in general because information can be encoded in them in many different ways. Our approach starts by finding the data cells (i.e., bid/ask prices) in trader emails and pulls out all tokens associated with the respective price. In effect, the approach "flattens" the table by pulling out the sequences of tokens that have scope over a data cell. We also propose a clustering and classification method for finding prices reliably in the data set we used. We will conclude the talk with a discussion of how the proposed method transfers to other data cell types and can be applied to other table content.
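To make the idea of "flattening" concrete, the following is a minimal Python sketch, not the speakers' implementation. The abstract does not spell out the two parsing heuristics, so the sketch substitutes two assumed stand-ins: the non-numeric tokens on a row are treated as having scope over every price in that row, and each price is paired with the header token whose starting character offset is nearest, which tolerates loosely aligned columns. The names `PRICE`, `tokens_with_offsets`, and `flatten_table`, as well as the sample data, are hypothetical.

```python
import re

# Assumption: a "price" is a plain decimal number such as 99.50.
PRICE = re.compile(r"^\d+(?:\.\d+)?$")


def tokens_with_offsets(line):
    """Return (token, start_offset) pairs for whitespace-separated tokens."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", line)]


def flatten_table(text):
    """Flatten a loosely aligned text table into (row label, header, price).

    Assumed heuristics (illustrative only):
      1. Non-price tokens on a row have scope over every price in that row.
      2. Each price belongs to the header token with the nearest start offset.
    The first non-empty line is assumed to be the header row.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = tokens_with_offsets(lines[0])
    records = []
    for line in lines[1:]:
        toks = tokens_with_offsets(line)
        row_label = " ".join(tok for tok, _ in toks if not PRICE.match(tok))
        for tok, pos in toks:
            if PRICE.match(tok):
                column = min(header, key=lambda h: abs(h[1] - pos))[0]
                records.append((row_label, column, float(tok)))
    return records


# Made-up sample with deliberately imperfect column alignment.
sample = "\n".join([
    "Bond        Bid     Ask",
    "ACME 5Y    99.50  99.75",
    "Globex 10Y  101.2   101.6",
])

for record in flatten_table(sample):
    print(record)
```

Each output record carries the tokens that scope over one data cell, so downstream extraction can treat the table as a flat list of facts. A real system would also need the price-finding step described above, which this sketch replaces with a simple regular expression.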

 


Frank Schilder

FRANK SCHILDER is a research manager in the Research & Development department of Thomson Reuters. He joined Thomson Reuters in 2004, where he has been doing applied research on summarization technologies and information extraction systems. His summarization work has been implemented as the snippet generator for search results in WestlawNext, the new legal research system produced by Thomson Reuters. His current research activities include participation in research competitions such as the Text Analysis Conference (TAC), organized by the National Institute of Standards and Technology (NIST). Dr. Schilder obtained a Ph.D. in Cognitive Science from the University of Edinburgh, Scotland, in 1997. From 1997 to 2003, he was employed by the Department for Informatics at the University of Hamburg, Germany, first as a postdoctoral researcher and later as an assistant professor.


Ravi Kondadadi

RAVI KONDADADI is a senior research scientist in the Thomson Reuters R&D department and holds a Master's degree in Computer Science from the University of Memphis. His research focuses primarily on information extraction, text summarization, and text generation. His current interests include the application of semi-supervised learning approaches to information extraction problems.