SummerFest '09 Summer School: An Introduction to Information Extraction

Course Description

Vast amounts of human knowledge are stored within documents written in natural languages. This information is difficult to access using automatic methods and is generally only accessible to humans. Information Extraction (IE) is an important language technology which attempts to solve this problem by identifying information within textual documents and converting it to a machine-readable format. IE has been applied to a wide range of problems including the identification of disease outbreaks from health reports, terrorism events from news reports, executive movements within companies from business news stories and interactions between genes and proteins from scientific journals. Uses of IE include intelligence, drug discovery, text mining and knowledge discovery.

The course will provide an accessible introduction to the problem of IE and its applications. A range of approaches will be described, with the main focus on techniques that make use of machine learning to reduce the effort and expert knowledge required to create IE systems. The course will include relevant background information, including approaches to the evaluation of IE systems and the language processing techniques they use.

Course Outline

(1) Overview of Information Extraction

The course will begin with a description of the IE problem. Examples of what might be expected when an IE system is applied to different extraction tasks are provided. The main components of IE systems are described. The two major approaches to creating IE systems (knowledge engineering and machine learning based) are presented. This section concludes with a brief example of how a knowledge engineering based IE system could be created.

(2) Evaluation

The next section describes the most important aspects of IE system evaluation.
Topics covered include evaluation by comparison against a manual gold standard, the Message Understanding Conferences (including a description of the template structures used), the evaluation metrics commonly used to compare the performance of different systems (Precision, Recall and F-measure) and system performance in standard evaluations.

(3) Learning IE systems

The final section focuses on the use of Machine Learning to automatically create and adapt IE systems. It begins by reviewing the main advantage of this approach, namely that it avoids the need for expert domain knowledge, which is often difficult or impossible to obtain. A range of approaches are described, including two influential systems: AutoSlog (Riloff, 1993) and ExDisco (Yangarber et al., 2000). AutoSlog examines filled templates and the documents they came from to generate a set of IE patterns that are then filtered by a human expert. The ExDisco system uses a different approach and only needs two sets of documents: one containing information relevant to the IE problem and one that does not. This approach does not require filled templates to generate a set of IE rules. Advantages and disadvantages of these systems are discussed. Some other machine learning approaches to IE that extend these systems are also briefly discussed. The course concludes with a discussion of some of the open problems within IE.

References

Riloff, E. (1993) "Automatically Constructing a Dictionary for Information Extraction Tasks", Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93).

Yangarber, R., Grishman, R., Tapanainen, P. and Huttunen, S. (2000) "Unsupervised Discovery of Scenario-Level Patterns for Information Extraction", Proceedings of the Conference on Applied Natural Language Processing (ANLP-NAACL 2000).

Course Requirements

No prior knowledge is necessary, although some background in Natural Language Processing and Machine Learning would be useful.

Presenter

Dr Mark Stevenson

Presenter Biography
Mark Stevenson is an EPSRC Advanced Research Fellow (2006-2011) and lecturer in Sheffield University's Natural Language Processing group, where he has worked since 1995. His research interests include word sense disambiguation, lexical semantics, information extraction, information retrieval and the processing of text in the biomedical domain. He completed his PhD in 1999 with a thesis on the application of a range of linguistic knowledge sources for word sense disambiguation. In 2000 he was a short-term research fellow in British Telecom's research labs, Adastral Park. From 2001 to 2002 he worked for Reuters, where he was involved in the production and dissemination of the Reuters Corpus. He was chosen as the first Reuters Foundation Digital Vision Fellow and spent October 2001 to June 2002 at the Center for the Study of Language and Information, Stanford University.