SummerFest '09 Summer School: An Introduction to the Inner Workings of Information Retrieval Systems

Course Description

Most researchers have some idea of how a search engine works: documents are gathered and a back-of-the-book index is created to support fast searching for a user’s query. But what detail lies behind that picture? This seminar provides an introduction, covering not only indexing and searching document collections but also the complex tasks of gathering documents from the web and other sources. It is aimed at researchers from computer science or information science who want to know more about search beyond the more commonly discussed topic of how to rank documents relative to a user’s query.

More specifically, the seminar will cover the following topics.

How to collect documents from the Web. While this might seem straightforward, the seminar will show why a large chunk of the web cannot be found through crawling. Issues such as scanning for updates and accessing parts of the deep web will also be covered.

Processing and normalizing gathered documents, which will be shown to be a complex, messy job: hundreds of document and character formats abound and, worse, their definitions are often ignored or sometimes deliberately broken. The challenge of building a search engine increases when different languages are included in a collection, as many add their own distinct problems; this important aspect will also be covered.

Means of effectively indexing large document collections, which will be the core part of the seminar. The research in this area will be described at a level that is both accessible to non-programmers and detailed enough to engage computer scientists who are new to the topic. This part will progressively introduce increasingly sophisticated indexing and retrieval techniques to show how the implementation of searching systems has adapted to growing collection sizes.
Here, methods such as document-at-a-time searching, ordering of index structures by query-independent features, use of compression, and parallelizing search will be described. Means of creating indexing structures that allow efficient handling of updates to the collection will also be described.

Finally, a brief overview of some of the public domain searching systems that researchers commonly use will be provided. Not only will the well-known research platforms, such as Lemur and Terrier, be described, but means of implementing ranked retrieval on database systems such as Oracle or MySQL will also be detailed.

At the end of this seminar, you will have a strong understanding of the broader issues that affect the implementation of searching systems, as well as a set of references to the important papers and books that you can use to learn more about this often overlooked topic.

Presenter

Dr Mark Sanderson

Presenter Biography

Mark Sanderson is a Reader in the Information Studies Department at the University of Sheffield. He is a researcher in information retrieval (IR) with a strong interest in the evaluation of search engines. He also works in geographic search, cross-language IR (CLIR), summarization, image retrieval by captions and word sense ambiguity, and has built searching systems. Mark is on the editorial board of four of the leading IR journals and this year was co-PC chair of ACM SIGIR 2009. He is currently an investigator on three active research projects. He teaches courses in Web search and introduction to IR, and contributes to courses in multimedia, essay writing and information security.