Investigating the Usefulness of Markup-Based Knowledge Extraction


Web search algorithms have matured significantly in recent years, and a query submitted to a standard search engine such as Google usually returns excellent matches. However, there is also a growing number of electronic document collections in companies, universities and other institutions. Such collections typically cover a much smaller domain, but there is normally no explicit domain model describing the content of the documents and the relations between them. As with Web search, finding the right information can be easy, but it can also be difficult either to find any matching document at all or to select the most appropriate ones from a large set of potential matches. Nevertheless, in these "limited domains" it is feasible to exploit the documents' markup structure to build a domain model automatically. That model should assist users by refining the choices they are offered or can make as they search the document collection. To the user, this will appear as a specialized dialogue with the system. The first part of the research will be the development of a dialogue model that applies the domain knowledge. The second part will be the evaluation of such a system to measure its usefulness.
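To make the idea of exploiting markup structure more concrete, here is a minimal sketch in Python. It is not the project's actual extraction method: the set of tags treated as markup contexts, the threshold of two contexts, and the names ContextCollector and candidate_concepts are all assumptions made for illustration. The sketch treats a term as a candidate concept for the domain model if it occurs in at least two different markup contexts of a document.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags treated as "markup contexts" (an assumption made for this sketch).
CONTEXT_TAGS = {"title", "h1", "h2", "h3", "b", "strong", "em", "a"}


class ContextCollector(HTMLParser):
    """Records, for every term, the set of markup contexts it appears in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                # context tags currently open
        self.contexts = defaultdict(set)   # term -> set of tag names

    def handle_starttag(self, tag, attrs):
        if tag in CONTEXT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost open occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        if not self.open_tags:
            return  # plain text outside any context tag is ignored
        for word in data.lower().split():
            term = word.strip(".,;:!?()\"'")
            if len(term) > 2:  # skip very short tokens
                for tag in self.open_tags:
                    self.contexts[term].add(tag)


def candidate_concepts(html, min_contexts=2):
    """Return terms found in at least `min_contexts` distinct markup contexts."""
    collector = ContextCollector()
    collector.feed(html)
    collector.close()
    return sorted(
        term for term, tags in collector.contexts.items()
        if len(tags) >= min_contexts
    )


page = (
    "<html><head><title>Library opening hours</title></head>"
    "<body><h1>Library</h1>"
    "<p>The <b>library</b> is open. See <a href='#'>opening hours</a>.</p>"
    "</body></html>"
)
print(candidate_concepts(page))
```

A real system would need more than this (stop-word filtering, multi-word terms, and relations between concepts across documents), but the sketch shows how markup alone can single out terms without any hand-built domain model.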

Funding Body

EPSRC

June 2003 - September 2004

EPSRC's Overall Assessment

Tending to Outstanding

Principal Investigator

Dr Udo Kruschwitz

Senior Research Officers

Mrs Hala Al-Bakour

Dr Patrick Mills

Related Publications

Kruschwitz, U. and H. Al-Bakour "Users Want More Sophisticated Search Assistants - Results of a Task-Based Evaluation". Journal of the American Society for Information Science and Technology (JASIST), 56(13): 1377-1393, August 2005.

Kruschwitz, U. "An Adaptable Search System for Collections of Partially Structured Documents". IEEE Intelligent Systems, 18(4): 44-52, July/August 2003.

Kruschwitz, U. "Automatically Acquired Domain Knowledge for ad hoc Search: Evaluation Results". In Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE'03), Beijing, 2003. IEEE.

Online Demo Systems

Prototypes have been built for a variety of sample domains, including the Web site of the University of Essex and the BBC News Web site. Both search systems use Google's API as a backend search engine, whose results are enriched by an automatically acquired domain model that assists the user in the search process. The online systems are password protected.
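The way a domain model might assist a user on top of a backend search engine can be sketched as follows. This is an illustrative assumption, not the prototypes' actual code: the domain model is represented here as a plain dictionary from a term to related terms, the function name suggest_refinements is hypothetical, and no call to any real search API is made.

```python
def suggest_refinements(query, domain_model, max_suggestions=5):
    """Propose refined queries by appending related terms from the domain model.

    `domain_model` is assumed to map a lower-case term to a list of
    related terms (a simplification made for this sketch).
    """
    query_terms = query.lower().split()
    suggestions = []
    for term in query_terms:
        for related in domain_model.get(term, []):
            if related in query_terms:
                continue  # the user already asked for this term
            refined = f"{query} {related}"
            if refined not in suggestions:
                suggestions.append(refined)
    return suggestions[:max_suggestions]


# Hypothetical domain model for a university Web site.
model = {
    "library": ["opening hours", "catalogue"],
    "exam": ["timetable"],
}
print(suggest_refinements("library", model))
```

In a dialogue-style interface, such suggestions would be offered alongside the backend's results, so that the user can narrow an underspecified query in one step.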

Search BBC News

Search Essex University


Udo Kruschwitz, e-mail:

© The Udo / last change: 4 October 2005