Date of Degree
PhD (Doctor of Philosophy)
The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and to identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific kinds of topics, mainly those representing people, and pay little attention to topics of other kinds. They also tend to be inflexible in how they represent topics.
In this thesis we study existing methods and describe a different approach, based on profiles, for representing topics. A Topic Profile is analogous to a synopsis of a topic and consists of different types of features. Profiles are flexible, allowing different combinations of features to be emphasized, and extensible, allowing new features to be incorporated without changing the underlying logic.
More generally, topic profiles provide an abstract framework that can be used to create different types of concrete representations for topics. Choices such as the number of documents considered for a topic or the types of features extracted can be made based on the requirements of the problem and the characteristics of the data. Topic profiles also provide a framework to explore relationships between topics.
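The profile idea described above can be illustrated with a minimal sketch. This is not the implementation used in the thesis: the class `TopicProfile`, the functions `cosine` and `profile_similarity`, the feature type names, and the use of cosine similarity are all assumptions chosen for illustration. The sketch only shows how a profile might hold several feature types, how a caller might emphasize some types over others, and how two profiles might be compared.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TopicProfile:
    """Hypothetical profile: maps a feature type to weighted features."""
    topic: str
    features: dict = field(default_factory=dict)

    def add(self, feature_type, items):
        # A new feature type is just a new key, so no other code changes.
        self.features.setdefault(feature_type, Counter()).update(items)


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (assumed measure)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def profile_similarity(p, q, emphasis):
    """Weighted average of per-type similarities.

    `emphasis` maps feature type -> non-negative weight, which lets the
    caller stress some feature types and ignore others.
    """
    total = sum(emphasis.values())
    if total == 0:
        return 0.0
    score = 0.0
    for feature_type, weight in emphasis.items():
        a = p.features.get(feature_type, Counter())
        b = q.features.get(feature_type, Counter())
        score += weight * cosine(a, b)
    return score / total


# Usage with made-up topics and features.
topic_a = TopicProfile("Topic A")
topic_a.add("terms", ["health", "care", "budget"])
topic_a.add("entities", ["Committee X"])

topic_b = TopicProfile("Topic B")
topic_b.add("terms", ["health", "care", "budget"])
topic_b.add("entities", ["Committee Y"])

similarity = profile_similarity(topic_a, topic_b, {"terms": 1.0, "entities": 1.0})
```

In this sketch, changing the `emphasis` weights changes which feature types drive the comparison, and adding a feature type requires only another call to `add`.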
We compare different methods for building profiles and evaluate them in terms of their information content and their ability to predict relationships between topics. We contribute new methods for term weighting and for identifying relevant text segments in web documents.
In this thesis, we present an application of our profile-based approach to explore social networks of US senators generated from web data and compare them with networks generated from voting data. We consider both general and issue-specific networks. We also apply topic profiles to identify and rank experts given topics of interest, as part of the 2007 TREC Expert Search task.
Overall, our results show that topic profiles provide a strong foundation for exploring different topics and for mining relationships between topics using web data. Our approach can be applied to a wide range of web knowledge discovery problems, in contrast to existing approaches that are mostly designed for specific problems.
Text Mining; Web Mining; Knowledge Discovery; Topic Profile
xiii, 140 pages
Copyright 2007 Aditya Kumar Sehgal