Date of Degree
PhD (Doctor of Philosophy)
Along with the exponential growth of text data on the Web, particularly of user-generated content, comes an increasing need for hierarchically organizing documents, retrieving documents, and discovering evolutionary trends of various popular topics from the data. However, all of these are challenging due to the diversity, heterogeneity, noisiness and time-sensitivity of Web 2.0 data. Motivated by this, we tackle the challenges at a fundamental level by proposing a novel topic modeling method with ontological guidance. It may be used to discover topic language models formalizing various terms relevant to given topics using Web data. The topic model takes into account both the ontological relationships amongst the topics defined in a topic taxonomy and word co-occurrence patterns in the data to automatically identify the portions of the data relevant to the topics. Then, it estimates language models for these topics from these relevant portions. At an application level, we use the topic model to propose novel approaches for three different tasks, namely hierarchical text classification without labeled data, information retrieval with pseudo-relevance feedback, and discovering topic evolutionary trends. Our classification experiment on the IPTC (International Press and Telecommunications Council) taxonomy, containing more than 1100 topics, shows that our approach achieves a performance of 67% in terms of the hierarchical version of the F-1 measure, without using any labeled data. Our retrieval experiments on five benchmark datasets show that, compared to baseline retrieval (without pseudo-relevance feedback), our approach improves mean average precision by 39% on average.
Finally, for the last task, using blog data, our approach discovers meaningful insights into how the crowd responds to various news topics, including the language used to discuss each topic, how this language drifts over time, and when the crowd's focus on a topic increases, reaches a peak, and declines.
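For readers unfamiliar with the retrieval metric cited above, the following is a minimal sketch of how mean average precision is conventionally computed. It is not taken from the thesis; the function names, document identifiers, and toy rankings are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Sums precision@k over every rank k that holds a relevant document,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision over all queries."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: two queries with ranked results and relevance sets.
toy_runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
map_score = mean_average_precision(toy_runs)
```

Under this definition, a retrieval method is rewarded for placing relevant documents near the top of each ranking, which is why the measure is commonly used to compare a baseline against a feedback-based variant.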
ix, 94 pages
Includes bibliographical references (pages 87-94).
Copyright 2011 Viet Thuc Ha