Date of Degree
PhD (Doctor of Philosophy)
William N. Street
Classification is a data mining problem that arises in many real-world applications. A popular approach to tackle these classification problems is using an ensemble of classifiers that combines the collective knowledge of several classifiers. Most popular methods create a static ensemble, in which a single ensemble is constructed or chosen from a pool of classifiers and used for all new data instances. Two factors that have been frequently used to construct a static ensemble are the accuracy of and diversity among the individual classifiers. There have been many studies investigating how these factors should be combined and how much diversity is required to increase the ensemble's performance. These results have concluded that it is not trivial to build a static ensemble that generalizes well. Recently, a different approach has been undertaken: dynamic ensemble construction. Using a different set of classifiers for each new data instance rather than a single static ensemble of classifiers may increase performance since the dynamic ensemble is not required to generalize across the feature space. Most studies on dynamic ensembles focus on classifiers' competency in the local region in which a new data instance resides or agreement among the classifiers. In this thesis, we propose several other approaches for dynamic class prediction.
Existing methods focus on assigned labels or their correctness. We hypothesize that using the class probability estimates returned by the classifiers can enhance our estimate of the competency of classifiers on the prediction. We focus on how to use class prediction probabilities (confidence) along with accuracy and diversity to create dynamic ensembles and analyze the contribution of confidence to the system. Our results show that confidence is a significant factor in the dynamic setting. However, it is still unclear how accurate, diverse, and confident ensemble can best be formed to increase the prediction capability of the system.
Second, we propose a system for dynamic ensemble classification based on a new distance measure to evaluate the distance between data instances. We first map data instances into a space defined by the class probability estimates from a pool of two-class classifiers. We dynamically select classifiers (features) and the k-nearest neighbors of a new instance by minimizing the distance between the neighbors and the new instance in a two-step framework. Results of our experiments show that our measure is effective for finding similar instances and our framework helps making more accurate predictions.
Classifiers' agreement in the region where a new data instance resides has been considered a major factor in dynamic ensembles. We postulate that the classifiers chosen for a dynamic ensemble should behave similarly in the region in which the new instance resides, but differently outside of this area. In other words, we hypothesize that high local accuracy, combined with high diversity in other regions, is desirable. To verify the validity of this hypothesis we propose two approaches. The first approach focuses on finding the k-nearest data instances to the new instance, which then defines a neighborhood, and maximizes simultaneously local accuracy and distant diversity, based on data instances outside of the neighborhood. The second method considers all data instances to be in the neighborhood, and assigns them weights depending on the distance to the new instance. We demonstrate through several experiments that weighted distant diversity and weighted local accuracy outperform all benchmark methods.
In classification problems, observations fall into preassigned groups. Examples include identifying customers who would buy a product, and detecting whether a credit card expense is made by a customer. A popular approach to tackle these problems is using a collection of models that combines the collective knowledge of them. It has been shown that employing multiple models outperforms a single model. A common approach has been to use the same collection for all observations, which is also known as the static approach. Recently, there have been more attempts in using a different collection that is more specialized for each observation, depending on the features of observations. This is referred to as the dynamic approach.
In this thesis, we adopt the dynamic approach and explore what sort of characteristics we would like our models or the collections to exhibit. Two factors have been used frequently in the literature: accuracy and diversity of models. The second factor is about how different models are in a collection. In addition to these two factors, we consider a third one: confidence of models. First, we investigate to what extent confidence can enhance the competency of models on the prediction. Second, we propose a new measure in the dynamic approach to evaluate the similarity between observations. We show that our measure is effective for finding similar observations and our framework helps making more accurate predictions. Finally, we return our attention to the diversity factor and analyze how diversity should be assessed in a dynamic setting.
publicabstract, Classifiers, Confidence, Data Mining, Diversity, Dynamic Class Prediction, Dynamic Ensembles
xv, 161 pages
Includes bibliographical references (pages 155-161).
Copyright 2015 Şenay Yaşar Sağlam