Document Type


Date of Degree


Degree Name

PhD (Doctor of Philosophy)

Degree In

Computer Science

First Advisor

Eunjin Jung

First Committee Member

Padmini Srinivasan

Second Committee Member

Sriram Pemmaraju

Third Committee Member

Juan Pablo Hourcade

Fourth Committee Member

Brent Kang


This thesis explores the use of applied machine learning techniques to augment traditional methods of identifying and preventing web-based attacks. Several factors complicate the identification of web-based attacks. The first is the scale of the web. The amount of data on the web and the heterogeneous nature of this data complicate efforts to distinguish between benign sites and attack sites. Second, an attacker may duplicate their attack at multiple, unexpected locations (multiple URLs spread across different domains) with ease. Third, attacks can be hosted nearly anonymously; there is little cost or risk associated with hosting or publishing a web-based attack. In combination, these factors lead one to conclude that, currently, the webs threat landscape is unfavorably tilted towards the attacker.

To counter these advantages this thesis describes our novel solutions to web se- curity problems. The common theme running through our work is the demonstration that we can detect attacks missed by other security tools as well as detecting attacks sooner than other security responses. To illustrate this, we describe the development of BayeShield, a browser-based tool capable of successfully identifying phishing at- tacks in the wild. Progressing from specific to a more general approach, we next focus on the detection of obfuscated scripts (one of the most commonly used tools in web-based attacks). Finally, we present TopSpector, a system we've designed to forecast malicious activity prior to it's occurrence. We demonstrate that by mining Top-Level DNS data we can produce a candidate set of domains that contains up to 65% of domains that will be blacklisted. Furthermore, on average TopSpector flags malicious domains 32 days before they are blacklisted, allowing the security community ample time to investigate these domains before they host malicious activity.


applied machine learning, Computer security, Domain Name System, javascript, phishing, web-based attacks


xiii, 140 pages


Includes bibliographical references (pages 131-140).


Copyright 2011 Peter Likarish