CERN Computing Seminar

Anomaly Detection using the "Isolation Forest" algorithm

by David Gerster (BigML Inc.)

31/3-004 - IT Amphitheatre (CERN)

Description

Anomaly detection can provide clues about an outlying minority class in your data: hackers in a set of network events, fraudsters in a set of credit card transactions, or exotic particles in a set of high-energy collisions. In this talk, we analyze a real dataset of breast tissue biopsies, with malignant results forming the minority class.

The "Isolation Forest" algorithm finds anomalies by deliberately “overfitting” models that memorize each data point. Since outliers have more empty space around them, they take fewer steps to memorize. Intuitively, a house in the country can be identified simply as “that house out by the farm”, while a house in the city needs a longer description like “that house in Brooklyn, near Prospect Park, on Union Street, between the firehouse and the library, not far from the French restaurant”.
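The "fewer steps" intuition can be illustrated with a minimal sketch. It is a simplified toy version of the idea, not the speaker's or BigML's implementation. It repeatedly splits the data on a random feature at a random value and counts how many splits are needed before a point is alone. The two-dimensional "city" and "farmhouse" data are synthetic, all function names are hypothetical, and refinements of the full algorithm such as subsampling and score normalisation are left out.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone in `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))          # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                             # nothing left to split on this feature
        return depth
    split = rng.uniform(lo, hi)              # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_isolation_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Synthetic data (an assumption for illustration): a dense "city" cluster
# plus one isolated "farmhouse" far away from it.
gen = random.Random(42)
city = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
farmhouse = (8.0, 8.0)
data = city + [farmhouse]
city_house = min(city, key=lambda p: p[0] ** 2 + p[1] ** 2)  # most central house

farm_depth = mean_isolation_depth(farmhouse, data)
city_depth = mean_isolation_depth(city_house, data)
print(f"farmhouse: {farm_depth:.1f} splits, city house: {city_depth:.1f} splits")
```

A lower average depth means the point is easier to isolate and therefore more anomalous. In this sketch the farmhouse should need far fewer splits than the house in the middle of the city.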

We first use anomaly detection to find outliers in the biopsy data, then apply traditional predictive modeling to discover rules that separate anomalies from normal data. These rules provide surprisingly strong clues about which biopsies are malignant. Interestingly, anomaly detection continues to provide strong clues even when fitted to data with only benign biopsies.
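The second step, discovering rules that separate anomalies from normal data, can be sketched as follows. The abstract does not say which predictive model is used, so this example assumes the simplest possible one: a single "feature > threshold" rule (a decision stump) chosen to agree best with the anomaly flags. The rows, flags and feature names are invented toy values, not biopsy data.

```python
def best_threshold_rule(rows, flags, feature_names):
    """Return (accuracy, feature, threshold) for the single
    'feature > threshold' rule that best reproduces the anomaly flags."""
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            correct = sum(
                (row[j] > threshold) == flag for row, flag in zip(rows, flags)
            )
            accuracy = correct / len(rows)
            if best is None or accuracy > best[0]:
                best = (accuracy, name, threshold)
    return best


# Toy input (an assumption for illustration): two features per row, and a flag
# saying whether an anomaly detector marked that row as an outlier.
rows = [(1.0, 5.0), (1.2, 3.0), (0.9, 4.0), (1.1, 6.0), (4.0, 5.5), (4.5, 3.5)]
flags = [False, False, False, False, True, True]

accuracy, feature, threshold = best_threshold_rule(
    rows, flags, ["feature_a", "feature_b"]
)
print(f"rule: {feature} > {threshold:.2f} (accuracy {accuracy:.2f})")
```

A real analysis would more likely fit a full decision tree or another supervised model, but the output has the same shape: a human-readable rule that explains why the flagged rows stand out.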

About the speaker

David Gerster is Vice President of Data Science at BigML, where he promotes the idea that data science is easy by speaking at conferences and teaching. Since joining BigML in July 2013, he has spoken at Big Data Spain, Papis.io, DataLead (UC Berkeley), DataBeat (VentureBeat), and more than a dozen other venues. Recently he taught a two-day class at the Polytechnic University of Valencia that covered supervised and unsupervised learning.

At Groupon, he built an elite data science team that trained the first machine-learned models for mobile deal relevance. At Yahoo, he led the project to collect billions of URL clickstreams in Hadoop and use them to improve web search ranking, resulting in measurable improvements to Yahoo’s main web search algorithm. He holds an MBA from the University of California at Berkeley and a Bachelor’s degree from Harvard University.


Organised by: Matthias Braeger, GS Department and Miguel Angel Marquina
Computing Seminars / IT Department

Video in CDS