Frank Scholten
Day Saturday
Room AW1.124
Capacity 59
Start time 13:45
End time 14:15
Duration 00:30
Track Data Analytics devroom

Introduction to Clustering with Mahout

Analyze and understand large corpora of text data using Apache Hadoop for distributed computation and Mahout as a distributed machine learning toolkit.

Clustering is a popular technique for analyzing and understanding large corpora of text, and is a key feature of, for instance, Google News. Google News automatically groups news articles into distinct clusters so visitors can quickly find what they're looking for. Clustering is also one of the key features of Apache Mahout, an open source framework for scalable data analysis intended to run on Hadoop. This talk introduces clustering, explains how it is implemented in Mahout, and shows step by step how to cluster text documents using Mahout's command line interface. It also explains how to tweak the clustering process and how this affects the generated set of clusters. This is a beginner talk that introduces clustering in general and Mahout in particular.
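
To give a feel for what clustering text means, here is a minimal single-machine sketch of k-means on a handful of toy documents. It is not Mahout's implementation and does not use Hadoop. The documents, the plain term-count vectors, the Euclidean distance and the "first k documents as initial centroids" seeding are all assumptions chosen to keep the example small and deterministic.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=10):
    """Return a cluster index for each vector."""
    # Assumption: seed the centroids with the first k vectors (deterministic).
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy corpus: two political and two sports headlines.
docs = [
    "election vote parliament",
    "football match goal",
    "parliament vote law",
    "goal football league",
]

labels = kmeans(vectorize(docs), k=2)
print(labels)
```

With k=2 the two political headlines end up in one cluster and the two sports headlines in the other. Raising k to 4 puts every document in its own cluster, which is the kind of effect that tweaking the clustering parameters has on the generated set of clusters.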

