Speakers | |
---|---|
Frank Scholten | |
Schedule | |
Day | Saturday |
Room | AW1.124 |
Capacity | 59 |
Start time | 13:45 |
End time | 14:15 |
Duration | 00:30 |
Info | |
Track | Data Analytics devroom |
Introduction to Clustering with Mahout
Analyze and understand large corpora of text data using Apache Hadoop for distributed computation and Mahout as a distributed machine learning toolkit.
Clustering is a popular technique for analyzing and understanding large text corpora, and it is a key feature of, for instance, Google News. Google News automatically groups news articles into distinct clusters so visitors can quickly find what they're looking for. Clustering is also one of the key features of Apache Mahout, an open source framework for scalable data analysis intended to run on Hadoop. This talk introduces you to clustering, explains how it is implemented in Mahout, and shows you step by step how to cluster text documents using Mahout's command line interface. Additionally, it explains how to tweak the clustering process and how this affects the generated set of clusters. This is a beginner talk that introduces people to clustering in general and Mahout in particular.
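To give a feel for what clustering text means and how one tuning knob changes the result, here is a minimal sketch in plain Python. It is not Mahout code and does not use Mahout's command line interface. It assumes k-means (one of the algorithms Mahout provides) over simple word-count vectors, with a deterministic "first k documents" initialization chosen only to keep the example reproducible. The function names (`vectorize`, `kmeans`) and the sample documents are hypothetical.

```python
from collections import Counter


def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, max_iter=20):
    """Assign each vector to one of k clusters; returns a list of cluster ids."""
    # Assumption for illustration: seed the centroids with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # No vector changed cluster, so the result is stable.
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


# Hypothetical toy corpus: two politics headlines and two sports headlines.
docs = [
    "election vote parliament vote",
    "football match goal football",
    "parliament election debate",
    "goal match league",
]
vocab, vectors = vectorize(docs)

two_clusters = kmeans(vectors, k=2)
four_clusters = kmeans(vectors, k=4)
print(two_clusters)
print(four_clusters)
```

With `k=2` the politics and sports documents end up in separate groups, while `k=4` places every document in its own cluster. The number of clusters is only one of the parameters that shape the outcome; the distance measure and the way documents are turned into vectors matter as well.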
Next (up to 3) talks in the same room (AW1.124):
When | Event | Track |
---|---|---|
14:15-14:45 | Mapping Wikileaks' Cablegate using Python, mongoDB, Neo4J and Gephi | Data Analytics |
14:45-15:00 | Tools and Methods for Web Data Extraction | Data Analytics |
15:00-15:15 | Datalift, A catalyser for the Web of data | Data Analytics |