
Interview: Isabel Drost

Isabel Drost will give a talk about Apache Hadoop at FOSDEM 2010.

Could you briefly introduce yourself?

I am a co-founder of Apache Mahout, a project working on scalable machine learning implementations. Scalable here means scalable in terms of:

  • community - that is, gathering people interested in the topic to build a stable, lively community.
  • performance - the implementations need to scale to real-world dataset sizes. To achieve this, several algorithms have been implemented on top of Apache Hadoop.
  • use cases - the library is available under the Apache software license, a commercially friendly and liberal license.

In 2008 I started organizing the Apache Hadoop Get Together in Berlin and co-organized the first European NoSQL meetup in 2009.

What will your talk be about, exactly?

The talk will give an introduction to Apache Hadoop: Which use cases is it best suited for? How does it achieve high scalability while still being easy to use? What does a typical MapReduce program look like? Finally, the talk will give a brief overview of the Hadoop ecosystem - that is, projects that either extend Hadoop or make it easier to use and administer. The talk will conclude with current and future developments of Hadoop.
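
To give a flavour of the "typical MapReduce program" mentioned above, here is a minimal word-count sketch against the Hadoop MapReduce API (the classic introductory example); the class names and input/output paths are illustrative and not taken from the interview:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map step: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce step: sum all counts emitted for the same word.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The framework takes care of splitting the input, running map tasks close to the data, shuffling intermediate (word, count) pairs to the reducers and writing the results back to HDFS; the programmer only supplies the map and reduce functions.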

What do you hope to accomplish by giving this talk? What do you expect?

I would love to spread the knowledge of what Apache Hadoop is and what it can be used for. It would be nice to learn what people attending FOSDEM are doing with Apache Hadoop. Of course it would be great to meet current or future Apache Hadoop developers at FOSDEM. If you are interested in an informal Hadoop meetup after the NoSQL devroom day, feel free to contact me and I am sure we can put something together that involves good Belgian beer.

What problems does Apache Hadoop solve and how does it do it, from a bird's-eye view?

Apache Hadoop makes it easy to set up a cluster of machines and solve data-intensive tasks in parallel on commodity hardware. At its core it comes with two modules: HDFS is the distributed filesystem used to store the data on the cluster. As the probability of at least one machine failing increases with the number of machines in the cluster, HDFS comes with built-in support for replicating data within the cluster. The second module - the MapReduce engine - provides a very easy way to write parallel applications that digest the data stored in HDFS.
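
As a rough illustration of the HDFS side, here is a minimal sketch of a client writing and reading a file through Hadoop's FileSystem API; the namenode URI and the file path are assumptions made up for this example, and in a real deployment they come from the cluster configuration:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.io.OutputStream;
  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Hypothetical namenode address; normally picked up from the configuration.
      FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

      // Write a small file; HDFS replicates its blocks across datanodes
      // according to the configured replication factor.
      Path file = new Path("/tmp/hello.txt");
      OutputStream out = fs.create(file);
      out.write("hello hdfs\n".getBytes("UTF-8"));
      out.close();

      // Read it back through the same API, regardless of which
      // datanodes actually hold the blocks.
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"));
      System.out.println(in.readLine());
      in.close();
      fs.close();
    }
  }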

Hadoop is well suited for applications in the domain of text processing, data analysis and data mining - that is, tasks that involve large amounts of data you can go through fairly independently. Things like graphs and big social data structures remain tricky.

What's the history of the Hadoop project? How did it evolve?

You are probably familiar with Lucene, free software for indexing full-text documents. Nutch, a subproject of Lucene, has the goal of building an internet-scale search engine. To achieve this goal, in 2004 Doug Cutting and Mike Cafarella started implementing HDFS and MapReduce based on Google's publications on GFS and MapReduce. This part of Nutch quickly gained momentum. It was separated out into the Hadoop subproject and received considerable support from Yahoo!, which had decided that instead of reimplementing the algorithms or forking the project it would be most beneficial to use and improve Apache Hadoop and contribute improvements back to the project. Finally, in 2008 the project became an Apache top-level project in its own right.

Most people know that Facebook and Yahoo! are using Apache Hadoop. Can you name a few other prominent users of Hadoop and why they are using it?

Since its early stages in 2004 it has developed into a stable, well-known tool in the large-scale data analysis sector. Several users of Apache Hadoop have disclosed their usage on the project's PoweredBy wiki page. There are companies like the New York Times, Last.fm, Powerset (acquired by Microsoft), Facebook, Yahoo!, Amazon and AOL, to name just a few of the larger ones. There are also several universities using Apache Hadoop for research as well as for teaching large-scale parallel systems to students.

Can you describe some use cases for Hadoop? For which purposes is it suitable?

Last.fm uses it mostly for log analysis, the FOX Audience Network uses Apache Hadoop for log analysis and data mining, and at AOL it is used for behavioural analysis and targeting. Hadoop is well suited for analysing large amounts of data and processing text.

Is Yahoo! still the largest contributor to the project? Which other companies are big contributors? And how big is the developer community?

Yahoo! has given several of its developers the freedom to work on Apache Hadoop and contribute their improvements back to the project. However, the project has always also been driven by independent developers. Currently there are people from Last.fm and Facebook contributing their work. In addition, there are companies like Cloudera and 101tec that provide Apache Hadoop consultancy and as a result contribute their bug fixes and improvements back to the project.

Currently there are more than 20 committers in the core team; however, the number of contributors is considerably higher. Although Yahoo! provides lots of engineering support, the community welcomes people exploring new use cases and improvements. One area we would love help on is the work scheduling problem: there is a plugin API that lets people provide new schedulers. Yahoo! and Facebook have provided their schedulers, as have others, but as effective scheduling is a genuinely hard computer science problem, there is much room for improvement.

Looking at the companies using Apache Hadoop, I see that most of them have tens or even hundreds of terabytes of storage. Is Apache Hadoop also relevant for companies that don't sit on terabytes of data?

It certainly is. In recent months developers at mid-sized companies have presented their use cases in Berlin. In quite a lot of cases teams are working with clusters of five to ten nodes. The reason for using such setups is the low entry barrier to setting up and using Apache Hadoop. In addition, it is comparatively easy to scale the clusters up if the workload increases.

What do you think about HadoopDB, the hybrid database based on PostgreSQL and Hadoop that researchers at Yale University showed in the summer of 2009?

It looks like an interesting approach to combining the strengths of Hadoop with some of the feature set of relational databases. I'd like to see more support for MapReduce-style jobs in databases, and less criticism. Oracle published some information on this recently.

Which features can we expect in Hadoop in 2010?

  • A new version of Hadoop; 0.21 has taken a while to come out.
  • Some security, at least users and groups in the filesystem and job execution, though the network remains vulnerable, so you will still need to secure your Hadoop cluster's network.
  • A lot of the innovation in Hadoop is now happening on top of the base system: Mahout, Hama, HBase and Cassandra are all examples of this. We hope to see the further evolution of an Apache Hadoop ecosystem, something for everyone's data centre, real or virtual, owned or outsourced.

Have you enjoyed previous FOSDEM editions?

I have attended FOSDEM since 2007 - but only as a visitor, not as a speaker.

Creative Commons License
This interview is licensed under a Creative Commons Attribution 2.0 Belgium License.