FOSDEM '08 is a free and non-commercial event organised by the community, for the community. Its goal is to provide Free and Open Source developers a place to meet.

   

Schedule: Kettle: extracting, transforming and loading data

Speakers
Matt Casters
Schedule
Day Saturday
Room Ferrer
Start time 16:40
End time 16:55
Duration 00:15
Info
Event type Lightning-Talk
Track Lightning Talks
Language English
Media
Slides (PDF)
Video (Ogg/Theora)
Kettle: extracting, transforming and loading data

With this lightning talk, Matt wants to take a first stab at bridging the gap between two worlds: the general open source world and the Open Source Business Intelligence (BI) world. Unfortunately, it will not be possible to explain the whole field of BI in 15 minutes, so the focus will be on Matt's area of expertise: ETL and, more specifically, the Kettle project.

Kettle is basically an extremely fast tool to extract data from many different sources, such as web pages, Excel documents, databases, you name it. (E = Extraction in ETL)

We can transparently grab that data and then manipulate, transform or process that data in just about any possible way you can imagine. (T = Transformation in ETL)

Then, after combining data, changing it, mapping values and whatnot, you can store or load the result in whatever format or medium you like, including text files and all possible databases. (L = Loading in ETL)
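The three ETL stages described above can be sketched in a few lines of Python. This is only an illustration of the concept and not how Kettle works internally: Kettle is driven through a graphical interface, whereas the function names (`extract`, `transform`, `load`), the `traffic` table and the sample data below are all hypothetical and chosen purely for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample input standing in for a text-file source.
SAMPLE_CSV = "city,vehicles\nAntwerp, 120\nGhent,95\nantwerp,30\n"


def extract(text):
    """E: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """T: normalise city names, convert counts to integers, sum per city."""
    totals = {}
    for row in rows:
        city = row["city"].strip().title()
        totals[city] = totals.get(city, 0) + int(row["vehicles"])
    return sorted(totals.items())


def load(records, connection):
    """L: store the transformed records in a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS traffic "
        "(city TEXT PRIMARY KEY, vehicles INTEGER)"
    )
    connection.executemany("INSERT INTO traffic VALUES (?, ?)", records)
    connection.commit()


# Run the whole pipeline against an in-memory SQLite database.
connection = sqlite3.connect(":memory:")
load(transform(extract(SAMPLE_CSV)), connection)
result = connection.execute(
    "SELECT city, vehicles FROM traffic ORDER BY city"
).fetchall()
print(result)
```

In this sketch the inconsistent spellings "Antwerp" and "antwerp" are merged during the transformation stage, so the database ends up with one row per city.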

Some examples:

  • Load data from text files or XML files and store it in a database
  • Export of database(s) to text file(s) or other databases
  • Import of data into databases from sources ranging from text files to Excel sheets
  • Data migration between database applications
  • Exploration of data in existing databases (tables, views, etc.)
  • Information enrichment by looking up data in various information stores (databases, text files, Excel sheets and more)
  • Data cleaning by applying complex conditions in data transformations
  • Application integration
  • Data warehouse population with built-in support for slowly changing dimensions, junk dimensions and much, much more.

To do all these things we provide an easy-to-use graphical user interface. For a screenshot, see here: http://www.pentaho.com/images/transformation_screenshot.png

One of the first things that was done by Kettle (and myself, 4-5 years ago) was capturing the traffic data for the Flanders Traffic Centre (Vlaams Verkeerscentrum) in Wilrijk: http://www.verkeerscentrum.be/verkeersinfo/default

This source data is updated every minute and loaded from thousands of sources across the state. It is referenced, cleaned and loaded into a single terabyte database containing years of history and billions of rows of data. That database is then used for traffic pattern analyses, traffic density reports, planning of road works, etc.
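Of the examples listed earlier, slowly changing dimensions are perhaps the least self-explanatory. The sketch below shows the general idea of the commonly used "Type 2" variant, where a changed record is not overwritten but closed off and followed by a new version, so history is preserved. It is an assumption-laden illustration rather than Kettle's implementation: the function `apply_scd2`, the field names and the sample values are all hypothetical.

```python
from datetime import date


def apply_scd2(dimension, key, attributes, today):
    """Apply one incoming record to a Type 2 dimension (a list of dicts)."""
    # Find the currently valid version for this business key, if any.
    current = next(
        (row for row in dimension
         if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None and current["attributes"] == attributes:
        return dimension  # nothing changed, keep the current version
    if current is not None:
        current["valid_to"] = today  # close the old version
    dimension.append({
        "key": key,
        "attributes": attributes,
        "version": current["version"] + 1 if current else 1,
        "valid_from": today,
        "valid_to": None,  # None marks the currently valid version
    })
    return dimension


dimension = []
apply_scd2(dimension, "C1", {"city": "Wilrijk"}, date(2008, 1, 1))
apply_scd2(dimension, "C1", {"city": "Wilrijk"}, date(2008, 1, 15))  # unchanged
apply_scd2(dimension, "C1", {"city": "Antwerp"}, date(2008, 2, 1))   # changed
for row in dimension:
    print(row["version"], row["attributes"], row["valid_from"], row["valid_to"])
```

After these three calls the dimension holds two versions for key "C1": the first is closed on the date of the change, and the second remains open as the current version.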