FOSDEM is the biggest free and non-commercial event organized by and for the community. Its goal is to provide Free and Open Source developers a place to meet. No registration necessary.

Nils Grünwald
Day Saturday
Room AW1.124
Capacity 59
Start time 14:45
End time 15:00
Duration 00:15
Track Data Analytics devroom

Tools and Methods for Web Data Extraction

Panoramic view of the field of web data extraction with methods and libraries in various languages for efficient data extraction from HTML pages.

Websites are one of the main sources of data on the web. They can be classified into diverse categories (blogs, forums, news, etc.). Their data is usually available only as part of HTML pages, which makes it hard to extract rich information from them. It is, however, often very desirable to differentiate navigation menus from content, or to extract metadata such as the date, title, and author of the page content.
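As an illustration of the metadata case, here is a minimal sketch using only Python's standard `html.parser` module. It is not taken from the talk: the `MetadataParser` class, the `extract_metadata` function, and the list of meta tag names it looks for are assumptions made for this example, and real pages expose author and date in many other ways.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and a few <meta> tags (hypothetical helper)."""

    # Assumed mapping from meta name/property to an output field.
    # Real sites vary widely; this list is illustrative only.
    WANTED = {
        "author": "author",
        "date": "date",
        "article:published_time": "date",
    }

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = attrs.get("content")
            if key in self.WANTED and content:
                # Keep the first value found for each field.
                self.metadata.setdefault(self.WANTED[key], content.strip())

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            title = "".join(self._title_parts).strip()
            self.metadata.setdefault("title", title)

    def handle_data(self, data):
        # Title text may arrive in several chunks, so collect and join.
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Return a dict with whichever of title, author, date were found."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Calling `extract_metadata(page_source)` on a page returns only the fields that were actually present, so callers should treat every key as optional.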

This talk gives a panoramic view of the field of web data extraction. It will show the problems encountered and the benefits that can be expected from better analysis of web pages for a variety of applications. It will present the different methods and libraries, in various languages, that make it possible to design efficient data extraction systems, from simple text heuristics to more sophisticated visual and classifier-based approaches.
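To make "simple text heuristics" concrete, the sketch below shows one common idea: split the page into text blocks and keep those that are long enough and contain few link characters, on the assumption that navigation menus are short and link-heavy. It is an illustrative example, not the method presented in the talk. The names `BlockParser` and `extract_content`, the set of block-level tags, and the thresholds `min_words` and `max_link_density` are all assumptions chosen for this example.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a text block (illustrative only).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "h1", "h2", "h3",
              "nav", "article", "section", "br"}
SKIP_TAGS = {"script", "style"}


def _visible_len(text):
    """Number of non-whitespace characters in text."""
    return len("".join(text.split()))


class BlockParser(HTMLParser):
    """Splits a page into (text, link_chars) blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._link_depth = 0
        self._skip_depth = 0

    def _flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style contents
        self._text.append(data)
        if self._link_depth:
            self._link_chars += _visible_len(data)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=5, max_link_density=0.3):
    """Keep blocks with enough words and a low share of link text."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        density = link_chars / _visible_len(text)
        if len(text.split()) >= min_words and density <= max_link_density:
            kept.append(text)
    return kept
```

The thresholds trade precision for recall: raising `min_words` drops more menu items and captions but also short genuine paragraphs. The visual and classifier-based approaches mentioned above use richer signals than these two.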

Audience: Beginner to Intermediate.

