FOSDEM is the biggest free and non-commercial event organized by and for the community. Its goal is to provide Free and Open Source developers a place to meet. No registration necessary.

Nils Grünwald
Day: Saturday
Room: AW1.124
Capacity: 59
Start time: 14:45
End time: 15:00
Duration: 00:15
Track: Data Analytics devroom

Tools and Methods for Web Data Extraction

Panoramic view of the field of web data extraction with methods and libraries in various languages for efficient data extraction from HTML pages.

Websites are one of the main sources of data on the web, and they fall into diverse categories (blogs, forums, news, etc.). This data is usually available only as part of HTML pages, which makes it hard to extract rich information from it. It is, however, often very desirable to differentiate navigation menus from content, or to extract metadata such as the date, title, and author of the page content.
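As a rough illustration of the metadata problem described above, here is a minimal sketch using only Python's standard library. It assumes the page declares its author and date in `<meta name="author">` and `<meta name="date">` tags, which many real pages do not; the class name, function name, and sample page are hypothetical and are not taken from the talk.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Keep the declared value, keyed by the lower-cased meta name.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return title, author and date, or None where a field is not declared."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "author": parser.meta.get("author"),
        "date": parser.meta.get("date"),
    }


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Example post</title>
<meta name="author" content="Jane Doe">
<meta name="date" content="2012-02-04">
</head><body><p>Hello</p></body></html>"""

print(extract_metadata(SAMPLE_PAGE))
```

When a page carries no such tags, the fields come back as `None`, which is the case where heuristic or classifier-based extraction becomes necessary.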

This talk gives a panoramic view of the field of web data extraction. It will show the problems encountered and the benefits that can be expected from better analysis of web pages for a variety of applications. It will present the different methods and libraries, in various languages, that make it possible to design efficient data extraction systems, from simple text heuristics to more sophisticated visual and classifier-based approaches.
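One example of what a "simple text heuristic" might look like is link density: the share of a block's text that sits inside links. The sketch below is an assumption about the kind of technique meant, not the speaker's method. The block tag set, the 0.5 threshold, and all names are hypothetical choices for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "h1", "h2", "h3", "nav"}


class BlockParser(HTMLParser):
    """Split a page into text blocks and measure each block's link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # list of (text, link_density)
        self._text = []
        self._link_chars = 0
        self._a_depth = 0

    def flush(self):
        text = " ".join("".join(self._text).split())
        total = len(text.replace(" ", ""))  # non-whitespace characters
        if total:
            self.blocks.append((text, self._link_chars / total))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._a_depth:
            self._a_depth -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._a_depth:
            # Count only non-whitespace characters found inside links.
            self._link_chars += len("".join(data.split()))


def classify_blocks(html, threshold=0.5):
    """Label each block 'navigation' if most of its text is link text."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return [("navigation" if density > threshold else "content", text)
            for text, density in parser.blocks]


# Hypothetical page fragment used only for illustration.
SAMPLE = """<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<p>Web pages mix navigation and content. See the <a href="/faq">FAQ</a> for details.</p>"""

for label, text in classify_blocks(SAMPLE):
    print(label, "|", text)
```

A single fixed threshold is fragile on real pages. It stands in here for the simple end of the range, with visual and classifier-based methods at the other end.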

Audience: Beginner to Intermediate.

Next (up to 3) talks in the same room (AW1.124):

When         Event                                          Track
15:00-15:15  Datalift, A catalyser for the Web of data      Data Analytics
15:30-15:45  GDL - GNU Data Language                        Data Analytics
15:45-16:00  Informal discussion (original talk cancelled)  Data Analytics
