Speakers
	Nils Grünwald
Schedule
Day	Saturday
Room	AW1.124
Capacity	59
Start time	14:45
End time	15:00
Duration	00:15
Info
Track	Data Analytics devroom

Tools and Methods for Web Data Extraction

Panoramic view of the field of web data extraction with methods and libraries in various languages for efficient data extraction from HTML pages.

Websites are one of the main sources of data on the web. They fall into diverse categories (blogs, forums, news, etc.). Their data is usually available only as part of HTML pages, which makes it hard to extract rich information from it. It is, however, often very desirable to distinguish navigation menus from content, or to extract metadata such as the date, title, and author of the page content.
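As an illustration of the metadata problem described above, here is a minimal sketch that pulls the title, author, and date out of a page using only the Python standard library. It is not taken from the talk: the `MetadataParser` class, the `extract_metadata` function, and the set of `<meta>` names it looks for are assumptions made for the example, and real pages expose this information in many other ways.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and a few <meta name=...> values."""

    # Hypothetical set of meta names; real pages vary widely.
    WANTED = {"author", "date", "description"}

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in self.WANTED and content:
                # Keep the first value seen for each name.
                self.metadata.setdefault(name, content.strip())

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            title = "".join(self._title_parts).strip()
            self.metadata.setdefault("title", title)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Return a dict with whatever metadata the page declares."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Calling `extract_metadata(page_source)` returns a plain dictionary, so missing fields simply do not appear as keys. Pages that declare no such tags would need the heuristic or classifier-based methods the abstract mentions.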

This talk gives a panoramic view of the field of web data extraction. It will show the problems encountered and the benefits that a better analysis of web pages can bring to a variety of applications. It will present the different methods and libraries, in various languages, that make it possible to design efficient data extraction systems, from simple text heuristics to more sophisticated visual and classifier-based approaches.
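To make the "simple text heuristics" end of that spectrum concrete, the following sketch separates content from navigation by link density: a block of text is kept when it has enough words and only a small share of its characters sits inside links. This is one common heuristic, not necessarily one covered in the talk. The names `BlockParser` and `extract_content`, the list of block-level tags, and the two thresholds are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class BlockParser(HTMLParser):
    """Split a page into text blocks and measure each block's link density."""

    # Hypothetical, non-exhaustive list of tags that start a new block.
    BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table", "br",
                  "h1", "h2", "h3", "nav", "section", "article",
                  "header", "footer"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_density)
        self._parts = []
        self._link_chars = 0
        self._a_depth = 0
        self._skip_depth = 0

    def flush_block(self):
        """Close the current block and record it if it holds any text."""
        text = " ".join("".join(self._parts).split())
        if text:
            total = len(text.replace(" ", ""))
            self.blocks.append((text, self._link_chars / total))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._a_depth = max(0, self._a_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        if self._a_depth:
            # Count non-whitespace characters that sit inside a link.
            self._link_chars += len("".join(data.split()))


def extract_content(html, min_words=10, max_link_density=0.3):
    """Return blocks that look like content rather than navigation.

    Both thresholds are illustrative guesses, not tuned values.
    """
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    return [text for text, density in parser.blocks
            if len(text.split()) >= min_words and density <= max_link_density]
```

Menu entries are short and consist almost entirely of link text, so they fail both tests, while a paragraph with an occasional link passes. The visual and classifier-based approaches named in the abstract replace these fixed thresholds with layout features or learned decision rules.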

Audience: Beginner to Intermediate.

Concurrent events:

When	Event	Track	Where
14:00-14:50	The life of a Firefox feature	Web Browsing	Chavanne
14:00-14:50	DevOps? - More than Marketing	System	Janson
14:00-15:00	mk-configure	BSD	AW1.126
14:00-15:00	Using NixOS for declarative deployment and testing	CrossDistro	H.1302
14:00-15:00	Advanced Experiments with XMOS Multicore Embedded Hardware.	Embedded	Lameere
14:00-15:00	Swimming Upstream	CrossDistro	H.1308
14:00-15:45	LPI Exam 1	Certification	Guillissen
14:00-16:00	TYPO3 Exam Session	Certification	UA2.114
14:00-16:00	BSD Associate Exam Session	Certification	UA2.114
14:30-14:55	Over 20,000QPS, XtraDB performance show	MySQL & friends	H.2213
14:30-14:55	The Web Objects Kitchen	Mono	AW1.120
14:30-15:00	In-tab UI	Mozilla	H.1301
14:30-15:00	Stump the XMPP Experts! Open Q&A	Jabber & XMPP	AW1.121
14:40-14:55	0MQ: Multithreading magic	Lightning Talks	Ferrer
14:45-15:00	CyaSSL	Security & hardware crypto	AW1.105
14:45-15:00	EtoileText	World of GNUstep	AW1.117
14:45-15:30	Gallium state trackers applied to 2D rendering libraries	Crossdesktop	H.1309

Next (up to 3) talks in the same room (AW1.124):

When	Event	Track
15:00-15:15	Datalift, A catalyser for the Web of data	Data Analytics
15:30-15:45	GDL - GNU Data Language	Data Analytics
15:45-16:00	Informal discussion (original talk cancelled)	Data Analytics

Events that start after this one (within 30 minutes):

When	Event	Track	Where
15:00-15:15	Datalift, A catalyser for the Web of data	Data Analytics	AW1.124
15:00-15:15	Neo4j: Graph DB and Neo4j introduction	Lightning Talks	Ferrer
15:00-15:20	Making happy developers	Mono	AW1.120
15:00-15:20	Why Linux Distributions Hate Java	Free Java	AW1.125
15:00-15:25	Data Recovery for MySQL	MySQL & friends	H.2213
15:00-15:30	GNU Parallel	GNU	H.2214
15:00-15:30	CiviCRM & XMPP as your personal assistant	Jabber & XMPP	AW1.121
15:00-15:30	Fribid and browser security software	Security & hardware crypto	AW1.105
15:00-15:50	Qt WebKit goes Mobile	Web Browsing	Chavanne
15:00-15:50	systemd: Beyond init	System	Janson
15:00-16:00	Flukso - Community Metering.	Embedded	Lameere
15:00-16:00	Downstream packaging collaboration	CrossDistro	H.1308
15:00-16:00	Bringing OMAP3 into the FreeBSD Tree	BSD	AW1.126
15:00-16:00	Fast and Flexible UI Development with EtoileUI	World of GNUstep	AW1.117
15:00-16:00	Amazing openSUSE	CrossDistro	H.1302
15:20-15:35	iRail: Creating a public transport API	Lightning Talks	Ferrer
15:20-15:40	The Java Packaging Nightmare	Free Java	AW1.125
15:30-15:45	GDL - GNU Data Language	Data Analytics	AW1.124
15:30-15:45	Group Photo	Crossdesktop	H.1309
15:30-15:55	Taking Backups With XtraBackup	MySQL & friends	H.2213
15:30-16:00	EJBCA and OpenSC	Security & hardware crypto	AW1.105
15:30-16:00	Mono packaging in Debian and Ubuntu - why we're always right	Mono	AW1.120
15:30-16:00	XMPP Conf Calls: One Way or Another	Jabber & XMPP	AW1.121
15:30-16:15	Moving to the client - HTML5 is here	Mozilla	H.1301

fosdem.org