Speakers | |
---|---|
Nils Grünwald | |
Schedule | |
Day | Saturday |
Room | AW1.124 |
Capacity | 59 |
Start time | 14:45 |
End time | 15:00 |
Duration | 00:15 |
Info | |
Track | Data Analytics devroom |
Tools and Methods for Web Data Extraction
A panoramic view of the field of web data extraction, covering methods and libraries in various languages for efficiently extracting data from HTML pages.
Websites are one of the main sources of data on the web. They can be classified into diverse categories (blogs, forums, news, etc.). Their data is usually available only as part of HTML pages, which makes it hard to extract rich information from them. It is, however, often desirable to distinguish navigation menus from content, or to extract metadata such as the date, title, and author of the page content.
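As an illustration of the metadata extraction mentioned above, here is a minimal sketch using only Python's standard `html.parser` module. It is not taken from the talk: the class name `MetadataParser`, the function `extract_metadata`, and the set of `<meta>` names it looks for are all hypothetical choices, and real pages expose metadata in many other ways.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and a few <meta name=...> values."""

    # Assumption: these meta names are illustrative; real sites vary widely.
    WANTED = {"author", "date", "description"}

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in self.WANTED and content:
                # Keep the first value seen for each name.
                self.metadata.setdefault(name, content.strip())

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            title = "".join(self._title_parts).strip()
            if title:
                self.metadata.setdefault("title", title)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Return a dict with whichever of title/author/date/description were found."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

Calling `extract_metadata` on a page string returns a plain dictionary; keys that the page does not declare are simply absent, which is the common case in practice and the reason more robust heuristics are needed.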
This talk gives a panoramic view of the field of web data extraction. It shows the problems encountered and the benefits that can be expected from better analysis of web pages for a variety of applications. It presents the different methods and libraries, in various languages, that make it possible to design efficient data extraction systems, from simple text heuristics to more sophisticated visual and classifier-based approaches.
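To make the "simple text heuristics" end of that spectrum concrete, the sketch below separates content from navigation using text length and link density per block. It is an assumed, simplified heuristic and not the method presented in the talk; the names `BlockCollector` and `extract_content`, the list of block tags, and the thresholds `min_chars` and `max_link_density` are hypothetical and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative set of tags that delimit text blocks.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table",
              "h1", "h2", "h3", "nav", "header", "footer", "body"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records each block's link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # (text, non-whitespace chars, link density)
        self._text = []
        self._chars = 0       # non-whitespace characters in the current block
        self._link_chars = 0  # of which inside <a> elements
        self._in_link = 0
        self._skip = 0

    def flush(self):
        if self._chars:
            text = " ".join("".join(self._text).split())
            self.blocks.append((text, self._chars, self._link_chars / self._chars))
        self._text, self._chars, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        n = len("".join(data.split()))
        self._text.append(data)
        self._chars += n
        if self._in_link:
            self._link_chars += n


def extract_content(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return [text for text, chars, density in collector.blocks
            if chars >= min_chars and density <= max_link_density]
```

Navigation menus tend to be short blocks made almost entirely of links, so they fail both tests, while paragraphs of running text pass. Visual and classifier-based approaches replace these fixed thresholds with rendered layout features or learned decision rules.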
Audience: Beginner to Intermediate.
Next (up to 3) talks in the same room (AW1.124):
When | Event | Track |
---|---|---|
15:00-15:15 | Datalift, A catalyser for the Web of data | Data Analytics |
15:30-15:45 | GDL - GNU Data Language | Data Analytics |
15:45-16:00 | Informal discussion (original talk cancelled) | Data Analytics |