FOSDEM 2015

Searching over streams with Luwak and Apache Samza

Track: Open source search devroom
Room: UA2.114 (Baudoux)
Day: Saturday
Start: 14:50
End: 15:35

Traditional search takes the form of individual queries over large, mostly stable corpora of documents. In this talk, we'll show how we invert this paradigm to search over streams of documents by combining Samza, a distributed stream-processing framework, with Luwak, a library for efficiently running large numbers of queries over individual documents.

Real-time searching over streams is useful in a number of contexts. For example, a company may want to detect whenever it is mentioned in a news feed, or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.

Luwak provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by filtering out queries that it can detect will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.
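The filtering idea can be illustrated with a toy sketch. This is not Luwak's API: the `ToyMonitor` class, its methods, and the restriction to simple "all terms must appear" queries are assumptions made purely for illustration. Each query is filed under one of its required terms, so a document that lacks that term never triggers an evaluation of the query.

```python
from collections import defaultdict


class ToyMonitor:
    """Hypothetical in-memory query store; an illustration, not Luwak itself."""

    def __init__(self):
        self.queries = {}                 # query id -> set of required terms
        self.by_term = defaultdict(set)   # term -> ids of queries filed under it

    def register(self, query_id, terms):
        """Register a conjunctive query: every term must appear in the document."""
        required = {t.lower() for t in terms}
        if not required:
            raise ValueError("a query needs at least one term")
        self.queries[query_id] = required
        # File the query under a single required term. A document that
        # lacks this term cannot match, so the query can safely be skipped.
        self.by_term[min(required)].add(query_id)

    def match(self, text):
        """Return (sorted ids of matching queries, number of queries evaluated)."""
        doc_terms = set(text.lower().split())
        # Filtering step: collect only queries filed under a term in the document.
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Full evaluation runs on the candidates only.
        matches = sorted(q for q in candidates if self.queries[q] <= doc_terms)
        return matches, len(candidates)


monitor = ToyMonitor()
monitor.register("q1", ["apache", "samza"])
monitor.register("q2", ["luwak"])
monitor.register("q3", ["brussels", "fosdem"])

matches, evaluated = monitor.match("Apache Samza processes streams")
print(matches, evaluated)
```

In this example only one of the three registered queries is evaluated against the document. A real implementation has to handle far richer query types, which this sketch does not attempt.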

Samza provides a framework for such parallelization by partitioning and recombining both the document streams and the query set (which can be treated as just another stream). It also provides fault-tolerance mechanisms that allow swift recovery from machine failure without losing documents or queries.
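One possible partitioning arrangement can be sketched in a few lines. This does not use Samza's API, and the function names are hypothetical. It assumes the query set is split across partitions by a stable hash of the query id, every document is sent to every partition, and the per-partition matches are merged afterwards. Other arrangements, such as partitioning the documents and copying the queries to every partition, fit the same description.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition for a key using crc32, which is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_queries(queries, num_partitions):
    """Split {query_id: set of required terms} into one dict per partition."""
    partitions = [dict() for _ in range(num_partitions)]
    for query_id, terms in queries.items():
        partitions[partition_for(query_id, num_partitions)][query_id] = terms
    return partitions


def match_partition(partition, doc_terms):
    """Match one document against the queries held by a single partition."""
    return [qid for qid, terms in partition.items() if terms <= doc_terms]


def match_document(partitions, text):
    """Send the document to every partition and recombine the matches."""
    doc_terms = set(text.lower().split())
    matches = []
    # In a distributed deployment each partition would live on its own
    # machine; here they are simply visited in turn.
    for partition in partitions:
        matches.extend(match_partition(partition, doc_terms))
    return sorted(matches)


queries = {
    "q1": {"luwak"},
    "q2": {"samza", "fosdem"},
    "q3": {"twitter"},
}
partitions = partition_queries(queries, 3)
print(match_document(partitions, "Luwak and Samza at FOSDEM"))
```

Because each query lives in exactly one partition, the merged result is the same as matching against the unpartitioned query set, while each partition holds only part of the queries in memory. Fault tolerance is outside the scope of this sketch.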

Speakers

Alan Woodward

FOSDEM15

Brussels / 31 January & 1 February 2015
