Brussels / 31 January & 1 February 2015


EBISearch - Biological data search engine

EBISearch is a text search engine providing access to biological data resources hosted at EMBL-EBI. We will talk about the history of this engine, the infrastructure it runs on, some statistics and future plans.

The data are held in Lucene indexes organized in a hierarchy, and we offer easy 'inter-domain' (or, in Lucene terms, 'inter-index') navigation via a network of cross-references.

The data resources represented in EBISearch include biological sequences, chemicals and macro-molecular structures, bio-medical literature abstracts, and meta-information related to biological entities (e.g. genes, transcripts and proteins).

The evolution and development of EBISearch are shaped by the EMBL-EBI IT infrastructure, which is designed to cope with large amounts of data and reflects technical choices about data storage on network/distributed filesystems and the use of heterogeneous types of hosts.

The EBISearch engine provides search across ~1.1bn documents, kept up to date with the latest available biological data. We will provide some statistics about the amount of data we index, the parallelisation of indexing and the lifecycle of these indexes.

At search time the engine organizes the indexes in a hierarchy, and searches are executed across most domains. This allows us to benefit from homogeneous scoring across the indexes. We rely heavily on facets (Lucene taxonomies) for filtering results.

EBISearch usage is monitored and analyzed through our application logs as well as web logs. We will present some usage statistics and discuss the themes we are looking into; the focus is on understanding usage patterns to drive our next developments.

In the future, we need to explore how to cope with the increasing volume of data being generated in the bio-medical fields and how to handle requests for new functionality. For these reasons we will investigate other technologies and existing search engines based on Lucene. We are also interested in different types of data visualization, and we will briefly present what we have done in this area.


Nicola Buso