Brussels / 30 & 31 January 2016


IXA pipes: Easy and ready use NLP tools for language communities

Free NLP tools for several languages, including Basque, Galician, Spanish


IXA pipes (http://ixa2.si.ehu.es/ixa-pipes/) is a modular set of Natural Language Processing tools (or pipes) that provides easy access to NLP technology for several languages. It offers robust and efficient linguistic annotation to both researchers and non-NLP experts, with the aim of lowering the barriers to using NLP technology, whether for research purposes or for small industrial developers and SMEs. The IXA pipes can be used as they are, or their modularity can be exploited to pick and change individual components. Every IXA pipe can be up and running after two simple steps. The tools require Java 1.7+ to run and are designed to come with all batteries included, which means that no system configuration or third-party dependencies are required. The modules will run on any platform as long as a JVM 1.7+ is available.

IXA pipes are simply a set of processes chained by their standard streams, so that the output of each process feeds directly as input to the next one. The Unix pipes metaphor has been applied to NLP tools by adopting a very simple and well-known data-centric architecture, in which every module/pipe is interchangeable with any other tool as long as it reads and writes the required data format via the standard streams.
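The chaining idea can be sketched in a few lines of Python. This is only an illustration of processes connected through their standard streams, not part of IXA pipes: the function name `run_pipeline` and the two stand-in stages `UPPER` and `EXCLAIM` are invented for the example, and tiny Python one-liners take the place of the real modules.

```python
import subprocess
import sys
import threading


def run_pipeline(commands, data):
    """Chain the commands so each one reads the previous one's stdout."""
    if not commands:
        return data
    procs = []
    upstream = subprocess.PIPE  # the first stage is fed by this script
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            # The new child holds its own copy of this pipe end; close ours.
            procs[-1].stdout.close()
        procs.append(proc)
        upstream = proc.stdout

    def feed():
        # Write in a separate thread so that reading and writing cannot
        # block each other on large inputs.
        procs[0].stdin.write(data)
        procs[0].stdin.close()

    writer = threading.Thread(target=feed)
    writer.start()
    output = procs[-1].stdout.read()
    writer.join()
    for proc in procs:
        proc.wait()
    return output


# Hypothetical stand-in stages: each reads stdin and writes stdout,
# just as an interchangeable pipe module would.
UPPER = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
EXCLAIM = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read() + '!')"]

if __name__ == "__main__":
    print(run_pipeline([UPPER, EXCLAIM], b"ixa pipes").decode())
```

Because every stage only sees its standard input and output, any stage can be swapped for another program that speaks the same data format. In practice the command lists would hold the invocations documented in each module's README.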

The data format used to represent and pipe linguistic annotations, for both the input and the output of the modules, is NAF. We currently cover tokenization, POS tagging, lemmatization, Named Entity Recognition and Classification, and probabilistic parsing, but further annotations and languages can easily be added. The tools are distributed under the Apache License 2.0.
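To give a feel for what a layered annotation document looks like to a downstream tool, here is a small Python sketch that reads a hand-written, heavily simplified fragment in the style of NAF. The element and attribute names (`wf`, `term`, `span`, `target`, `lemma`, `pos`) reflect my understanding of NAF's text and terms layers and should be treated as assumptions; the sample content and the function name `read_terms` are invented for illustration, and the NAF specification remains the authority.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample (an assumption, not real tool output).
NAF_SAMPLE = """
<NAF xml:lang="en" version="v3">
  <text>
    <wf id="w1" sent="1">Pipes</wf>
    <wf id="w2" sent="1">work</wf>
  </text>
  <terms>
    <term id="t1" lemma="pipe" pos="N"><span><target id="w1"/></span></term>
    <term id="t2" lemma="work" pos="V"><span><target id="w2"/></span></term>
  </terms>
</NAF>
"""


def read_terms(naf_xml):
    """Return (surface form, lemma, pos) for every term in the document."""
    root = ET.fromstring(naf_xml.strip())
    # The text layer maps word-form ids to the tokens themselves.
    words = {wf.get("id"): wf.text for wf in root.iter("wf")}
    terms = []
    for term in root.iter("term"):
        # A term points back at one or more word forms through its span.
        surface = " ".join(words[t.get("id")] for t in term.iter("target"))
        terms.append((surface, term.get("lemma"), term.get("pos")))
    return terms


if __name__ == "__main__":
    for surface, lemma, pos in read_terms(NAF_SAMPLE):
        print(surface, lemma, pos)
```

The point of the sketch is the layering: a later module adds its own layer and refers to earlier layers by id, so each tool can stay ignorant of how the previous annotations were produced.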

I would prefer to keep the theoretical part as short as possible and do some practical work with the modules. To save time, it would be nice (although not compulsory) if attendees came with a laptop with the following components installed:

  • Java Development Kit 1.7+
  • Apache Maven 3.+
  • git
  • Datasets/Corpora such as:
      • Universal Dependencies http://universaldependencies.github.io/docs/
      • CoNLL 2002 NER data http://www.clips.uantwerpen.be/conll2002/ner/

The idea is to download, compile, tag texts and train your own models in a very short time using IXA pipes.
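For those who want to inspect the CoNLL 2002 NER data before the session, the following Python sketch reads its column layout: one token per line, the entity tag in the last column, and a blank line between sentences. The function name `read_conll` and the sample sentences are invented for illustration and are not taken from the corpus; this is also not how IXA pipes itself loads training data.

```python
def read_conll(lines):
    """Group (token, tag) pairs into sentences separated by blank lines."""
    sentences = []
    current = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document boundary markers carry no token.
            continue
        columns = line.split()
        # The first column is the token, the last one the entity tag;
        # any columns in between (such as POS tags) are ignored here.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the two-column layout (not real corpus content).
SAMPLE = """Rodrigo B-PER
Agerri I-PER
visita O
Bruselas B-LOC
. O

Hola O
"""

if __name__ == "__main__":
    for sentence in read_conll(SAMPLE.splitlines()):
        print(sentence)
```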

Speakers

Rodrigo Agerri
