Welcome to the MACA project

MACA (Morphological Analysis Converter and Aggregator) is a simple configurable module for
  • tokenisation and sentence splitting – including SRX support
  • morphological analysis (attachment of MSD tags and lemmas to tokens) based on multiple sources of information,
  • simple tagset conversion.

MACA is not a tagger, a lemmatiser or a parser. Multiword units as such are not supported; the analysis is performed at the token level.

MACA provides:
  • a set of C++ shared libraries with a simple API (useful for writing NLP applications such as taggers and parsers),
  • a couple of command-line tools that perform simple tasks by using the API, including a tokeniser, a morphological analyser and a tagset converter.
    The processing tools operate on streams, enabling pipeline processing.

Acknowledgement: this work is financed by Innovative Economy Programme project POIG.01.01.02-14-013/09.

Why do I need MACA?

Some typical usage scenarios:
  • analyse plain text or XML (possibly very large) using Morfeusz SGJP, Polimorf or Morfeusz SIaT and get a valid XCES corpus as output (Morfeusz itself can't do that)
  • analyse input using user-supplied morphological dictionaries (simple tab-separated txt files); the dictionaries may be compiled into SFST transducers
  • override Morfeusz with user-supplied dictionaries and/or use dictionaries when Morfeusz fails (complex processing pipelines may be defined)
  • convert XCES corpora into simple greppable plain-text format (or back)
  • simple tagset conversions, including joining words split by Morfeusz and assigning sensible tags to the joined units (and conversion back to the original tagset)
  • tagset simplification, e.g. casting tags to POS/wordclass only
  • divide text into sentences using available SRX rules (this work is actually done by Toki, our tokeniser)
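
The "use dictionaries when Morfeusz fails" scenario can be pictured with a small Python sketch. This is not MACA's API or configuration format: the three-column layout (form, lemma, tag), the function names and the stand-in analysers are all assumptions made for illustration, and "ign" is used here as the tag for unrecognised forms.

```python
def load_dictionary(lines):
    """Read tab-separated rows into {form: [(lemma, tag), ...]}.

    The column order (form, lemma, tag) is an assumption of this sketch.
    """
    entries = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        form, lemma, tag = fields
        entries.setdefault(form, []).append((lemma, tag))
    return entries


def analyse(token, analysers):
    """Return the readings from the first analyser that recognises the token."""
    for analyser in analysers:
        readings = analyser(token)
        if readings:
            return readings
    return [(token, "ign")]  # no source recognised the form


# Hypothetical sources: a primary analyser that knows nothing,
# and a fallback backed by a one-line user dictionary.
user_dict = load_dictionary(["MACA\tMACA\tsubst:sg:nom:f\n"])


def primary(token):
    return []


def fallback(token):
    return user_dict.get(token, [])
```

Calling analyse("MACA", [primary, fallback]) consults the sources in order, so placing the user dictionary first would override the primary analyser instead of backing it up.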

Toki, the tokeniser and SRX sentence splitter, may be used separately. MACA and Toki work in the same fashion: user-supplied configuration files define their behaviour.

MACA is bundled with several useful configurations and morphological data, including:
  1. morfo1222-ikipi: free morphological analyser operating on the IKIPI/IPIC tagset (requires SFST plugin; see Morfologik converted)
  2. morfeusz: wrapper around Morfeusz SIaT, outputting in the unchanged Morfeusz SIaT tagset (requires Morfeusz SIaT)
  3. morfeusz-kipi: wrapper around Morfeusz SIaT, performing a slight conversion into the exact IPIC tagset (as required by TaKIPI)
  4. morfeusz-kipi-guesser: as above, but also using TaKIPI's guesser for unknown forms (requires libcorpus1, which is bundled with TaKIPI)
  5. morfeusz-nkjp: wrapper around Morfeusz SGJP, performing analogous conversion into the actual NKJP tagset (requires Morfeusz SGJP)
  6. morfeusz2-nkjp: wrapper around Morfeusz SGJP v. 2.0 (requires Morfeusz SGJP v. 2.0)
  7. polimorf-nkjp: wrapper around Morfeusz Polimorf with conversion into actual NKJP tagset (Morfeusz Polimorf must be installed).

NEW: conversion from NKJP to KIPI tagset is provided! Also, morfsgjp-kipi is able to use Morfeusz SGJP and output in KIPI tagset straight away. Read more about Morfeusz versions and configurations.

Besides, there are several useful configurations of the Toki tokeniser and tagset conversion routines. Note: the morphological dictionaries included in the package are licensed under a different licence: GNU LGPL or Creative Commons ShareAlike (the user is free to choose). See the README files in the package.

Features and limitations

The suite consists of the following modules:
  • Corpus2: corpus I/O and configurable tagset handling (corpus2 library, tagset-tool and corpus-get utilities)
  • tokenisation and SRX sentence splitting (toki library, toki-app util)
  • morphological analysis pipelines with multiple information sources (maca library, maca-analyse util)
  • simple tagset conversion (maca library, maca-convert util)
The morphological analysis component is targeted at the Polish language. This imposes the following constraints:
  • The MSD tags currently must be positional, i.e. each tag consists of a wordclass mnemonic and an optional sequence of attribute values.
  • The string representation of tags is assumed to follow the IPIC-like colon-separated template (dot and underscore shorthands are supported). This is fully compliant with IPIC (in all its variants) and the new NKJP tagset, and technically compliant [1] with the tagset of Morfologik.
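
To make the positional, colon-separated template concrete, here is a minimal Python sketch. It does not use corpus2 or MACA; the function names are hypothetical. It assumes that a dot separates alternative values of one attribute (as in "subst:sg:nom.acc:m3") and leaves the underscore shorthand alone, because expanding it would require the tagset definition.

```python
from itertools import product


def expand_tag(tag):
    """Expand dot shorthand into fully specified tags.

    'subst:sg:nom.acc:m3' becomes one tag per listed case value.
    """
    fields = [field.split(".") for field in tag.split(":")]
    return [":".join(combo) for combo in product(*fields)]


def split_tag(tag):
    """Split a fully specified tag into (wordclass, list of attribute values)."""
    wordclass, *values = tag.split(":")
    return wordclass, values


def wordclass_only(tag):
    """Cast a tag to its wordclass mnemonic (a simple tagset simplification)."""
    return tag.split(":", 1)[0]
```

For example, expand_tag("subst:sg:nom.acc:m3") yields two tags that differ only in case, and wordclass_only reduces either of them to "subst".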

The user is free to create various configuration files that define behaviour of the processing pipeline. Different sets of morphological analysers may be tied to token labels distinguished during tokenisation. The software supports plain text dictionaries, SFST transducers and Morfeusz.

[1] The tagset of Morfologik is not fully specified. A number of variants exist for many grammatical classes, e.g. most nouns are specified for grammatical case but not all of them. Technically, if we define many tagset attributes as optional (including those that intuitively should be obligatory), we can represent all the tags of Morfologik in the desired format.

Documentation

If you are interested in using the existing configurations, proceed to the User_guide.

Instructions on writing custom configurations and compiling morphological data are provided in the doc subdirectory.

If you need to build your own dictionary, this tutorial will be helpful: Custom_dictionaries.

For a verbose description, including background, motivation and typical usage scenarios, see the paper: Adam Radziszewski and Tomasz Śniatowski, “Maca: a configurable tool to integrate Polish morphological data”, FreeRBMT11

@inproceedings{maca,
  author = {Adam Radziszewski and Tomasz \'{S}niatowski},
  title = {Maca --- a configurable tool to integrate {P}olish morphological data},
  booktitle = {Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation},
  year = {2011},
  location = {Barcelona, Spain}
}

If you want to use Maca's Python wrappers, please see the example Python code included in the doc subdirectory. It is recommended to use the dlopenflags hack to get the SWIG module to work properly. The following lines should be added before importing maca:

import ctypes, sys
sys.setdlopenflags(sys.getdlopenflags() | ctypes.RTLD_GLOBAL)

Without the hack, SWIG apparently has problems with loading dynamic libraries with dlopen (e.g. you might get an error message that some plugin, e.g. morfeusz, could not be found, even though it is installed properly and works with the maca-analyse app).
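
The two lines above can also be wrapped in a small helper, sketched below for Unix-like systems. The helper name is hypothetical; it only combines sys.setdlopenflags with importlib and does not call any Maca function. The flags are left set after the import, as in the original snippet.

```python
import ctypes
import importlib
import sys


def import_with_global_symbols(module_name):
    """Import a module after adding RTLD_GLOBAL to the dlopen flags.

    Shared libraries loaded during the import then export their symbols
    globally, which is what the dlopenflags hack relies on.
    """
    sys.setdlopenflags(sys.getdlopenflags() | ctypes.RTLD_GLOBAL)
    return importlib.import_module(module_name)


if __name__ == "__main__":
    try:
        maca = import_with_global_symbols("maca")
    except ImportError as err:
        # The wrappers are optional; report the problem instead of crashing.
        print("maca wrappers not available:", err)
```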

Morfologik converted

We have performed a massive conversion of the Morfologik dictionary into the IPIC tagset (strictly speaking, into an intermediate tagset but a conversion routine is provided). Although the conversion has been performed cautiously, the resulting data are not perfect and may contain a number of errors.

The resulting morphological dictionary is compiled into transducers and distributed with Maca (morfo1222-ikipi configuration). For documentation and source file downloads, visit Morfologik converted.

Download and install MACA

NEW: detailed installation instructions for Ubuntu 11.10

MACA consists of the following packages:
  1. libpwrutils: common utilities used by the other libraries (contained within corpus2 repository)
  2. corpus2: shared library for data structures and XML corpus reading/writing, tagset routines and tagset inspection util (tagset-tool)
  3. toki: configurable tokeniser; library and toki-app util
  4. maca: configurable morphological analyser library with tagset conversion support, maca-analyse and maca-convert utils
Besides, you will need the following dependencies:
  1. CMake 2.8 or later (to build the above packages)
  2. ICU (libicu-dev)
  3. Boost libraries, 1.41 or later (libboost1.42-all-dev; tested with 1.41 and 1.42)
  4. Loki (libloki-dev)
  5. LibXML++ (libxml++2.6-dev)
  6. bison and flex
Plugins require other libraries that are optional:
  • The SFST plugin (recommended) requires the SFST library (how to install; GPL)
  • The Morfeusz plugin (highly recommended for Polish) requires the Morfeusz SGJP library and header (2-clause BSD)
  • The Guesser plugin requires the Guesser package from TaKIPI / Corpus (GPL)

If you need Python support for corpus2 (recommended), you need SWIG and Python installed with headers (in Ubuntu, look for swig and python-dev packages).

To install Morfeusz SGJP, download the package appropriate for your system from the SGJP site. If you use the binary version, make sure to select the proper version (32- or 64-bit), install the library files (.so) into a system directory (e.g. /usr/local/lib) and the header into an include directory (e.g. /usr/local/include), then run ldconfig to update the system library information. NOTE: we haven't tested Morfeusz Polimorf and can't guarantee that it will work with the existing MACA configurations (there is no tagset information on the SGJP site, so it may operate on another tagset and hence need other .ini and .conv files to work).
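
After copying the files and running ldconfig, one quick way to check that the dynamic linker can see a library is Python's ctypes.util.find_library. The sketch below is only a sanity check and not part of MACA; the helper name is hypothetical, and the library name "morfeusz" (i.e. libmorfeusz.so) is an assumption that may differ between Morfeusz versions.

```python
from ctypes.util import find_library


def missing_libraries(names):
    """Return those library names that the system cannot locate."""
    return [name for name in names if find_library(name) is None]


if __name__ == "__main__":
    # "morfeusz" is an assumed library name; adjust it to your installation.
    missing = missing_libraries(["morfeusz"])
    if missing:
        print("not found:", ", ".join(missing))
    else:
        print("all libraries found")
```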

Licence. This software (without the SFST and Guesser plugins) is licensed under GNU LGPL 3.0.
Including the SFST and/or Guesser plugin in MACA changes its licence: the whole code is then licensed under GNU GPL 3.0.
The SFST plugin provides support for user-supplied dictionaries compiled into transducers. The Guesser plugin enables tag guessing for forms not found elsewhere.

Toki includes Marcin Miłkowski's SRX rules for sentence splitting. These rules (segment.srx) are licensed under GNU LGPL.

Maca includes a couple of free morphological dictionaries developed and converted by Adam Radziszewski and Marek Maziarz. These dictionaries are licensed under either GNU LGPL or Creative Commons ShareAlike (the user is free to choose). The “source code” of large dictionaries (tab-separated text files) can be found here.

It is recommended to obtain the most recent sources from Git repositories:

git clone http://nlp.pwr.wroc.pl/corpus2.git
git clone http://nlp.pwr.wroc.pl/toki.git
git clone http://nlp.pwr.wroc.pl/maca.git

After installing the required dependencies, you must install each package separately (i.e. corpus2, toki and then maca), e.g.:

mkdir corpus2/bin
cd corpus2/bin
cmake ..
# confirm the default values with ENTER
# check the output; if some required dependencies are missing, install the lacking packages, remove the CMakeCache.txt file and re-run cmake
make
sudo make install
sudo ldconfig
# optionally make test
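
Since the same steps are repeated for corpus2, toki and maca in that order, the whole sequence can be generated rather than typed three times. The Python sketch below only prints the commands (a dry run); the function names are hypothetical, and mkdir -p is used instead of plain mkdir so that an existing bin directory is not an error.

```python
PACKAGES = ["corpus2", "toki", "maca"]  # build order given above


def build_commands(package):
    """Return (argv, working directory) pairs for building one package."""
    build_dir = f"{package}/bin"
    return [
        (["mkdir", "-p", build_dir], "."),
        (["cmake", ".."], build_dir),
        (["make"], build_dir),
        (["sudo", "make", "install"], build_dir),
        (["sudo", "ldconfig"], build_dir),
    ]


def plan(packages=PACKAGES):
    """Flatten the per-package steps into one ordered list."""
    return [step for package in packages for step in build_commands(package)]


if __name__ == "__main__":
    # Dry run: print each step instead of executing it.
    for argv, cwd in plan():
        print(f"(cd {cwd} && {' '.join(argv)})")
```

To execute the plan instead of printing it, each pair could be passed to subprocess.run(argv, cwd=cwd, check=True), which stops at the first failing step.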

Reporting bugs and feature requests

How to report issues?

Related document drafts

  1. Tagset pośredni (IKIPI description in Polish)
  2. Akronimy i apostrofy (semi-automatic acquisition of morphological data for acronyms and expressions with apostrophes, working report)
