Linguistic Data

FreeLing distributions include linguistic data for the supported languages.

One of the core data files for any language is the morphological dictionary. FreeLing dictionaries are obtained from different open-source external projects, so they have different copyright holders and different licenses from the rest of the FreeLing packages. Please check the COPYING file for details about which license applies to each dictionary.

Although the dictionaries for some Romance languages may seem small, many more forms than those listed in the dictionary are recognized thanks to an affix analysis module that detects, among others, verb forms with enclitic pronouns and forms with diminutive or augmentative suffixes. In addition, a probabilistic suffix-based guesser assigns word categories to words not found in the dictionary.
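
As an illustration of the general idea behind suffix-based category guessing, here is a minimal sketch in Python. It is not FreeLing's implementation: the toy lexicon, the tag names, the function names (train_suffix_model, guess_tags), and the longest-suffix strategy are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy lexicon of (form, tag) pairs. Hypothetical data for illustration only,
# not taken from any FreeLing dictionary.
TOY_LEXICON = [
    ("walking", "VERB"), ("talking", "VERB"), ("running", "VERB"),
    ("building", "NOUN"),
    ("quickly", "ADV"), ("slowly", "ADV"), ("happily", "ADV"),
    ("nation", "NOUN"), ("station", "NOUN"), ("relation", "NOUN"),
]


def train_suffix_model(lexicon, max_len=4):
    """Count how often each word-final substring (up to max_len) occurs with each tag."""
    counts = defaultdict(Counter)
    for form, tag in lexicon:
        for n in range(1, min(max_len, len(form)) + 1):
            counts[form[-n:]][tag] += 1
    return counts


def guess_tags(word, counts, max_len=4):
    """Return {tag: probability} based on the longest suffix seen in training."""
    for n in range(min(max_len, len(word)), 0, -1):
        seen = counts.get(word[-n:])
        if seen:
            total = sum(seen.values())
            return {tag: c / total for tag, c in seen.items()}
    return {}  # no known suffix: this sketch makes no guess


model = train_suffix_model(TOY_LEXICON)
print(guess_tags("jumping", model))  # {'VERB': 0.75, 'NOUN': 0.25}
```

Here "jumping" is not in the toy lexicon, so the guess comes from the suffix "ing", which was seen three times as a verb and once as a noun. A real guesser would be trained on a full dictionary and would typically smooth or combine evidence from suffixes of several lengths rather than rely on the longest match alone.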

  • The Asturian dictionary contains more than 140,000 forms corresponding to some 40,000 lemma-PoS combinations.
  • The Catalan dictionary contains more than 520,000 forms corresponding to 71,000 lemma-PoS combinations.
  • The Welsh dictionary contains some 345,000 forms, corresponding to over 8,500 lemma-PoS combinations.
  • The German dictionary contains nearly 395,000 forms, corresponding to almost 130,000 lemma-PoS combinations.
  • The English dictionary was automatically extracted from
    WSJ and other corpora, followed by careful manual post-editing and completion.
    It contains about 68,000 forms corresponding to some 37,000 different
    lemma-PoS combinations.
  • The Spanish dictionary contains over 555,000
    forms corresponding to more than 76,000 lemma-PoS combinations.
  • The French dictionary contains over 350,000
    forms corresponding to more than 54,000 lemma-PoS combinations.
  • The Galician dictionary contains some 430,000 forms, corresponding to nearly 50,000 lemma-PoS combinations.
  • The Italian dictionary contains over 360,000 forms corresponding to 40,000 lemma-PoS combinations.
  • The Norwegian dictionary contains 618,000 forms corresponding to 145,000 lemma-PoS combinations.
  • The Portuguese dictionary contains about 909,000 forms corresponding to 105,000 lemma-PoS combinations.
  • The Russian dictionary contains over 1,630,000 forms corresponding to more than 510,000 lemma-PoS combinations.
  • The Slovenian dictionary contains about 920,000 forms corresponding to some 94,000 lemma-PoS combinations.

Each of these morphological dictionaries was obtained from a different source, and has its specific distribution license. Check the COPYING file in the tarball to make sure you're not violating any of them.

FreeLing also includes WordNet-based sense dictionaries for the languages for which one is available:

  • The English sense dictionary is extracted directly from WN 3.0 and is therefore distributed under the terms of the WN license. You'll find a copy in the WN.license file.
  • Data in the sense dictionaries for Catalan, Galician, and Spanish are extracted from the MCR3.0 project. They are distributed under their original Creative Commons Attribution license (CC-BY).
  • Data for the Italian WordNet are extracted from the OpenWN project MultiWordnet project developed by Fondazione Bruno Kessler. They are distributed under their original CC-BY license.
  • Data for the Portuguese WordNet are extracted from the brazilian-wn project. They are distributed under their original CC-BY license.
  • Data for the Slovenian WordNet are extracted from the sloWNet project developed by Darja Fišer (Fišer 2012). They are distributed under their original Creative Commons Attribution-ShareAlike 4.0 License (CC-BY-SA).
  • Data for the Croatian WordNet are extracted from croWN, developed by the University of Zagreb and available at the Open Multilingual WN page. They are distributed under their original Creative Commons Attribution License (CC-BY).