Aligning MWTs and named entities with simply tokenized text

Submitted by herschelrs on Tue, 06/25/2024 - 15:25

I'm working on an application (for Spanish, though I imagine this applies in general) that needs to associate a lemma with each token of a text that has been tokenized "naturalistically", let's say. So far I've worked out how to get FreeLing to produce "naturalistic" tokenization, but only at the cost of the rest of the analysis, at least for the affected tokens. The CoNLL-U spec obviously allows multiword tokens (MWTs) to be represented in a way that preserves the original tokenization, but I haven't been able to find a way to get FreeLing to output something like this.

(I've spent some time looking at the docs and API reference at this point but haven't read them carefully, so it's likely I'm missing some important pieces. In particular, I couldn't find any specific documentation for op.invoke_opt.MACO_AffixAnalysis or op.invoke_opt.MACO_MultiwordsDetection. The former, e.g., controls separating reflexive verbs from an attached pronoun, but if it's turned off the lemmatization for that token fails; for the latter, I couldn't quite understand what it does, but it surely seemed relevant.)

Naturally, there's also a symmetric problem with named entities. One could split the form on underscores, but word.is_multiword() is less than flawless for distinguishing multiwords from tokens *that contain an underscore* (e.g. if a variable name occurred in a text), and this would make aligning FreeLing's tokenization much more difficult.

FreeLing tokens include information about their original position in the text. This can be retrieved via the methods word::get_span_start and word::get_span_finish, documented at:
https://freeling-user-manual.readthedocs.io/en/latest/language-classes/

So, after calling the library and getting the analysis, you can get the original position of each token and align them with the input text string.
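As a rough illustration of that alignment step, here is a minimal Python sketch. It does not call FreeLing itself: the `Analyzed` tuples are hand-written stand-ins for what you would collect from each word's form, lemma, get_span_start and get_span_finish. It assumes the spans are character offsets into the input string with an exclusive end, and that the parts of a split contraction share the span of the original token; both assumptions should be checked against your FreeLing version. The names `Analyzed`, `surface_tokens` and `align` are hypothetical helpers, not part of the FreeLing API.

```python
import re
from typing import List, NamedTuple, Tuple


class Analyzed(NamedTuple):
    """Stand-in for one analyzer token (hypothetical, not a FreeLing class)."""
    form: str
    lemma: str
    start: int   # assumed: offset of the first character in the input text
    finish: int  # assumed: offset one past the last character


def surface_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace-delimited ("naturalistic") tokens with their offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align(text: str, analyzed: List[Analyzed]) -> List[Tuple[str, List[Analyzed]]]:
    """Attach to each surface token every analyzer token whose span overlaps it."""
    aligned = []
    for form, start, end in surface_tokens(text):
        overlapping = [a for a in analyzed if a.start < end and start < a.finish]
        aligned.append((form, overlapping))
    return aligned


text = "Vengo del parque de Nueva York"

# Illustrative values only, written by hand rather than produced by FreeLing.
analyzed = [
    Analyzed("vengo", "venir", 0, 5),
    Analyzed("de", "de", 6, 9),        # first part of the contraction "del"
    Analyzed("el", "el", 6, 9),        # second part, same span (assumption)
    Analyzed("parque", "parque", 10, 16),
    Analyzed("de", "de", 17, 19),
    Analyzed("nueva_york", "nueva_york", 20, 30),  # multiword over two surface tokens
]

aligned = align(text, analyzed)
for form, matches in aligned:
    print(form, [a.lemma for a in matches])
```

Because the matching relies on span overlap rather than on the form string, a token that merely contains an underscore is never confused with a multiword: a surface token that maps to several analyzer tokens is an MWT, and an analyzer token that maps to several surface tokens is a multiword or named entity.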

There are some examples of how to use FreeLing as a library at:
https://freeling-tutorial.readthedocs.io/en/latest/

Also, information on available options to handle tokenization of compounds, contractions, clitics, etc. can be found in the user manual:
https://freeling-user-manual.readthedocs.io/en/latest/modules/dictionary/

If you want a CoNLL-like representation, you can use the class output_conll to produce it: https://freeling-user-manual.readthedocs.io/en/latest/modules/io/ (although this doesn't make much sense if you are using FreeLing as a library)

Finally, remember that FreeLing has an Affero GPL license, so if the app you are developing is to be eventually distributed or deployed as a SaaS under a proprietary license, you'll need to acquire a dual license allowing this kind of usage.