Personal articles in Catalan: "En Pere" & "na Maria"

Submitted by Sebastià Salvà… on Mon, 11/16/2020 - 20:14

Dear all,

Any ideas or suggestions for modifying the Catalan FreeLing language DAT file(s) in order to get the proper tokenization and labeling for personal articles before a proper name in Catalan?

For instance: "en Pere" should be labeled "en DA0MS | Pere NP00SP0", not as a prepositional phrase ("en SP | Pere NP00SP0"). And "na Maria" should be tokenized as two words and tagged as "na DA0FS | Maria_NP", not as "na_Maria_NP00SP0" (with a "_" merging the article and the proper name).

The tagger is statistical, and the frequency of "en" as a DA is 0.08%, which makes it very unlikely that this reading is selected.

However, you may try biasing that decision as follows:

Edit the file /usr/local/share/freeling/ca/tagger.dat, and in section <Forbidden> add a new line:

 *.SP<en>.NP

However, be aware that this will cause "en" to *never* be tagged as a preposition before an NP, which might introduce errors in other sentences.
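
If you would rather script the edit than do it by hand, something like the following could work. It is only a sketch: it assumes the section is delimited by lines reading exactly <Forbidden> and </Forbidden>, and the function name add_forbidden_rule is made up for this illustration. Keep a backup of tagger.dat before overwriting it.

```python
def add_forbidden_rule(dat_text: str, rule: str) -> str:
    """Return dat_text with `rule` appended to the <Forbidden> section.

    Assumption: the section opens with a line "<Forbidden>" and closes
    with a line "</Forbidden>". The text is returned unchanged (apart
    from a normalized trailing newline) if the rule is already there.
    """
    rule = rule.strip()
    lines = dat_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        tag = line.strip()
        if tag == "<Forbidden>":
            start = i
        elif tag == "</Forbidden>" and start is not None:
            # Rules currently listed between the opening and closing tags
            existing = [l.strip() for l in lines[start + 1:i]]
            if rule not in existing:
                # Insert just before the closing tag
                lines.insert(i, rule)
            return "\n".join(lines) + "\n"
    raise ValueError("no <Forbidden> section found")


if __name__ == "__main__":
    # Tiny in-memory example; a real tagger.dat has many more sections.
    sample = "<Forbidden>\n</Forbidden>\n"
    print(add_forbidden_rule(sample, "*.SP<en>.NP"))
```

To apply it to the real file, read /usr/local/share/freeling/ca/tagger.dat into a string, pass it through the function, and write the result back (probably with root permissions, since the file lives under /usr/local/share).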

Regarding "Na_Marta", that happens because the default NE recognizer is also an ML model, and has probably not seen that case often. You can solve that by using an alternative NE recognizer: add "--fnp /usr/local/share/freeling/ca/np.dat" to your analyzer command.

However, if "Na" is capitalized and not at the beginning of the sentence, it will be considered part of the NP by any NER system.

 

Sebastià Salvà…

Mon, 11/23/2020 - 21:29

Thanks for your answer.

Could I also try introducing this other exception, "*.SP.NP00SP0", so that only personal proper names (and not place names) are excluded?