Wrong determined PoS and syntactic function

Submitted by carlesg on Fri, 05/27/2016 - 16:15

Hello,

I have a problem with how FreeLing determines the function of some words. Sometimes the PoS tag and the syntactic function it assigns are not correct.

For example, the word 'pago' can be a verb (a form of 'pagar') or a noun, depending on the context.
How does FreeLing decide whether a word acts as a verb or as a noun when both readings are possible?
It seems that the analysis does not detect this context and wrongly determines the sense and the function of the word. Why does this happen?
Can I configure something, or is there some trick or workaround to improve this behavior?

As an example, in the following three sentences the word 'pago' is used as a verb, but in the second and third sentences it is identified as a noun (I assume this is why the syntactic analysis and the dependency trees are also wrong).


Yo pago la cena.
grup-verb/top/(pago pagar VMIP1S0 -) [
  sn/subj/(Yo yo PP1CSN00 -)
  sn/dobj/(cena cena NCFS000 -) [
    espec-fs/espec/(la el DA0FS0 -)
  ]
  F-term/term/(. . Fp -)
]


¿ Cuánto pago por la cena?
F-no-c/top/(¿ ¿ Fia -) [
  sn/modnorule/(pago pago NCMS000 -) [
    espec-ms/espec/(Cuánto cuánto DT0MS0 -)
  ]
  grup-sp/modnorule/(por por SPS00 -) [
    sn/obj-prep/(cena cena NCFS000 -) [
      espec-fs/espec/(la el DA0FS0 -)
    ]
    F-term/term/(? ? Fit -)
  ]
]


¿ Cuánto pago de hipoteca?
F-no-c/top/(¿ ¿ Fia -) [
  sn/modnorule/(pago pago NCMS000 -) [
    espec-ms/espec/(Cuánto cuánto DT0MS0 -)
    sp-de/sp-mod/(de de SPS00 -) [
      sn/obj-prep/(hipoteca hipoteca NCFS000 -)
    ]
    F-term/term/(? ? Fit -)
  ]
]

I use FreeLing version 3.1 from a Java program (via your native interface compiled with SWIG). The language I use is Spanish.

You must keep in mind that state-of-the-art PoS taggers reach around 97-98% accuracy.
That means they will fail on roughly one word in every 50, which means about one error in one sentence out of four.
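
The per-sentence figure depends on how long the sentences are. As a rough sketch, assuming tagging errors are independent across tokens (a simplification) and a per-token accuracy of 98%, the chance that a sentence contains at least one error can be estimated as follows. The function name and the sentence lengths are only illustrative.

```python
def sentence_error_rate(token_accuracy, sentence_length):
    """Probability that a sentence contains at least one tagging error,
    assuming errors on different tokens are independent (an assumption)."""
    return 1.0 - token_accuracy ** sentence_length


if __name__ == "__main__":
    # Illustrative sentence lengths, in tokens.
    for n in (10, 14, 25):
        print(n, round(sentence_error_rate(0.98, n), 3))
```

Under these assumptions, a sentence of about 14 tokens has roughly a one-in-four chance of containing a tagging error.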

The tagger decides the tag depending on the context.
In the first sentence, the probability of a noun after "yo" is very low, so the selected option is verb.

In the second and third sentences, the probability of a noun after "Cuánto" is probably similar to or higher than that of a verb (e.g. "Cuánto pan has comprado", "Cuánto pollo quieres", etc.), and the overall balance of probabilities falls on the side of NC.
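
To make the "balance of probabilities" idea concrete, here is a toy sketch that combines a context probability (which tag tends to follow the previous tag) with a lexical probability (how likely each tag is for "pago"). It is a deliberately simplified stand-in for the real tagger, and every number in it is invented for illustration, not taken from FreeLing's models.

```python
# Toy numbers, invented for illustration only; NOT taken from FreeLing's models.
# TRANSITION[prev_tag][tag]: how likely `tag` is right after `prev_tag`.
TRANSITION = {
    "PP": {"NC": 0.01, "VMI": 0.60},  # after a personal pronoun such as "Yo"
    "DT": {"NC": 0.50, "VMI": 0.30},  # after "Cuánto" read as a determiner
}

# LEXICAL[tag]: how likely each tag is for the word "pago" on its own.
LEXICAL = {"NC": 0.95, "VMI": 0.05}


def best_tag(prev_tag):
    """Pick the tag with the highest combined (context * lexical) score."""
    scores = {tag: TRANSITION[prev_tag][tag] * LEXICAL[tag] for tag in LEXICAL}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print("after PP:", best_tag("PP"))
    print("after DT:", best_tag("DT"))
```

With these made-up values the verb wins after the pronoun because a noun is very unlikely there, while after the determiner the strong lexical preference for NC decides the outcome.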

Often, these mistakes are due to a poor estimate of the lexical probabilities of the word (because it was observed too few times in the training corpus).
These statistics are in the file "probabilitats.dat".
For "pago", you'll find the line:

pago NC-VMI NC 21 VMI 0

This means that "pago" was seen 21 times as a noun and 0 times as a verb.
Even though the probabilities are smoothed, the probability of the verb reading is too low for it to be selected in cases where a noun would also be possible (such as after "Cuánto").
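
As a small sketch of how such a line can be read, the following splits it into the word, the ambiguity class, and the tag/count pairs, and then applies simple additive smoothing. The smoothing scheme and the constant k are assumptions chosen for illustration; they are not FreeLing's actual smoothing, and the function names are hypothetical.

```python
def parse_lexical_line(line):
    """Split a line like 'pago NC-VMI NC 21 VMI 0' into
    (word, ambiguity_class, {tag: count})."""
    fields = line.split()
    word, ambiguity_class = fields[0], fields[1]
    pairs = fields[2:]
    counts = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return word, ambiguity_class, counts


def smoothed_probabilities(counts, k=0.5):
    """Additive smoothing (illustrative only, not FreeLing's own scheme)."""
    total = sum(counts.values()) + k * len(counts)
    return {tag: (count + k) / total for tag, count in counts.items()}


if __name__ == "__main__":
    word, amb_class, counts = parse_lexical_line("pago NC-VMI NC 21 VMI 0")
    print(word, amb_class, counts)
    print(smoothed_probabilities(counts))
```

Even with smoothing, a zero count leaves the verb reading with only a few percent of the probability mass, so the context has to favour the verb very strongly for it to win.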

You can get better results by increasing the count for "pago" as a verb in "probabilitats.dat".
E.g.:
pago NC-VMI NC 21 VMI 4

You'll need to experiment to find the value that tips the balance the right way without assigning too much probability to the verb tag (which would produce "VMI" where the word is actually "NC").
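
If you want to try several values without editing the file by hand each time, a small helper along these lines may be convenient. It is only a sketch: it assumes the whitespace-separated line format quoted above, the function names are hypothetical, and it works on text in memory so that you can decide yourself how to read and write the file. Keep a backup of the original "probabilitats.dat" before overwriting it.

```python
def set_tag_count(line, word, tag, new_count):
    """Return `line` with the count for `tag` replaced, if the line is the
    entry for `word`; otherwise return the line unchanged."""
    fields = line.split()
    if len(fields) < 4 or fields[0] != word:
        return line
    # Fields after the word and ambiguity class come in (tag, count) pairs.
    for i in range(2, len(fields) - 1, 2):
        if fields[i] == tag:
            fields[i + 1] = str(new_count)
    return " ".join(fields)


def patch_text(text, word, tag, new_count):
    """Apply set_tag_count to every line of a file's contents."""
    return "\n".join(
        set_tag_count(line, word, tag, new_count) for line in text.split("\n")
    )


if __name__ == "__main__":
    sample = "otra XX-YY XX 3 YY 1\npago NC-VMI NC 21 VMI 0"
    print(patch_text(sample, "pago", "VMI", 4))
```

After each change, re-run the analysis on a handful of sentences where "pago" is a noun as well as on the failing questions, so you can see whether the new count has gone too far in the other direction.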