Verb at the beginning of a sentence in Spanish

Submitted by samuelhrg on Tue, 01/10/2017 - 18:23

I found a small problem with the PoS tagging. In the sentence "Esperé a que volvieras", FreeLing tags "Esperé" as a noun. If I instead use "Yo esperé a que volvieras", then "Yo" is tagged as the noun and "esperé" as the verb.
I can guess where the problem comes from. In Spanish you don't need to state the subject at the beginning of the sentence because the verb ending already gives you this information. "Esperé" means "I waited", so there's no need to write "Yo esperé".
At first I thought it wasn't a big deal, but I can see it's going to be a big problem because it's pretty common to omit the subject at the beginning of a sentence in Spanish.

It works for me:

$ echo "Yo esperé a que volvieras." | analyze -f es.cfg
Yo yo PP1CSN0 1
esperé esperar VMIS1S0 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
. . Fp 1


$ echo "Esperé a que volvieras." | analyze -f es.cfg
Esperé esperar VMIS1S0 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
. . Fp 1

Neither "esperé" nor "yo" is a noun in the dictionary, so you are mixing something up (e.g. using the English configuration to analyze Spanish text).
If you provide some more details, maybe we can spot what you are doing wrong.
Which FreeLing version are you using? Which command are you using?
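
To compare runs like the two above programmatically, here is a minimal Python sketch that parses this kind of output. It assumes each token line has the four whitespace-separated fields visible above (form, lemma, tag, probability) and that sentences are separated by blank lines. `Token` and `parse_analysis` are names made up for this example; they are not part of FreeLing.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str
    prob: float


def parse_analysis(output: str) -> list[list[Token]]:
    """Group 'form lemma tag prob' lines into sentences split on blank lines."""
    sentences: list[list[Token]] = []
    current: list[Token] = []
    for line in output.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) < 4:
            # Skip lines that do not have the four assumed fields.
            continue
        form, lemma, tag, prob = fields[:4]
        current.append(Token(form, lemma, tag, float(prob)))
    if current:
        sentences.append(current)
    return sentences


# Output copied from the second run above.
sample = """Esperé esperar VMIS1S0 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
. . Fp 1
"""

first = parse_analysis(sample)[0][0]
print(first.form, first.tag)  # Esperé VMIS1S0
```

Checking `first.tag` this way makes it easy to see at a glance whether the first word came out as a verb (VM...) or a proper noun (NP...).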

samuelhrg

Mon, 01/16/2017 - 16:55

Sorry I took so long to reply.
I'm using Freeling 4.0 on Windows. I did some testing:

Yo esperé a que volvieras.
Esperé a que volvieras.

D:\Dropbox\CENIDET\Ruby>analyzer.bat -f "C:\Freeling-4.0\data\config\es.cfg" < texto.txt
  Fz 1
Yo yo NP00000 1
esperé esperar VMIS1S0 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
. . Fp 1

Esperé esperar VMIS1S0 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
. . Fp 1

That works fine, except that it marks "Yo" as a proper noun instead of a pronoun.
Then I noticed that it is the full text that's giving me trouble.

Esperé a que volvieras, Dra. Azucena. Durante días, semanas, años, quizá toda la vida.

D:\Dropbox\CENIDET\Ruby>analyzer.bat -f "C:\Freeling-4.0\data\config\es.cfg" < poema.txt
  Fz 1
Esperé esperé NP00000 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
, , Fc 1
Dra._Azucena dra._azucena NP00000 1
. . Fp 1

Durante durante SP 1
días día NCMP000 1
, , Fc 1
semanas semana NCFP000 1
, , Fc 1
años año NCMP000 1
, , Fc 1
quizá quizá RG 1
toda todo DI0FS0 0.984467
la el DA0FS0 0.98926
vida vida NCFS000 1
. . Fp 1

Now I'm guessing it has something to do with the context, but I'm not sure what's happening.

In the output of your command:

D:\Dropbox\CENIDET\Ruby>analyzer.bat -f "C:\Freeling-4.0\data\config\es.cfg" < texto.txt
  Fz 1
Yo yo NP00000 1
esperé esperar VMIS1S0 1
a a SP 0.998775
que que CS 0.449861
volvieras volver VMSI2S0 1
. . Fp 1

you can see that there is a line " Fz 1" before "Yo".

This line is a token that FreeLing created because there was something at the beginning of your file (maybe a BOM marker or the like; Windows editors tend to add this kind of garbage to plain text files).
This character is recognized as "Fz", which means "unknown punctuation sign".
Then comes "Yo", which is a capitalized word after a non-sentence-ending punctuation mark (Fz is not in the list of sentence-ending punctuation), so it is interpreted as a proper noun.

Make sure that your file does not have binary codes at the beginning, and you should be OK.
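
If the stray character does turn out to be a UTF-8 byte order mark (this is only a guess; the file may contain something else), a small Python sketch like the following could remove it before the file is passed to the analyzer. The function names are made up for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def clean_file(src: str, dst: str) -> None:
    """Copy src to dst, dropping a leading UTF-8 BOM (assumes the file fits in memory)."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(strip_utf8_bom(data))
```

For example, `clean_file("texto.txt", "texto_clean.txt")` would write a cleaned copy (the output name is hypothetical), which can then be fed to the analyzer in place of the original.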

antoniofrias

Wed, 03/21/2018 - 11:35

FreeLing 4.0
$ echo "Antes de irse me repitieron varias veces que me portase bien. Mamá me dio los últimos consejos." | analyze -f es.cfg

Antes_de antes_de SP 1
ir ir VMN0000 1
se se PP3CN00 1
me me PP1CS00 0.755196
repitieron repetir VMIS3P0 1
varias varios DI0FP0 0.966309
veces vez NCFP000 1
que que PR0CN00 0.550139
me me PP1CS00 0.755196
portase portar VMSI3S0 0.634863
bien bien RG 0.876088
. . Fp 1

Mamá mamá NP00000 0.331016
me me PP1CS00 0.755196
dio dar VMIS3S0 1
los el DA0MP0 0.992728
últimos último AO0MP00 1
consejos consejo NCMP000 1
. . Fp 1

Here, FreeLing tags 'Mamá' as a proper noun. Is it a bug? However,

$ echo "Antes de irse mamá me dio los últimos consejos." | analyze -f es.cfg

Antes_de antes_de SP 1
ir ir VMN0000 1
se se PP3CN00 1
mamá mamá NCFS000 1
me me PP1CS00 0.755196
dio dar VMIS3S0 1
los el DA0MP0 0.992728
últimos último AO0MP00 1
consejos consejo NCMP000 1
. . Fp 1

FreeLing's proper noun detector relies heavily on capitalization: a capitalized word in the middle of a sentence will be marked as a proper noun (except for a small list of exceptions).
But at the beginning of a sentence, all words are capitalized, so FreeLing needs to resort to other heuristics, such as:

- An out-of-dictionary word at the beginning of a sentence is a proper noun

- A word at the beginning of a sentence which is in the dictionary as a verb, adverb, preposition, etc. is not considered to be a possible proper noun.

- A word at the beginning of a sentence which is in the dictionary as an adjective or a noun is marked as a possible proper noun, and the tagger decides whether it is marked as NC or NP. The rationale is that many person names in Spanish are adjectives or nouns (e.g. "Zapatero asistió a la reunión", "Blanco declaró su inocencia", etc.).
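
The three heuristics above can be sketched roughly as follows. This is only an illustration in Python, not FreeLing's actual code: the function name, the return labels, and the toy dictionary are invented for the example. It also assumes that a word with both a noun/adjective reading and some other reading counts as a possible proper noun, which the rules above do not spell out.

```python
def sentence_initial_status(dictionary_tags: set[str]) -> str:
    """Classify a sentence-initial word from its dictionary tags.

    Returns "NP" (proper noun), "not-NP" (never a proper noun candidate),
    or "ambiguous" (possible proper noun, left for the tagger to decide).
    """
    if not dictionary_tags:
        # Out-of-dictionary word at the start of a sentence: proper noun.
        return "NP"
    # Tags starting with N (noun) or A (adjective) make the word a candidate.
    # Assumption: such a reading wins even if other readings also exist.
    if any(tag[0] in {"N", "A"} for tag in dictionary_tags):
        return "ambiguous"
    # Only verb, adverb, preposition, etc. readings: not a proper noun.
    return "not-NP"


# Toy dictionary, for illustration only.
toy_dictionary = {
    "esperé": {"VMIS1S0"},
    "mamá": {"NCFS000"},
    "blanco": {"AQ0MS00", "NCMS000"},
}

for word in ["Esperé", "Mamá", "Blanco", "Xyzzy"]:
    tags = toy_dictionary.get(word.lower(), set())
    print(word, sentence_initial_status(tags))
```

With this toy dictionary, "Esperé" comes out as "not-NP", "Mamá" and "Blanco" as "ambiguous", and the unknown word "Xyzzy" as "NP".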

In the particular case of "Mamá", it seems that although the initial probability is higher for NC, the context drives the tagger to select NP as the most likely tag. This is caused by two things:

- First, the word "mamá" was never seen in the training data (FreeLing is trained on a newspaper corpus), so the tagger is just using a default probability model.

- Second, and more decisive, the use of "Mamá" in this sentence is that of a proper noun. Common nouns do not often start sentences in Spanish, since they are usually preceded by a determiner. E.g. you say "Mi amigo me dio consejos" or "Juan me dio consejos", but not "Amigo me dio consejos". Thus, the probability of selecting NP for "Mamá" in this sentence is boosted by the fact that its syntactic behaviour here is that of an NP.

So, in summary: NLP tools, including FreeLing, are not perfect, because they are trained on data, and data is always skewed towards certain words or domains (Zipf's law). Also, there are exceptions that do not follow the general rule (such as "Mamá" starting a sentence and behaving like an NP) that are hard to capture statistically.

Thus, PoS taggers make mistakes, so you should expect about 2% or 3% of the words to have a wrong PoS tag, about 10% of the named entities to have a mistake either in their span or their class, about 10% of the words to have the wrong head in a dependency parse tree, etc.

Those are not bugs.
They are just the limits of the current technology.

antoniofrias

Tue, 03/27/2018 - 18:49

OK.
I use Freeling mainly to analyze literature texts. It's a very good and useful technology.
I find in literature texts that sometimes common nouns start sentences. (e.g. "Me lo reprochas con razón. Amigo mío, ya ves que soy débil")
Thank you.

Sure, NC can start sentences; it is just less frequent. And frequency is very important for a statistical tagger.

But it is not only the position of the word; it also depends on how good the statistical representation of the word in question is.

e.g. your example is tagged correctly:

Me me PP1CS00 0.755196
lo lo PP3MSA0 0.334764
reprochas reprochar VMIP2S0 1
con con SP 1
razón razón NCFS000 1
. . Fp 1

Amigo amigo NCMS000 0.909449
mío mío AP0MS1S 0.340007
, , Fc 1
ya ya RG 0.999785
ves ve NCFP000 0.959746
que que CS 0.449861
soy ser VSIP1S0 1
débil débil AQ0CS00 0.98913
. . Fp 1

because "amigo" did occur in the training corpus, so it has a more accurate statistical model than "mamá", which was never observed.

 

antoniofrias

Sat, 03/31/2018 - 10:07

I know frequency is very important for a statistical tagger.
In the example ("Amigo mío ya ves que soy débil.") , 'amigo' is tagged correctly, but 'ves' is tagged as NCFP: "ves ve NCFP000 0.959746".
According to DRAE, "ve. f. Nombre de la letra v."
Whatever the training corpus, "ves" as NCFP (the name of a letter) is rather rare. 'ves' occurs 4823 times in CREA. The most frequent use of "ves" is as a verb.
Some examples from "Fortunata y Jacinta" -'ves' occurs 70 times-:

$ echo "Si te ves en el trance." | analyze -f es.cfg
Si si CS 0.999827
te te PP2CS00 0.749042
ves ver VMIP2S0 0.0402535
en en SP 1
el el DA0MS0 1
trance trance NCMS000 0.926209
. . Fp 1

$ echo "Ya ves que te canto claro." | analyze -f es.cfg
Ya ya RG 0.999785
ves ve NCFP000 0.959746
que que PR0CN00 0.550139
te te PP2CS00 0.749042
canto canto NCMS000 0.989725
claro claro AQ0MS00 0.610577
. . Fp 1
(Here, 'canto' is tagged as NCMS)

$ echo "Recibes a Dios, y ves a tu nena." | analyze -f es.cfg
Recibes recibir VMIP2S0 1
a a SP 0.998775
Dios dios NP00000 1
, , Fc 1
y y CC 0.999989
ves ve NCFP000 0.959746
a a SP 0.998775
tu tu DP2CSS 1
nena nene NCFS000 1
. . Fp 1

$ echo "A los diez minutos su fisonomía estaba tan variada, que si la ves no la conoces." | analyze -f es.cfg
A a SP 0.998775
los el DA0MP0 0.992728
diez_minutos TM_min:10 Zu 1
su su DP3CSN 1
fisonomía fisonomía NCFS000 1
estaba estar VMII3S0 0.499587
tan tan RG 1
variada variar VMP00SF 1
, , Fc 1
que que CS 0.449861
si si CS 0.999827
la el DA0FS0 0.98926
ves ve NCFP000 0.959746
no no RN 0.999297
la lo PP3FSA0 0.010734
conoces conocer VMIP2S0 1
. . Fp 1
(Here, the first occurrence of 'la' is tagged as DA)

'ves' is sometimes tagged as V and sometimes as NC. The probability assigned to the NC tag is very high. I think (but I could be wrong) that this high probability corresponds neither to the frequency in the training corpus nor to structural characteristics of the Spanish language.
I know that a complex technology has limits. These comments only try to contribute to its improvement.
Thank you.

The error in "ves" is again for the same reason.

The training corpus consists of newspapers. There is not a single occurrence of "ves" in the training data (in fact, there are few occurrences of verbs in 1st or 2nd person forms), so the tagger is again resorting to default background probabilities (which, for words with the NC-VMI ambiguity, largely favour the NC option).
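
To illustrate what such a background probability can look like, here is a small Python sketch of one common approach: pooling counts over all words that share the same set of possible tags (an ambiguity class) and using the pooled distribution for words never seen in training. This is an assumption about the general technique, not a description of how FreeLing computes its defaults, and the counts below are made-up toy numbers.

```python
from collections import Counter, defaultdict


def class_priors(observations):
    """Estimate P(tag | ambiguity class) from (ambiguity_class, tag) pairs.

    ambiguity_class is a frozenset of the tags a word may take;
    tag is the tag the word actually received in the training data.
    """
    counts = defaultdict(Counter)
    for amb_class, tag in observations:
        counts[amb_class][tag] += 1
    return {
        amb_class: {tag: n / sum(tag_counts.values()) for tag, n in tag_counts.items()}
        for amb_class, tag_counts in counts.items()
    }


# Toy training data: words that can be NC or VMI, seen 9 times as NC
# and once as VMI (invented numbers, for illustration only).
NC_VMI = frozenset({"NC", "VMI"})
toy_observations = [(NC_VMI, "NC")] * 9 + [(NC_VMI, "VMI")] * 1

priors = class_priors(toy_observations)
print(priors[NC_VMI]["NC"])  # 0.9
```

An unseen word such as "ves" would then inherit the pooled NC-VMI distribution, so in a corpus where such words are mostly nouns, the NC reading starts with a large advantage.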