Different tag results between two very similar sentences

Submitted by antodellanzo on Sun, 05/03/2020 - 19:39

Hello. I'm using FreeLing from Python to analyze a set of newspaper articles in Spanish. I've added a number of locations to the location gazetteers, and I've removed all locations from the other gazetteers (e.g. organization), because some locations were being returned as ORG and all I need are locations. I'm analyzing my text with tokenizer, splitter, ner-ab-rich.dat NER and finally nec-ab-rich.dat NEC (all with the default configuration; I haven't changed them). I'm having a problem with the following two sentences:
1- Título: VIRUS ZIKA - BRASIL: (SP) PRIMER CASO AUTOCTONO.
2- Título: ZIKA - BRASIL: (03) MICROCEFALIA, MUERTES, ACTUALIZACIÓN.
When FreeLing analyzes the first sentence, it returns the tag NP00G00 for brasil (correctly, a location), but for the second sentence it returns the tag NCMS000 for brasil.
So my question is why this happens, since both sentences are analyzed with the same configuration and are very similar, yet they get different results. My other question is whether there is a way to give higher priority to the location gazetteers, so that I don't have to remove locations from the ORG gazetteers, or whether removing them is the right approach.

Let me know if there's anything more that I can tell you about my configuration.

In my installation (latest GitHub master), I get NP00O00 in both cases:

Título: VIRUS ZIKA - BRASIL: (SP) PRIMER CASO AUTOCTONO.
Título título NCMS000 1
: : Fd 1
VIRUS virus NCMN000 1
ZIKA zika NCFS000 1
- - Fg 1
BRASIL brasil NP00O00 1
: : Fd 1
( ( Fpa 1
SP sp NP00G00 1
) ) Fpt 1
PRIMER 1 AO0MS00 1
CASO caso NCMS000 0.999445
AUTOCTONO autoctono NCMS000 1
. . Fp 1

Título: ZIKA - BRASIL: (03) MICROCEFALIA, MUERTES, ACTUALIZACIÓN.
Título título NCMS000 1
: : Fd 1
ZIKA zika NCFS000 1
- - Fg 1
BRASIL brasil NP00O00 1
: : Fd 1
( ( Fpa 1
03 3 Z 1
) ) Fpt 1
MICROCEFALIA microcefalia NCFS000 1
, , Fc 1
MUERTES muerte NCFP000 1
, , Fc 1
ACTUALIZACIÓN actualización NCFS000 1
. . Fp 1

 

FreeLing NERC systems are all based on machine learning. That means that they learn from annotated data, and try to reproduce the learned behaviour in the new data.

Gazetteers are used by the algorithm only as additional features. That is, they are clues that are mixed with many other context clues to decide a tag for the word. The presence or absence of a word in a gazetteer does not determine the output, since each decision depends on a lot of small clues in that particular word context. Thus, modifying the gazetteers may be useful in some cases, but does not guarantee a right output.

In any case, ML systems are not 100% accurate. There will always be GEOs labelled as ORG and vice versa. You need to live with that, and accept a percentage of error.

Some options you have:

* You can try to fix FreeLing output using your domain knowledge (e.g. if you have a list of countries, you can use it to post-process FreeLing output changing ORGs to GEOs when needed)

* You can try to use the latest GitHub version, which includes a CRF NERC module that has better performance than the NERC modules in v4.1.

* You say "I'm analyzing my text with tokenizer, splitter, ner-ab-rich.dat NER and finally nec-ab-rich.dat NEC" ... Please note that the NERC modules will not perform properly unless the sentence has previously been PoS tagged.
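
As a rough illustration of the first option (post-processing with your own domain list), here is a minimal Python sketch. It does not call FreeLing at all: it assumes you have already collected each token as a (form, lemma, tag) triple, in the same column order as the listings above. The names parse_line, relabel_locations and KNOWN_LOCATIONS are made up for this example, and the list of locations is just a placeholder for your own.

```python
from typing import Iterable, List, Set, Tuple

# One analyzed token: (form, lemma, tag), same column order as the listings above.
Token = Tuple[str, str, str]

ORG_TAG = "NP00O00"  # proper noun classified as organization
GEO_TAG = "NP00G00"  # proper noun classified as location


def parse_line(line: str) -> Token:
    """Split one 'form lemma tag probability' output line into a Token."""
    form, lemma, tag = line.split()[:3]
    return form, lemma, tag


def relabel_locations(tokens: Iterable[Token], known_locations: Set[str]) -> List[Token]:
    """Change ORG tags to GEO tags for lemmas found in a domain list of locations."""
    fixed = []
    for form, lemma, tag in tokens:
        if tag == ORG_TAG and lemma.lower() in known_locations:
            tag = GEO_TAG
        fixed.append((form, lemma, tag))
    return fixed


# Hypothetical domain list (lowercased lemmas); replace it with your own locations.
KNOWN_LOCATIONS = {"brasil", "argentina"}

lines = [
    "ZIKA zika NCFS000 1",
    "- - Fg 1",
    "BRASIL brasil NP00O00 1",
    ": : Fd 1",
]
tokens = [parse_line(line) for line in lines]
for form, lemma, tag in relabel_locations(tokens, KNOWN_LOCATIONS):
    print(form, lemma, tag)
```

Only tokens already tagged as ORG are touched, so a word that happens to be in the list but was analyzed as a common noun is left alone. Whether to extend the rule to other tags depends on how much you trust your list in your domain.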