pt NER extraction

Submitted by break on Wed, 04/12/2017 - 16:57

Hello Freeling friends, I have a small problem running NER / NEC.
I'm trying to detect persons, organizations, places, dates, etc. in a given plain text. As far as I know, the text must be sentence-split (not sure about this) and tokenized (one word per line). However, I'm running analyze on a plain text that contains names of people, places, organizations, etc. to see what the output looks like, and I get an error about the np.dat file. Maybe I'm calling it wrongly.
The config file has NER and NEC enabled:


#NER options
NERecognition=yes
NPDataFile=$FREELINGSHARE/pt/np.dat
#..

## NEC options
NEClassification=yes
NECFile=$FREELINGSHARE/pt/nerc/nec/nec-ab-poor1.dat
#..

Then I call:
analyze -f pt.cfg --ner -N np.dat --nec input.txt > Ner.txt

And I get this error:
NP: Error opening file np.dat

I checked whether I also needed to pass the NEC classifier file:
analyze -f pt.cfg --ner -N np.dat --nec --fnec nec-ab-poor1.dat input.txt > Ner.txt

I got the same error.

If someone could help me, I would appreciate it a lot.
Thank you!

NP: Error opening file np.dat
This probably means that the file could not be found.
If you use option "-N np.dat", the file np.dat is expected to be in the current working directory (cwd).
If it is not in the cwd, then you should provide the path: "-N /path/to/my/np.dat"

If you want to use the default np.dat, you don't need to add any option, since it is already configured in the default config file pt.cfg. NER is already enabled by default.

So, if you simply run "analyze -f pt.cfg" you will get PoS tagging with NER.
If you add option "--nec" you will also get NEC, using the model specified in the config file. You only need options -N and --fnec if you want to use files different from those set in pt.cfg.
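
As a rough illustration of the rule above, here is a small Python sketch that builds the command line and only adds -N and --fnec when a non-default file is requested. The helper name build_analyze_cmd and its parameters are made up for this example; the flags themselves are the ones already used in this thread.

```python
def build_analyze_cmd(config="pt.cfg", nec=False, np_file=None, nec_file=None):
    """Build an argument list for the analyze command discussed in this thread.

    Hypothetical helper: np_file and nec_file are only needed when the files
    configured in pt.cfg should be overridden.
    """
    cmd = ["analyze", "-f", config]      # NER is already enabled in pt.cfg
    if np_file is not None:
        cmd += ["-N", np_file]           # give a full path unless the file is in the cwd
    if nec:
        cmd.append("--nec")              # use the NEC model named in the config file
    if nec_file is not None:
        cmd += ["--fnec", nec_file]      # override the NEC model from the config file
    return cmd


print(" ".join(build_analyze_cmd(nec=True)))
```

With the defaults, the list contains no -N or --fnec at all, which matches the advice to rely on pt.cfg unless you really need other files.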

Thank you lluisp for the fast answer. I've made some changes to the file nec-ab-poor1.dat:

0 Pessoa 1 Localizacao_Geografica 2 Organizacao 3 Outro

The output:

Avenida_AEP avenida_aep Outro 1
, , Fc 1
em em SP 1
o o DA0MS0 1
Porto porto Localizacao_Geografica 1
. . Fp 1

o o DA0MS0 0.950254
jornal jornal NCMS000 1
JN jn Organizacao 1
, , Fc 1
António_Campos antónio_campos Pessoa 1
, , Fc 1
chefe chefe NCMS000 1
de de SP 1
segunda segundo AO0FS00 0.944444
classe classe NCFS000 1
de de SP 1
o o DA0MS0 0.950254 o PD0MS00 0.0313531 o PP3MSA0 0.0183743 o NCMS000 1.85412e-05
Batalhão_de_Sapadores_Bombeiros_do_Porto batalhão_de_sapadores_bombeiros_do_porto Organizacao 1
. . Fp 1

Em em SP 1
o o DA0MS0 1
sentido sentido NCMS000 0.988189
Porto-Matosinhos porto-matosinhos Localizacao_Geografica 1
. . Fp 1

IFP ifp Organizacao 1
e e CC 0.999951
IPF ipf Organizacao 1
. . Fp 1

Joao_Santos joao_santos Pessoa 1
que que PR0CN00 0.584265
vive viver VMIP3S0 0.923762
em em SP 1
Santos santos Localizacao_Geografica 1
, , Fc 1
Brasil brasil Localizacao_Geografica 1
. . Fp 1

Em em SP 1
o o DA0MS0 1
local local NCFS000 0.41142
estiveram estar VMIS3P0 0.65
os o DA0MP0 0.958022
Sapadores sapadores Organizacao 1
apoiados apoiar VMP00PM 0.962537
por por SP 1
os o DA0MP0 1
Bombeiros_de_Leixões bombeiros_de_leixões Organizacao 1
, , Fc 1
segundo segundo SP 0.636259
testemunho testemunho NCMS000 0.914011
de de SP 1
o o DA0MS0 0.950254 o PD0MS00 0.0313531 o PP3MSA0 0.0183743 o NCMS000 1.85412e-05
chefe chefe NCMS000 1
José_Luís josé_luís Pessoa 1
. . Fp 1

Terça-feira [M:??/??/??:??.??:??] W 1
até até SP 0.764262
Sábado [S:??/??/??:??.??:??] W 1
ou ou CC 1
Domingo [G:??/??/??:??.??:??] W 1
. . Fp 1

My question is: where can I fine-tune the output? I want to output only the NER tags with the corresponding word, like:
- JN -> Organizacao . . . António Campos -> Pessoa
Another question: can I relate António to JN based on the distance between the two words? I mean, if the words are close together, is it more likely that they are related?

I'm sorry for all these questions. I'm really enjoying working with Freeling, and I appreciate all your help!
Thank you!

If you want only entities, the easiest way is to filter the freeling output (e.g. with grep, or with a simple python script) and extract only what you want.
Alternatively, you can modify the code in the analyzer program that outputs the sentence (though I'd say it's not worth the trouble).

Regarding the relation between two entities given their distance, I am not sure that it makes much sense, but you can use it as a baseline.
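
A minimal sketch of such a distance baseline, assuming the output format shown earlier in this thread (one token per line as form, lemma, tag, probability, with a blank line between sentences). The function names and the max_gap threshold are hypothetical, and the tag names are the ones from the modified nec-ab-poor1.dat.

```python
ENTITY_TAGS = {"Pessoa", "Localizacao_Geografica", "Organizacao"}


def read_sentences(lines):
    """Group analyzer output lines into sentences of (form, tag) pairs."""
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:                       # blank line closes a sentence
            if sentence:
                yield sentence
            sentence = []
        elif len(fields) >= 3:
            sentence.append((fields[0], fields[2]))
    if sentence:
        yield sentence


def entity_pairs(sentence, max_gap=3):
    """Pair entities of one sentence that are at most max_gap tokens apart."""
    entities = [(i, form) for i, (form, tag) in enumerate(sentence)
                if tag in ENTITY_TAGS]
    pairs = []
    for a in range(len(entities)):
        for b in range(a + 1, len(entities)):
            gap = entities[b][0] - entities[a][0]
            if gap <= max_gap:
                pairs.append((entities[a][1], entities[b][1], gap))
    return pairs


sample = """o o DA0MS0 0.950254
jornal jornal NCMS000 1
JN jn Organizacao 1
, , Fc 1
António_Campos antónio_campos Pessoa 1
, , Fc 1
chefe chefe NCMS000 1
""".splitlines()

for sent in read_sentences(sample):
    print(entity_pairs(sent))    # [('JN', 'António_Campos', 2)]
```

This only says that two entities are close to each other, not that they are actually related, so it should be treated as a baseline to compare against and nothing more.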

So, given the freeling output, I just write a simple program or script that goes through that output and extracts only what I want. Great, I'm going to do that.
I'll keep in touch. Thank you once again.

Hello Freeling users, I wrote the following Python program to extract the lines with the tags I'm looking for:

    # Entity tags defined in the modified nec-ab-poor1.dat
    TAGS = {"Organizacao", "Localizacao_Geografica", "Pessoa"}

    with open("outputFrl-NER.txt") as f:
        for line in f:
            fields = line.split()
            # The tag is the third column: form, lemma, tag, probability
            if len(fields) >= 3 and fields[2] in TAGS:
                print(line.rstrip())

I have another question: can it recognize money or time, or even days like "Sexta-feira"?
Thank you! I hope this helps someone!

Yes, freeling can recognize time expressions (dates, time of day, combinations of them) and quantified magnitudes (money, but also speed, weight, distances, density, etc).

It may be faster to simply try it (or read the manual) before asking....

I've read the manual and saw that it can recognize dates, times of day, etc., but I was a little confused when I saw the output:

...
terça-feira [M:??/??/??:??.??:??] W 1
até até SP 0.764262
Sábado [S:??/??/??:??.??:??] W 1
ou ou CC 1
Domingo [G:??/??/??:??.??:??] W 1
...
oferecer oferecer VMN0000 0.529581
25.000_€ $_ECU:25000 Zm 1
...
100_euros $_ECU:100 Zm 1
...
24:26 24:26 Z 1
...
EDP edp Organizacao 1
...
Lisboa lisboa Localizacao_Geografica 1
...
Srº_João_Luís srº_joão_luís Pessoa 1
...
Bombeiros_de_Lisboa bombeiros_de_lisboa Localizacao_Geografica 1
. . Fp 1

Now I get it: I'll check whether the line has "W" for day of week, "$_ECU" for money, or "Z" for time.

Thank you for your help!

$_ECU means "money in euros". If the unit is dollars, you'll get $_USD; if it is pounds, you'll get $_GBP, etc.
It is safer to check for "Zm" in the third column.

For date/time expressions, "W" is the tag to look for. This covers not only weekdays: for example, "Thursday May 3rd, 2014" is also a date expression that will be marked as "W".

"Z" alone means number. That means it could not establish whether 24:26 was a time, a ratio, or a basketball score because of lack of context. (btw, 24:26 is not a valid time of day. It is either 12:26 or 00:26).
Try with, for instance, "a las 12 e 26 minutos" or "Sabado 24 de febreiro a las 23:12".
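
To make the tag checks above concrete, here is a short Python sketch that looks only at the third column and compares it exactly, instead of searching the whole line for "W" or "Z". The KINDS table and the classify function are illustrative names, and the table covers only the three tags discussed in this thread.

```python
# Only the tags mentioned in this thread; other tags are reported as None.
KINDS = {"W": "date/time", "Zm": "money", "Z": "number"}


def classify(line):
    """Return the kind of expression on one analyzer output line, or None."""
    fields = line.split()
    if len(fields) < 3:
        return None
    return KINDS.get(fields[2])          # exact match on the third column


for line in ["Sábado [S:??/??/??:??.??:??] W 1",
             "100_euros $_ECU:100 Zm 1",
             "24:26 24:26 Z 1",
             "EDP edp Organizacao 1"]:
    print(line.split()[0], "->", classify(line))
```

An exact comparison avoids false hits: a substring test such as "Z" in line would also match "Zm" lines and any word that happens to contain a capital Z.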