NER | FreeLing Home Page

Submitted by tmyapple on Wed, 02/01/2017 - 13:09


Hi, I'm trying to use the NER option and would appreciate any help.

1. I read in the manual that there is a "bio-ner" option. If I want to use it, which flags do I need to change in the config file? Setting NERecognition=bio didn't work. Also, which NPDataFile should I give in this case?
2. It seems that an initial capital letter is very important for NER. Is it possible to retrain the English models on the same data after lowercasing all of it?
3. If I want to train my own models, can you point me to the expected input data format and a sample recipe for doing it?

Again, any help would be appreciated :)

Tamir

just change the config file

If you are using FreeLing by calling the "analyze" program, you just need to change the NER file in your config file (e.g. en.cfg).
Look for the lines:
# NER options
NERecognition=yes
NPDataFile=$FREELINGSHARE/en/np.dat
## --- comment lines above and uncomment one of those below, if you want
## --- a better NE recognizer (higher accuracy, lower speed)
#NPDataFile=$FREELINGSHARE/en/nerc/ner/ner-ab-poor1.dat
#NPDataFile=$FREELINGSHARE/en/nerc/ner/ner-ab-rich.dat
# "rich" model is trained with rich gazetteer. Offers higher accuracy but
# requires adapting gazetteer files to have high coverage on target corpus.
# "poor1" model is trained with poor gazetteer. Accuracy is slightly lower
# but suffers small accuracy loss when the gazetteer has low coverage in target corpus.
# If in doubt, use "poor1" model.

Then comment or uncomment lines as needed.

If you don't want to change the default config file, you can make a copy and edit that, or you can simply use the command line option --fnp to specify the NER configuration file.
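For anyone who prefers to script the "make a copy and switch the model" step, here is a minimal Python sketch. It is not part of FreeLing: the function name select_ner_model is made up for illustration, and it assumes the config has one NPDataFile line per candidate model, laid out as in the snippet quoted above.

```python
import re

# Matches an active or commented-out "NPDataFile=<path>" line.
# Assumption: one candidate model per line, as in the quoted config.
_NP_LINE = re.compile(r"^\s*#?\s*NPDataFile\s*=\s*(\S+)\s*$")


def select_ner_model(cfg_text: str, model_path: str) -> str:
    """Return cfg_text with model_path as the only active NPDataFile line.

    Hypothetical helper for illustration; not a FreeLing API.
    """
    out = []
    activated = False
    for line in cfg_text.splitlines():
        match = _NP_LINE.match(line)
        if match is None:
            out.append(line)  # unrelated line, keep untouched
        elif match.group(1) == model_path and not activated:
            out.append(f"NPDataFile={model_path}")  # uncomment the chosen model
            activated = True
        else:
            out.append(f"#NPDataFile={match.group(1)}")  # comment out the rest
    if not activated:
        # The chosen model was not listed at all: append it.
        out.append(f"NPDataFile={model_path}")
    return "\n".join(out) + "\n"
```

You would read your copy of en.cfg, pass its text through this function with the model path you want, write the result back to the copy, and run analyze with that copy instead of the default config.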

Yes, uppercase letters are a strong marker of a NE, so the classifier learns to strongly rely on this feature.

If you have a corpus with annotated NEs, you can retrain the classifier using the scripts and procedures provided with FreeLing. Check the README file at:
https://github.com/TALP-UPC/FreeLing/tree/master/src/utilities/train-ne…

training lowercased model

Hi again,
First of all, thank you very much for the quick response. I've played a little with the different flags in the config file.

Have you ever trained and tested "lowercased" models using your own data? I'm trying to figure out how much degradation it might cause, and would be happy to hear what you know or think from your experience.

No, we haven't

We have not experimented with simply training a model with all words lowercased, but I suspect that performance would drop dramatically.
The reason is that the model would need to rely heavily on dictionaries (either regular dictionaries or gazetteers) to make its decisions, and this is problematic, since:

Gazetteers are always incomplete. You can build one for a domain, but there will always be missing names. If you change domain, it gets much worse.
Proper nouns are often also common nouns or adjectives (e.g. the surnames "Smith" and "Brown"). So, the fact that something is in a regular dictionary does not rule out the possibility of it being a proper noun.
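To make the two problems above concrete, here is a toy Python sketch of what pure dictionary lookup can tell you once case is gone. The word lists and the function name lookup_evidence are invented for illustration and have nothing to do with FreeLing's actual dictionaries or features.

```python
# Toy resources, invented for this example only.
GAZETTEER = {"smith", "brown", "london"}  # hypothetical list of known names
DICTIONARY = {"smith", "brown", "bear", "saw", "the", "in"}  # hypothetical common words


def lookup_evidence(token: str) -> str:
    """Classify a token using lookups only, ignoring capitalization."""
    word = token.lower()
    in_gazetteer = word in GAZETTEER
    in_dictionary = word in DICTIONARY
    if in_gazetteer and in_dictionary:
        return "ambiguous"  # could be a surname or a common word
    if in_gazetteer:
        return "name"
    if in_dictionary:
        return "common"
    return "unknown"  # not covered by either resource


for token in ["brown", "london", "bear", "tamir"]:
    print(token, "->", lookup_evidence(token))
```

Here "brown" comes out ambiguous because it is in both lists, and "tamir" comes out unknown because the toy gazetteer does not cover it. Without the capitalization cue, the classifier has to resolve both kinds of cases from context alone.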

There is some work in this direction, though. You can have a look at
http://anthology.aclweb.org/P/P15/P15-1013.pdf

Thanks I'll take a look

I have tested a lot of NLP packages, and in all of them an initial capital letter has an enormous influence on NER, which is reasonable (writing the words in lowercase caused false negatives). Yet in a lot of systems users are less strict about their writing: chat apps and e-mails will probably contain a lot of entities without capital letters.

Is there any option for paid support on this matter (where the models would be retrained lowercased, or with a different feature set, but on the same original data)?

It is an open problem

NE detection in non-standard text such as chats, tweets, etc., is an open research issue.
Just training models with lowercase text will not be enough, because the available annotated corpora are basically newspaper text. So, the entities that the model would learn will not be useful for other domains.

So, the solution must lie along the lines of using richer features, latent spaces, and semi-supervised or unsupervised learning.

In summary, it is an unresolved issue. You can put money on it, but with no guarantee of getting results in the short term.