Retraining NER using the FreeLing corpus

Submitted by UDL on Thu, 11/19/2020 - 22:58

Hello, I would like to extend FreeLing's named-entity classification, improving the classes that already exist and adding some new ones. Where can I get the corpus with which the default FreeLing model was trained? The language I want to do this for is Spanish.

FreeLing is trained on public corpora.

NERC is trained with CoNLL 2002 shared task data: https://www.clips.uantwerpen.be/conll2002/ner/
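For reference, those files have one token per line with its BIO entity tag and a blank line between sentences, roughly like this (an invented illustrative snippet, not a real excerpt from the corpus):

    El O
    Hospital B-ORG
    Clínic I-ORG
    de I-ORG
    Barcelona I-ORG
    amplía O
    sus O
    instalaciones O
    . O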


Thank you very much. So what you are telling me is that, for example, for recognizing entities in Spanish, FreeLing used the data from this corpus (https://www.clips.uantwerpen.be/conll2002/ner/data/esp.train)?
That's 270k lines, so I guess that's enough. Then I need to transform this corpus into the full corpus format before processing, right?
Now a question: if I wanted to add new classes to the named-entity classification, could I use that same corpus with more elements, so as to keep the previous classifications?

Thank you very much. So what you are telling me is that, for example, for recognizing entities in Spanish, FreeLing used the data from this corpus (https://www.clips.uantwerpen.be/conll2002/ner/data/esp.train)?

yes, that's it


That's 270k lines, so I guess that's enough.

Enough or not, it is what is available. I think there is a train set, a development set, and a test set, so the total may be more than that.

Then I need to transform this corpus into the full corpus format before processing, right?

Yes, it should be put in the format expected by the training scripts.


Now a question: if I wanted to add new classes to the named-entity classification, could I use that same corpus with more elements, so as to keep the previous classifications?

Yes, as long as you go over the whole corpus and manually mark any instances of your new classes as B-NEWCLASS or I-NEWCLASS.
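For example (an invented snippet, assuming a hypothetical new class INST for installations), the new class is marked with the same BIO scheme as the existing ones:

    El O
    Museo B-INST
    del I-INST
    Prado I-INST
    reabre O
    hoy O
    . O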

I am not sure whether you need to change something in the training script to make it aware of the expected classes (you can check that with a quick look at the script).


Thank you very much for the quick help; I will take all that into account.

I will ask you here, since you seem to have experience in the matter, although my original plan was to open another post in case you cannot answer. Wouldn't it be possible to define in FreeLing a set of rules that help with the classification? For example, suppose you wanted to add a new entity class called Installation, covering hospitals, museums, etc., with the structure [word] of [Named Entity], where [word] belongs to a list of words such as (hospital, factory, building, etc.). The set of words matching this structure would then be classified as Installation. Do you think there is a way to do something like this?

FreeLing does not provide that functionality out of the box.

However, you can use FreeLing to process a text (tokenize it, get the PoS of each word, identify NEs, etc.) and then write a small program that detects the patterns you are interested in, using the information in the processed text.
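Just to illustrate the idea, here is a minimal sketch (not part of FreeLing; it assumes you have dumped the analysis to a file with one token per line and columns word / lemma / PoS / NE tag, and both that column layout and the trigger-word list are assumptions you would adapt):

    # Sketch: find "<trigger> de <named entity>" sequences and treat them
    # as a hypothetical INSTALLATION span.
    # Assumes tokens is a list of (word, lemma, pos, ne_tag) tuples for one
    # sentence, taken from FreeLing's processed output.
    TRIGGERS = {"hospital", "museo", "fábrica", "edificio"}

    def find_installations(tokens):
        spans = []
        for i, (word, lemma, pos, ne) in enumerate(tokens):
            if lemma.lower() in TRIGGERS and i + 2 < len(tokens):
                if tokens[i + 1][0].lower() in ("de", "del") and tokens[i + 2][3].startswith("B-"):
                    j = i + 3
                    # extend over the rest of the named entity (I- tags)
                    while j < len(tokens) and tokens[j][3].startswith("I-"):
                        j += 1
                    spans.append((i, j))   # tokens[i:j] form the Installation
        return spans

The same thing can be done directly on top of the FreeLing API; a file-based version is just easier to show here.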

That is the main goal of FreeLing: provide a tool that converts text into structured data, so you can use it for your desired application (extracting patterns in your case).

One possible workaround could be to use the chart parser: you could define a simple grammar that identifies the patterns you are interested in (check the user manual for the chart_parser module).
Since the parser is devised to perform shallow syntactic analysis, it may not have the right expressive power, depending on the patterns you want to find. In any case, it may be worth a try.


I've been reading about the chart_parser module and I still don't really understand it. FreeLing is an excellent project, but unfortunately the documentation still needs to improve a bit. I will keep investigating to see if I can understand how to use it well. I also wanted to ask: do you know if there is a script that converts the CoNLL 2002 corpus (https://www.clips.uantwerpen.be/conll2002/ner/data/esp.train) to the full corpus format, so the training can be done with the "prepare-corpus.sh" script?

I've been reading about the chart_parser module and I still don't really understand it.

A chart parser is a parser that expects a CFG grammar. The patterns you suggested (e.g. "(hospital|bank|government) of NE") can be described with a CFG grammar.
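Just to make that concrete, the rules could look roughly like this in plain BNF-style notation (a sketch only; the actual grammar file for the chart_parser module has its own syntax, described in the user manual):

    Installation -> Trigger "de" NE
    Trigger      -> "hospital" | "banco" | "museo" | "fábrica"
    NE           -> <whatever the NE recognizer marked as a named entity>

You would then look for Installation constituents in the resulting parse tree.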

You should not use this module until you understand what a CFG grammar is and can write one.


FreeLing is an excellent project, but unfortunately the documentation still needs to improve a bit.

  FreeLing is free software, developed with very low resources, and provided "as is".
  You are welcome to contribute anytime.

do you know if there is a script that converts the CoNLL 2002 corpus to the full corpus format, so the training can be done with the "prepare-corpus.sh" script?

There is none, but it shouldn't be difficult to build one, since the format is similar and FreeLing produces it (e.g. if you want to add PoS tag columns to the file).
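A minimal sketch of such a converter, in case it helps (the exact columns expected by prepare-corpus.sh are an assumption here, so check the script before relying on this; the lemma and PoS columns are left as placeholders to be filled with FreeLing's output):

    # Read CoNLL 2002 data (word NE-tag per line, blank line between
    # sentences) and emit word lemma pos NE-tag columns.
    import sys

    def convert(infile, outfile):
        # the CoNLL 2002 files are not UTF-8; latin-1 is an assumption, check yours
        with open(infile, encoding="latin-1") as fin, \
             open(outfile, "w", encoding="utf-8") as fout:
            for line in fin:
                line = line.strip()
                if not line:
                    fout.write("\n")         # keep sentence boundaries
                    continue
                word, tag = line.split()     # esp.* files have two columns
                fout.write(f"{word} - - {tag}\n")   # "-" = placeholder lemma/PoS

    if __name__ == "__main__":
        convert(sys.argv[1], sys.argv[2])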