I've been using FreeLing for a while and it seems like a great project. I have mostly used it to work with named entities. Now I would like to add new entity classes to the recognizer, so I wonder whether someone here has already done this and could help me a little. I have been looking at the code, and it is not as simple as using a corpus labeled with the new entities: several things in the scripts have to change.
In short: if someone in the community has already added new entities to the recognition in FreeLing, could you give me some advice on which files to modify and how?
(Additionally, I have another question about how to obtain the elements needed to create a corpus in full format, specifically the lemma, the PoS tag, and the list of lemma-tag-probability triples for all possible word readings.)
To alter the set of classes, there are indeed a few changes to make to the scripts.
Basically, you should look for the places where the FreeLing-style tags (NP00G00, NP00SP0, etc.) are used or mapped to generic class names (LOC, PER, etc.).
This happens in:
corpus/bin/extract-gaz.sh
nec/bin/extract-gaz.sh
Several places mapping one set of labels to the other:
nom["NP00SP0"]="PER"; nom["NP00G00"]="LOC";
nom["NP00O00"]="ORG"; nom["NP00V00"]="MISC";
corpus/bin/nec2full.awk
Same again:
c["NP00SP0"]="PER"; c["NP00G00"]="LOC";
c["NP00O00"]="ORG"; c["NP00V00"]="MISC";
nec/bin/nec-adaboost.sh
nec/bin/nec-svm.sh
Some places with a list of freeling labels for the trainer:
"0 NP00SP0 1 NP00G00 2 NP00O00 3 NP00V00"
Those should be the only things you need to change.
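Since the same tag set shows up in three different shapes in the scripts listed above, one way to keep them consistent is to generate all three from a single table. The following is only a minimal sketch of that idea, not part of FreeLing: the function names are made up for illustration, and the commented-out `NP00X00` / `NEWCLASS` pair is a hypothetical placeholder, not a real tag. The output is meant to be copied by hand into the scripts.

```python
# One table for the tag-to-class mapping, so the three scripts stay in sync.
# The four active pairs are the ones quoted in this thread.
TAG_TO_CLASS = {
    "NP00SP0": "PER",
    "NP00G00": "LOC",
    "NP00O00": "ORG",
    "NP00V00": "MISC",
    # "NP00X00": "NEWCLASS",  # hypothetical new class, NOT a real FreeLing tag
}


def awk_assignments(array_name, mapping):
    """Build awk assignments such as nom["NP00SP0"]="PER"; for every tag."""
    return " ".join(
        f'{array_name}["{tag}"]="{cls}";' for tag, cls in mapping.items()
    )


def trainer_labels(mapping):
    """Build the numbered label list used by the trainer scripts."""
    return " ".join(f"{index} {tag}" for index, tag in enumerate(mapping))


if __name__ == "__main__":
    print(awk_assignments("nom", TAG_TO_CLASS))  # for the extract-gaz.sh scripts
    print(awk_assignments("c", TAG_TO_CLASS))    # for nec2full.awk
    print(trainer_labels(TAG_TO_CLASS))          # for nec-adaboost.sh / nec-svm.sh
```

Uncommenting the extra entry adds the new class to all three generated strings at once, which makes it harder to update one script and forget another.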
You can use FreeLing to produce the PoS tag column. You can use FreeLing with the option "--outlv morfo" to produce the list of possible PoS tags. You can then use "paste" (or a small Python script) to put these together. The column with the NE BIO class will have to be added manually.
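The merge described above could look roughly like the sketch below. It rests on several assumptions that should be checked against your own files: both outputs are taken to have one token per line with whitespace-separated fields, the tagged output is assumed to start with form, lemma and tag, the "morfo" output is assumed to start with the form followed by the lemma/tag/probability triples, and the column order written out here is a guess rather than the documented full format. The function name and the "O" placeholder for the NE column are illustrative only.

```python
def merge_columns(tagged_lines, morfo_lines, ne_placeholder="O"):
    """Merge line-aligned tagged and morfo outputs into one line per token.

    Assumed output layout (adjust to what the training scripts expect):
    form lemma tag NE-placeholder lemma1 tag1 prob1 lemma2 tag2 prob2 ...
    """
    if len(tagged_lines) != len(morfo_lines):
        raise ValueError("both inputs must have the same number of lines")

    merged = []
    for tagged, morfo in zip(tagged_lines, morfo_lines):
        tagged_fields = tagged.split()
        morfo_fields = morfo.split()

        # Blank lines separate sentences; keep them as they are.
        if not tagged_fields and not morfo_fields:
            merged.append("")
            continue

        # Both files must describe the same token on the same line.
        if (len(tagged_fields) < 3 or not morfo_fields
                or tagged_fields[0] != morfo_fields[0]):
            raise ValueError(f"misaligned lines: {tagged!r} vs {morfo!r}")

        form, lemma, tag = tagged_fields[:3]
        readings = morfo_fields[1:]  # all lemma/tag/probability triples
        merged.append(" ".join([form, lemma, tag, ne_placeholder] + readings))
    return merged


if __name__ == "__main__":
    import sys

    # Usage: python merge.py tagged.txt morfo.txt > merged.txt
    with open(sys.argv[1], encoding="utf-8") as f_tagged:
        tagged_input = f_tagged.read().splitlines()
    with open(sys.argv[2], encoding="utf-8") as f_morfo:
        morfo_input = f_morfo.read().splitlines()
    print("\n".join(merge_columns(tagged_input, morfo_input)))
```

The placeholder column only reserves the slot; the actual B/I/O labels still have to be entered by hand, as noted above.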