train-nerc

Submitted by tmyapple on Tue, 04/25/2017 - 06:52
Hi,
I'm trying to train a NER model using the train-nerc directory and the demo data that exists in train-nerc/corpus.
I'm following the scripts and have encountered several problems:
1. corpus/bin/extract-gaz.sh (called from prepare-corpus) gets stuck inside the second loop. It looks like the achieved ratio never reaches the goal ratio. I suspect this is due to the small volume of the demo set. Should it work as is, or is something wrong on my side?
2. ner/bin/encode-corpus.sh: the README says that only 2 arguments are needed, language and features. But the script also needs gaz-type (which the README lists as needed for NEC, not for NER). Without a gaz-type, the gaz features get a path that doesn't exist because of the command sed "s/\(gaz.*-[cp].dat\)/\1.$gz/": a missing gz means that the filename will end with a "." after the extension. Even setting that aside, the files that exist in en/nerc/data after running extract-gaz don't match the names in the features files. For example, the features look for the file gazPER-p.dat, but it does not exist, and I have gazPER-p.dat.rich.train instead.
So, does this mean that gaz-type is actually required, and that the README is out of date?
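
To illustrate the substitution described in point 2, here is a minimal Python sketch that mimics the quoted sed expression with `re.sub`. The helper name `add_gaz_suffix` is hypothetical; only the pattern and the `gazPER-p.dat` file name come from the post, and the real script is a shell script, not Python.

```python
import re

# Python rendering of the quoted sed expression:
#   s/\(gaz.*-[cp].dat\)/\1.$gz/
GAZ_PATTERN = re.compile(r"(gaz.*-[cp].dat)")

def add_gaz_suffix(line: str, gz: str) -> str:
    """Append '.<gz>' to the first gazetteer file name found in a line.

    Hypothetical helper for illustration; like sed without the 'g' flag,
    it rewrites only the first match.
    """
    return GAZ_PATTERN.sub(lambda m: m.group(1) + "." + gz, line, count=1)

print(add_gaz_suffix("gazPER-p.dat", "rich"))  # gazPER-p.dat.rich
print(add_gaz_suffix("gazPER-p.dat", ""))      # gazPER-p.dat.  (stray trailing dot)
```

With an empty gz the result keeps the stray trailing dot, which is the non-existent path described above.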
Actually I have more questions, but the forum keeps thinking I'm spamming, so I'll post them along the way...
many thanks
Tamir

In line 52:
$CORPUS/$LG.train.$feat.enc: I don't have such files. Because I used the gaz-type, the files I get are named something like en.f81.rich.train.enc. The order of the components in the name is different, and here it looks as if I shouldn't give a gaz-type to encode-corpus for the NER.
I'm clearly a little confused about the arguments, especially gaz-type, but the structure of the enc file name is also different from the one given as an argument in line 52.

any help will be very much appreciated..
Thank you

First of all, be aware that those scripts have not been updated since around version 3.0... So, there may be some things that need common sense to fix.
Also, note --as the readme says-- that these scripts are not final-user oriented, and require machine learning knowledge to understand what they do and tune them to your needs.

1. corpus/bin/extract-gaz.sh (is being called from prepare-corpus) - is stuck inside the second loop, looks like the achieved ratio doesn't achieve the goal ratio - i suspect that it is due to the small volume of the demo set. Should it work fine and something else is wrong on my side?

Yes, it may be due to the small size of the corpus. Just change the value for the target ratio and it should work.

2. ner/bin/encode-corpus.sh - in the readme it is written that there are only 2 arguments needed, language and features. But in the script it needs also gaz-type (which is written in the README as needed for NEC and not NER).

Yes, the third parameter is needed; the readme is probably out of date.

And even without it, the files that exist in en/nerc/data after using extract-gaz don't match the names in the features files, i.e. the features look for the file gazPER-p.dat, but it does not exist and I have gazPER-p.dat.rich.train?

The generated files are gazetteers that your final model may or may not use. You will need to write a configuration file for your NER module. This file will specify which feature extraction rules (.rgf file) to use, and these rules will use (or not) some gazetteer. The typical ML development loop would be to create different gazetteers and feature sets, evaluate the performance of each combination, and finally choose one of them (which is the one you will use in your default configuration files).

So, the file names do not work out of the box. You are supposed to select which ones you want to use, and set the right configuration.

Hi lluisp,
Thank you for the quick response, I appreciate your help,

I've been able to run the SVM training after some modifications to the scripts.
I had to change ner-svm to take another argument, gz, and also change the argument that "train-svm" gets from:
$CORPUS/$LG.$feat.train.enc
to
$CORPUS/$LG.$feat.$gz.train.enc
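
As a rough illustration of the renaming above, the following Python sketch builds both forms of the encoded training file name. The function name `encoded_train_path` and the example values are hypothetical; only the two path patterns come from the post, and the real scripts are shell scripts.

```python
from typing import Optional

def encoded_train_path(corpus: str, lg: str, feat: str,
                       gz: Optional[str] = None) -> str:
    """Build $CORPUS/$LG.$feat[.$gz].train.enc (hypothetical helper).

    Without gz this gives the name the old ner-svm call expected;
    with gz it gives the name used after the modification.
    """
    parts = [lg, feat]
    if gz:
        parts.append(gz)
    parts += ["train", "enc"]
    return corpus + "/" + ".".join(parts)

print(encoded_train_path("corpus", "en", "f81"))          # corpus/en.f81.train.enc
print(encoded_train_path("corpus", "en", "f81", "rich"))  # corpus/en.f81.rich.train.enc
```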

Later on, I saw that "ner-adaboost" is already written like the modifications I made to ner-svm. Therefore I guess that ner-svm.sh isn't as up to date as ner-adaboost.

If you think I'm right, I would be happy to push those changes and contribute to the project.

yes, you are probably right.
You're very welcome to push the changes.

thanks!

So, after successfully running the training scripts and the test scripts, and even getting some nice results for the demo (although it is a very small data set), I moved on to try AdaBoost.

It looks like I trained a model, but the model tags all the words with "B". I am not sure whether something went wrong or it is a matter of data size.
the output file i get looks like:
B B B B ...
B B B B ...
B B B B
B B B B
B B B B
B B B B
B B B B
B B B B
...

Instead of getting one column, I get 50 columns of tags for each word?! Any idea what might cause it?

Each column is the output of the model after one iteration, so you can compute a learning curve.
If you add more iterations (i.e. more weak rules to learn) you'll get more columns.

The idea is to find the minimum number of weak rules that produces a reasonable result (i.e. the point where the learning curve reaches a plateau) and cut the model file after that number of rules.
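
The learning curve described above could be computed along these lines. This is only a sketch under assumptions: it takes the output as whitespace-separated tags (one row per word, one column per iteration), uses plain per-word accuracy against a gold tag list as the score, and picks the plateau with a simple tolerance. The function names, the toy data, and the `tol` parameter are all hypothetical and not part of the train-nerc scripts.

```python
from typing import List

def learning_curve(pred_rows: List[List[str]], gold: List[str]) -> List[float]:
    """Accuracy of each column, i.e. of the model cut after 1, 2, ... rules."""
    if len(pred_rows) != len(gold):
        raise ValueError("predictions and gold tags must have the same length")
    n_iter = len(pred_rows[0])
    curve = []
    for i in range(n_iter):
        correct = sum(1 for row, g in zip(pred_rows, gold) if row[i] == g)
        curve.append(correct / len(gold))
    return curve

def plateau_point(curve: List[float], tol: float = 0.01) -> int:
    """Smallest number of rules whose score is within tol of the best score."""
    best = max(curve)
    for i, acc in enumerate(curve):
        if acc >= best - tol:
            return i + 1
    return len(curve)

# Toy data (made up): 4 words, 3 iterations.
output = "O B B\nO O O\nO O I\nB B B"
rows = [line.split() for line in output.splitlines()]
gold = ["B", "O", "I", "B"]

curve = learning_curve(rows, gold)
print(curve)                 # [0.5, 0.75, 1.0]
print(plateau_point(curve))  # 3
```

Accuracy is used here only to keep the example short; a chunk-level score could be substituted without changing the structure.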

Thank you again, I really appreciate your help and your dedication to this project.

I think that for adaboost it is reasonable that i will need much more data than for the SVM, if so that might explain why i get those weird results ("B" for all tags for all the iterations?!)

Is that what I should expect for the demo training/testing, or did something go extremely wrong in my training?

lack of data would probably result in random results. All "B" is not random at all, so I'd say there is a problem somewhere.
"B" is class zero, so it may be that the classifier is not classifying anything at all.
Have a look at the learned model (*.abm file). Does the content look like something learned, or is it just empty?

Looks like something was learned. Here are the first few lines:
mlDTree
---
+ 94
+ 2
- -2.78392 -2.78392 2.78392
+ 1
+ 83
- 0.818603 -0.818603 -0.818603
- -0.560903 -0.560903 0.560903
- -1.37019 -1.37019 1.37019
+ 85
+ 181
+ 268
- 1.76503 -1.76503 -1.76503
- -0.560903 -0.560903 0.560903
- -0.818603 -0.818603 0.818603
- -1.29846 1.29846 -1.29846
---

I now see that my results look a little different from what I first thought.
For some of the test segments it is not always "B",
but each column holds a constant value, such as "O" or "B", and when the value changes, it changes for all the lines in the column,
i.e.:
O O O B B B
O O O B B B
O O O B B B
O O O B B B
O O O B B B
...
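
A quick way to confirm the pattern shown above is to check which columns hold a single tag for every word. The sketch below assumes the output has been read as whitespace-separated tags with the same number of columns on every line; the function name `constant_columns` is hypothetical.

```python
from typing import Dict, List

def constant_columns(rows: List[List[str]]) -> Dict[int, str]:
    """Map each column index to its tag when every row has that same tag."""
    result = {}
    for i, column in enumerate(zip(*rows)):
        if len(set(column)) == 1:
            result[i] = column[0]
    return result

sample = "O O O B B B\nO O O B B B\nO O O B B B"
rows = [line.split() for line in sample.splitlines()]
print(constant_columns(rows))
# {0: 'O', 1: 'O', 2: 'O', 3: 'B', 4: 'B', 5: 'B'}
```

If every column shows up in the result, the classifier is assigning the same tag to all words at each iteration, whatever the input.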

yes, that file looks right... But you are right, it doesn't make much sense that all words change to "B" at the same time.
It may be a wrong rule it learned, maybe due to very small training data... (?)