How to add new possible PoS tags?

Submitted by Xsoto on Thu, 11/16/2017 - 15:11


I'm trying to adapt FreeLing 4.0 so that it can analyze a domain-specific dictionary whose entries are identified by specific PoS tags. To do so, I have modified the file "tagset.dat" to include the new codes; for instance, to recognize the PoS tag "NC00SC0", I changed the following line in "tagset.dat":

N 2 noun type/C:common;P:proper gen/F:feminine;M:masculine;C:common num/S:singular;P:plural;N:invariable neclass/S:person;G:location;O:organization;V:other nesubclass/0:0;P:0 degree/V:evaluative

to this one:

N 2 noun type/C:common;P:proper gen/F:feminine;M:masculine;C:common num/S:singular;P:plural;N:invariable neclass/S:snomed;P:person;G:location;O:organization;V:other nesubclass/C:snomed;0:0;P:0 degree/V:evaluative

But it does not seem to recognize "NC00SC0"-style tags. Do I need to change any other file, or do something else, for the changes in "tagset.dat" to take effect?

BTW, I have noticed a mismatch in the related part of the user manual: at the beginning there are some examples in which the noun PoS tags have 7 characters (page 139), but in the table describing the meaning of each character (also for nouns) the tags have only 6 characters (page 140). The line extracted from "tagset.dat" shown at the end of page 139 is also different from the one in the current code. I suppose it was changed at some point but the documentation was not updated.



The file "tagset.dat" is used for two things:

1) define which part of the tag is the short version used by the tagger (e.g. two positions, "NC", for nouns, but three positions, "VMI", for verbs)

2) convert the tag to morphological features (e.g. convert "NCMS000" -> "pos=noun|type=common|gen=masc|num=sing")

The contents of the tagset.dat file must match the tagset used by the dictionary and the tagger; otherwise, the shortening or translation will not work.

If you need a word to get the tag NC00SC0, that word has to have that tag in the dictionary.
Then the tagger will use the shortened version (which should be NC) and will work properly.

Also note that the "0" symbols are zeros, not the uppercase or lowercase letter "o".
A zero means that the feature is unspecified.
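
To make the two roles of "tagset.dat" concrete, here is a minimal Python sketch of how a line like the ones quoted in the question could be interpreted. It is not FreeLing code: the function names (parse_rule, short_tag, features) are hypothetical, and it assumes that the second field of the line is the short-tag length, that each following field describes one tag position, and that an unlisted value is an error (FreeLing itself may handle that case differently). The feature values come out exactly as written in the line, so "masculine" rather than "masc".

```python
# Illustrative sketch only; this is NOT how FreeLing is implemented.
# Both rule lines are copied from the question above.
ORIGINAL_RULE = (
    "N 2 noun type/C:common;P:proper gen/F:feminine;M:masculine;C:common "
    "num/S:singular;P:plural;N:invariable "
    "neclass/S:person;G:location;O:organization;V:other "
    "nesubclass/0:0;P:0 degree/V:evaluative"
)
MODIFIED_RULE = (
    "N 2 noun type/C:common;P:proper gen/F:feminine;M:masculine;C:common "
    "num/S:singular;P:plural;N:invariable "
    "neclass/S:snomed;P:person;G:location;O:organization;V:other "
    "nesubclass/C:snomed;0:0;P:0 degree/V:evaluative"
)


def parse_rule(line):
    """Split a rule line into (category, short length, pos name, positions)."""
    fields = line.split()
    category, short_len, pos = fields[0], int(fields[1]), fields[2]
    positions = []
    for field in fields[3:]:
        # Each field looks like "gen/F:feminine;M:masculine;C:common".
        name, values = field.split("/", 1)
        mapping = dict(v.split(":", 1) for v in values.split(";"))
        positions.append((name, mapping))
    return category, short_len, pos, positions


def short_tag(tag, rule):
    """Role 1: keep only the leading positions used by the tagger."""
    _, short_len, _, _ = rule
    return tag[:short_len]


def features(tag, rule):
    """Role 2: translate a full tag into a feature string."""
    category, _, pos, positions = rule
    if not tag.startswith(category):
        raise ValueError(f"tag {tag!r} does not belong to category {category!r}")
    feats = ["pos=" + pos]
    for char, (name, mapping) in zip(tag[1:], positions):
        if char == "0":
            continue  # a zero (digit) means the feature is unspecified
        if char not in mapping:
            # Assumption of this sketch: unlisted values are rejected.
            raise ValueError(f"value {char!r} is not defined for {name!r}")
        feats.append(f"{name}={mapping[char]}")
    return "|".join(feats)


if __name__ == "__main__":
    original = parse_rule(ORIGINAL_RULE)
    modified = parse_rule(MODIFIED_RULE)
    print(short_tag("NC00SC0", modified))  # NC
    print(features("NCMS000", original))
    print(features("NC00SC0", modified))
```

With the original line, features("NC00SC0", ...) fails in this sketch because "C" is not a listed value for nesubclass; with the modified line it yields "pos=noun|type=common|neclass=snomed|nesubclass=snomed". In both cases the short tag is still "NC", which is what the tagger works with.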