No rule to get short versions of chunk tags (English)

Submitted by PaulBreugnot on Wed, 08/08/2018 - 17:19

I am using FreeLing to process English sentences, specifically with hmm_tagger.
At first I got a lot of errors of the form "TAGSET: No rule to get short version of tag [tag]", where the tag had the form "tag1+tag2" (e.g. 'DT+MD', 'VB+PRP'...). These mostly correspond to English contracted forms such as I'll, we'd, you're.
Even though they don't crash the program, these errors flooded the terminal and slowed the program down, and it seems that those PoS tags were not assigned to the corresponding words.
So I decided to rewrite "tagset.dat", adding the missing rules, for example:

DT+MD DT+MD pos=determiner+verb|type=modal
MD+RB MD+RB pos=verb+adverb|type=modal+general
MD+RB+VBP MD+RB+VBP pos=verb+adverb+verb|type=modal+general|vform=personal

(all the tags come from a search in dicc.src)
I am not a linguistics expert and I have no idea how consistent the features I assigned are, but this fixed the problem for most of the tags.

However, I still get errors for these tags:

TAGSET: No rule to get short version of tag 'PRP+VBD'.
TAGSET: No rule to get short version of tag 'PRP+MD'.

But in my tagset.dat, I have :

PRP+MD PRP+MD pos=pronoun+verb|type=personal+modal
PRP+VBD PRP+VBD pos=pronoun+verb|type=personal|vform=past
PRP+VBD/MD PRP+VBD/MD pos=pronoun+verb|type=personal+modal|vform=past

Note that when I remove the rules

PRP+MD PRP+MD pos=pronoun+verb|type=personal+modal


PRP+VBD PRP+VBD pos=pronoun+verb|type=personal|vform=past

other errors show up. Otherwise, I get pairs of

TAGSET: No rule to get short version of tag 'PRP+VBD'.
TAGSET: No rule to get short version of tag 'PRP+MD'.

that always show up together, which makes me think there is a problem with the PRP+VBD/MD tag.
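
To see which tags the warnings actually mention, and how often, the stderr output can be summarized. This is only a minimal sketch: it assumes the warning lines look exactly like the ones quoted above, and the function name count_missing_tags is made up for illustration.

```python
import re
from collections import Counter

# Matches the warning format quoted in this thread, e.g.
#   TAGSET: No rule to get short version of tag 'PRP+VBD'.
WARNING_RE = re.compile(r"TAGSET: No rule to get short version of tag '([^']+)'")


def count_missing_tags(stderr_text):
    """Count how many times each tag appears in TAGSET warnings."""
    return Counter(WARNING_RE.findall(stderr_text))


# Sample log built from the warnings quoted above (not real program output).
log = (
    "TAGSET: No rule to get short version of tag 'PRP+VBD'.\n"
    "TAGSET: No rule to get short version of tag 'PRP+MD'.\n"
    "TAGSET: No rule to get short version of tag 'PRP+VBD'.\n"
)
counts = count_missing_tags(log)
for tag, n in counts.most_common():
    print(tag, n)
```

If two tags always come up with the same count, that hints that they are produced from the same dictionary entry.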

Any idea about how to solve this?


Wed, 08/08/2018 - 19:41

I think I found the origin of the problem, and modifying tagset.dat was clearly not the right approach.
If I understood the documentation of the Dictionary Search Module correctly, PoS tags of the form "tag1+tag2+..." are actually rules for tagging contractions, applied after the contractions are retokenized (which can be done by the Morphological Analyzer, for example). The "tag1+tag2+..." tags are not supposed to be used in their raw form, are they?
Because of other needs of my project, I didn't want to perform retokenization, which is what caused those errors.
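
To illustrate the idea (this is not FreeLing's code), here is a small sketch of how a compound tag can be read as one tag per piece of a contraction. The entry used in the example (lemma "I+will" with tag "PRP+MD") and the function name split_compound are assumptions for illustration; only the "tag1+tag2" shape comes from the discussion above.

```python
def split_compound(lemma, tag):
    """Pair each '+'-separated lemma part with its '+'-separated tag part."""
    lemmas = lemma.split("+")
    tags = tag.split("+")
    if len(lemmas) != len(tags):
        raise ValueError(f"mismatched parts: {lemma!r} vs {tag!r}")
    return list(zip(lemmas, tags))


# Hypothetical entry for "I'll": two words, each with its own simple tag.
pieces = split_compound("I+will", "PRP+MD")
print(pieces)
```

After such a split, the tagger only ever sees simple tags like PRP and MD, so no rule for 'PRP+MD' is needed in tagset.dat.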

You are right. These tag1+tag2 tags are split by the morphological analyzer and should not reach the tagger.

If you don't want retokenization, you can use the options --nortkcon (which prevents the morphological analyzer from retokenizing) AND --nortk (which prevents the tagger from retokenizing when it detects a compound tag).

You'll still get the warnings, but they are written to stderr, so you can just redirect them to /dev/null or the like.
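
If the analyzer is called from a script, the same thing can be done there. The sketch below is an assumption about the setup: it supposes the command-line program is called analyze, that it takes its configuration file with -f, and that it reads text from stdin. Only --nortkcon and --nortk come from the advice above, and the function names are made up.

```python
import subprocess


def build_analyze_command(config_path):
    """Build the command line with both retokenization options disabled."""
    # "analyze" and "-f" are assumptions about the local installation;
    # --nortkcon and --nortk are the options mentioned in this thread.
    return ["analyze", "-f", config_path, "--nortkcon", "--nortk"]


def run_analyze(text, config_path="en.cfg"):
    """Run the analyzer on text, discarding everything written to stderr."""
    completed = subprocess.run(
        build_analyze_command(config_path),
        input=text,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null in a shell
        text=True,
        check=True,
    )
    return completed.stdout
```

Note that discarding stderr hides every message, not only the TAGSET warnings, so it may be worth keeping stderr visible while debugging other problems.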