Unexpected differences between output format "tagged" and "morfo"

Submitted by koichi on Sun, 01/21/2018 - 17:17

Dear developer,

Could you check these two results? It looks like the "morfo" output format gives better results.


Here is the test sentence:

Marilla lighted a candle and told Anne to follow her, which Anne spiritlessly did, taking her hat and carpet-bag from the hall table as she passed.

When I don't want to use NER, is it better to use "morfo" than "tagged" output format?

Thank you!

If you deactivate NER, proper nouns are not recognized, so you will not get a proper analysis for "Anne".

In any case, "morfo" is not better than "tagged". Taking the highest-probability analysis from "morfo" output is equivalent to using a unigram tagger, so it is context-independent. That is, the word "walk" would always get the tag "VB", regardless of whether the sentence is "I walk every day" or "I took a long walk".
On the other hand, "tagged" output will vary depending on the sentence.
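
To make the unigram point concrete, here is a minimal Python sketch. It is not FreeLing code: the tiny tagged corpus, the tag names, and the functions train_unigram and tag_unigram are all made up for illustration. A unigram tagger simply stores the most frequent tag of each word, so "walk" receives the same tag in both sentences.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical, for illustration only).
TAGGED = [
    [("I", "PRP"), ("walk", "VB"), ("every", "DT"), ("day", "NN")],
    [("They", "PRP"), ("walk", "VB"), ("home", "NN")],
    [("I", "PRP"), ("took", "VBD"), ("a", "DT"), ("long", "JJ"), ("walk", "NN")],
]

def train_unigram(corpus):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words, default="NN"):
    """Tag every word on its own, ignoring the surrounding words."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram(TAGGED)
verb_use = tag_unigram(model, "I walk every day".split())
noun_use = tag_unigram(model, "I took a long walk".split())

# "walk" is tagged identically in both sentences, although it is a noun in the second.
print(verb_use[1], noun_use[4])
```

A context-aware tagger, which is what the "tagged" level provides, would be expected to tell these two uses apart.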

So, to summarize: if your text contains proper nouns and you expect them to be analyzed correctly, you should not deactivate NER.

Thank you for your reply!

Then, I would use NER. That's OK.

But can I disable the multiword detection of NER? NER extracts words that are too long for my purposes. For example, it extracts "anne_of_green_gables" as a single word from the sentence below.

But it's a million times nicer to be Anne of Green Gables than Anne of nowhere in particular, is not it?

I would like to extract and count "Anne" as a word, not "anne_of_green_gables". Could you suggest any possible solutions?


You can tune the NER behaviour in the np.dat configuration file.

For instance, you could remove "of" from the list of function words that are allowed inside a named entity. Then you would get "Anne" and "Green_Gables". (Though for a text such as "Bank of External Commerce" you would get "Bank" and "External_Commerce", so you need to weigh the pros and cons.)
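
The trade-off can be illustrated with a simplified Python sketch. This is not FreeLing's actual NER algorithm, and group_entities is a hypothetical helper: it just joins runs of capitalized tokens and lets them be bridged by words from an allowed function-word set, which is roughly the behaviour described above. Sentence-initial capitals are not handled, so the examples use sentence fragments.

```python
def group_entities(tokens, function_words):
    """Join runs of capitalized tokens into one entity.

    Lowercase tokens found in function_words may sit inside an entity,
    but only when a capitalized token follows them.
    """
    out, i = [], 0
    while i < len(tokens):
        if not tokens[i][:1].isupper():
            out.append(tokens[i])
            i += 1
            continue
        entity = [tokens[i]]
        j = i + 1
        while j < len(tokens):
            if tokens[j][:1].isupper():
                entity.append(tokens[j])
                j += 1
                continue
            # Look ahead over a run of allowed function words.
            k = j
            while k < len(tokens) and tokens[k].lower() in function_words:
                k += 1
            if k > j and k < len(tokens) and tokens[k][:1].isupper():
                entity.extend(tokens[j:k + 1])
                j = k + 1
            else:
                break
        out.append("_".join(entity))
        i = j
    return out

fragment = "to be Anne of Green Gables than Anne of nowhere".split()
with_of = group_entities(fragment, {"of", "the"})
without_of = group_entities(fragment, {"the"})
bank = group_entities("Bank of External Commerce".split(), {"the"})
print(with_of)
print(without_of)
print(bank)
```

With "of" allowed, the fragment yields the single entity "Anne_of_Green_Gables"; without it, "Anne" and "Green_Gables" come out separately, and "Bank of External Commerce" is split in the same way.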

See the user manual for the details and available options.