Unknown WordNet offsets

Submitted by PaulBreugnot on Wed, 07/18/2018 - 21:41


I am currently using FreeLing through its Python API to perform sentiment analysis in a research project. More precisely, I use FreeLing for tokenization, lemmatization and sense disambiguation.
FreeLing returns synsets as WordNet offsets (e.g., '01123148-a'), if I am not mistaken. I then need to look up synset polarities in the SentiWordNet database, but to do so I have to refer to synsets by name (e.g., 'good.a.01'), as explained here: http://www.nltk.org/howto/sentiwordnet.html
The WordNet interface of nltk provides a function for that conversion, of2ss: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html
Most of the time it works well, but sometimes FreeLing returns synsets that of2ss does not seem to find, such as '80000779-n' for the lemma 'array', and this makes the program crash:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1363, in synset_from_pos_and_line
    columns_str, gloss = data_file_line.split('|')
ValueError: not enough values to unpack (expected 2, got 1)

I installed WordNet-3.0 to check the senses of 'array', and none of them corresponds to the one returned by FreeLing. However, according to their documentation, both the nltk WordNet interface and FreeLing should use the WordNet 3.0 database.
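
To keep the program from crashing on such offsets, the lookup could be wrapped as in the minimal sketch below. The helper name synset_name_or_none and its parameters are hypothetical. It assumes the lookup function (for example nltk's of2ss) raises ValueError for an unknown offset, as in the traceback above; other nltk versions may raise a different exception, which can be passed through the errors parameter.

```python
def synset_name_or_none(offset, lookup, errors=(ValueError,)):
    """Return the synset name for `offset`, or None if the lookup fails.

    `lookup` is any callable that maps an offset string such as
    '01123148-a' to an object with a .name() method (assumption:
    nltk's wn.of2ss behaves this way). `errors` lists the exception
    types treated as "synset not found".
    """
    try:
        return lookup(offset).name()
    except errors:
        return None


# Possible use with nltk (requires nltk and its WordNet data):
#   from nltk.corpus import wordnet as wn
#   name = synset_name_or_none('80000779-n', wn.of2ss)
#   if name is None:
#       pass  # skip this sense instead of crashing
```

Returning None rather than raising lets the caller skip unknown senses and continue with the remaining tokens.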

Any idea about this issue?


Wed, 07/18/2018 - 22:08

Looking in senses30.src, I discovered that there are a lot of offsets, starting from 80000001-a, that look somewhat strange.

15299367-n august_6 transfiguration transfiguration_day
15299585-n usance
15299783-n window
15300051-n 9-11 9/11 sep_11 sept._11 september_11
80000001-a nephropathic
80000002-a plasmidic
80000003-a ribosomal
80000004-a lysosomal
80000005-a chloroplastic
80000006-a fibrotic
80000007-a genomic
80000008-a stereochemical
80000009-a genetically_modified
80000010-a transgenic
80000011-a transgenic
80000012-a gametic

I remember that the other offsets that caused problems were also of this kind, so I have reason to think that of2ss cannot find those synsets in WordNet. That's not a huge problem, because I can work around it by having my program ignore those synsets (or even by removing them from senses30.src for my app...), and they are not very numerous and don't look very important.
But where do those synsets come from?

The WNs in FreeLing come from the MCR project, and thus they may contain enrichments over the original WNs.

So, you should ignore those senses if you want to use a pure WN-3.0 structure. They are easy to spot because the offset starts with "8".
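
A minimal sketch of that filter in Python, assuming sense identifiers are strings of the form '80000779-n' as in senses30.src. The function names are hypothetical and not part of FreeLing or nltk.

```python
def is_extension_offset(sense_id):
    """True for ids such as '80000779-n' whose offset starts with '8'."""
    return sense_id.startswith("8")


def split_senses(sense_ids):
    """Separate plain WordNet senses from the extra '8...' senses.

    Returns (standard, extensions), each keeping the input order.
    """
    standard = [s for s in sense_ids if not is_extension_offset(s)]
    extensions = [s for s in sense_ids if is_extension_offset(s)]
    return standard, extensions


standard, extensions = split_senses(["01123148-a", "80000779-n", "15299783-n"])
print(standard)
print(extensions)
```

Only the ids in the first list would then be passed on to the WordNet lookup.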