I am currently using FreeLing through its Python API to perform sentiment analysis in a research project. More precisely, I use FreeLing for tokenization, lemmatization, and sense disambiguation.
If I am not mistaken, FreeLing returns synsets as WordNet offsets (e.g., '01123148-a'). I then need to look up each synset's polarity in the SentiWordNet database, but to do so I need the synset's name (e.g., 'good.a.01'), as explained here: http://www.nltk.org/howto/sentiwordnet.html
The NLTK WordNet interface provides a function for exactly this conversion, of2ss: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html
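To illustrate, the FreeLing sense string is an 8-digit WordNet offset plus a POS tag, which is what of2ss consumes. Below is a minimal sketch; the parse_sense helper is my own illustrative name (not part of either library), and the NLTK calls are shown in comments since they need the WordNet and SentiWordNet corpora downloaded:

```python
def parse_sense(sense):
    """Split a FreeLing sense string like '01123148-a' into (offset, pos)."""
    offset_str, pos = sense.rsplit('-', 1)
    return int(offset_str), pos

# With NLTK and its corpora installed, the conversion and the
# SentiWordNet lookup would look roughly like this:
#   from nltk.corpus import wordnet as wn, sentiwordnet as swn
#   synset = wn.of2ss('01123148-a')          # e.g. Synset('good.a.01')
#   scores = swn.senti_synset(synset.name()) # pos/neg/obj scores

print(parse_sense('01123148-a'))  # (1123148, 'a')
```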
Most of the time this works well, but sometimes FreeLing returns synsets that of2ss cannot resolve, such as '80000779-n' for the lemma 'array', which makes the program crash:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1363, in synset_from_pos_and_line
    columns_str, gloss = data_file_line.split('|')
ValueError: not enough values to unpack (expected 2, got 1)
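As a stopgap while debugging, I wrap the lookup so an unresolvable offset is skipped instead of crashing the whole pipeline. This is only a sketch: safe_of2ss is my own name, and fake_lookup is a stand-in that mimics the ValueError above (in the real code, lookup would be nltk's wn.of2ss):

```python
def safe_of2ss(lookup, sense):
    """Return the synset for a FreeLing sense string, or None if the
    lookup fails (e.g. the ValueError from the traceback above)."""
    try:
        return lookup(sense)
    except ValueError:
        return None

# Stand-in for wn.of2ss that reproduces the failure for the odd offset:
def fake_lookup(sense):
    if sense.startswith('8'):
        raise ValueError('not enough values to unpack (expected 2, got 1)')
    return 'good.a.01'

print(safe_of2ss(fake_lookup, '01123148-a'))  # good.a.01
print(safe_of2ss(fake_lookup, '80000779-n'))  # None
```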
I installed WordNet 3.0 to check the senses of 'array', and none of them corresponds to the offset returned by FreeLing. However, according to their documentation, both the NLTK WordNet interface and FreeLing use the WordNet 3.0 database.
Any idea about this issue?