Python API usage for coreference, semantic graph and NERC

Submitted by mvillanueva on Thu, 09/21/2023 - 09:18

----
Intro
----
Hi, I have been using FreeLing for a few months now to extract triplets. So far I have succeeded by using the dependency tree and the full parse tree, but I am trying to improve my approach by using coreference, the semantic graph, and NERC.

----
My work so far
----
I checked the Python tutorial, but I couldn't find anything beyond dependency parsing. So I went through the class list (since the same classes should be available for Python and C++) and decided to use relaxcor, but the analyze method of this class only accepts a document as a parameter.

----
Problems
----
So what I'm asking, if anyone can help me, is the following:
1. Document creation problem: how do you create a document using the Python API? All examples use a tokenizer and then a splitter, which returns a list of sentences. I checked some C++ examples provided in the simple_examples folder (https://github.com/TALP-UPC/FreeLing/blob/master/src/main/simple_exampl…) and I tried using the document method "insert", just like the example, but after several different attempts I couldn't find a way to provide the right parameters. Apparently it expects:
std::list< freeling::paragraph >::insert(
std::list< freeling::paragraph >::iterator,
std::list< freeling::paragraph >::value_type const &)

So I tried the method append, which didn't throw any errors, but I haven't fully checked whether this is correct: the only thing I checked is the number of words, and it didn't seem right. My text was a dummy test like "Sobre la mesa María ve y coge una manzana, un sombrero, una llave y dos paraguas rojo.", and after the append method the document had 80 words.
Here is my code so far for this:

# tokenize input line into a list of words
lw = tk.tokenize(text)
print(lw)
# split list of words in sentences, return list of sentences
ls = sp.split(lw)

paragraphs = pyfreeling.paragraph(ls)
# list_paragraphs = pyfreeling.ListParagraph([paragraphs])

doc = pyfreeling.document()
doc.append(paragraphs)

2. Doubt about entities: going back to the example "Sobre la mesa María ve y coge una manzana, un sombrero, una llave y dos paraguas rojo.", I realized that capitalized and lowercase words produce different results, but by making it all lowercase the entity recognition stops recognizing "maría" as a person. Is there a workaround for this, or am I going in the wrong direction? The main problem is that if "maría" is not recognized as a named entity (which I need it to be, by the way), then "maría" is no longer the subject of the sentence. Here is how I'm getting this:

neclass = pyfreeling.ner(lpath + "/nerc/ner/ner-ab-rich.dat")
ls = morfo.analyze_sentence_list(ls)
ls = tagger.analyze_sentence_list(ls)
ls = sen.analyze_sentence_list(ls)
ls = neclass.analyze_sentence_list(ls)
ls = wsd.analyze_sentence_list(ls)

ls = srl_parser.analyze_sentence_list(ls)
ls = dep.analyze_sentence_list(ls)
ls = parser.analyze_sentence_list(ls)
3. How to retrieve named entities: kind of a follow-up to the previous question: how do I get the named entities? I couldn't find any code related to this.

mvillanueva

Thu, 09/21/2023 - 10:37

I don't need a solution for problem 1 anymore; I just managed to create the document. Yet I'm unable to retrieve any groups; according to the semantic graph there are 0:
doc.get_semantic_graph().get_num_groups()

Problem 1

About document creation: you are doing it right; create the paragraph from ls, and then the document from the list of paragraphs. I am not sure that using [ ] will create a proper list.
Keep in mind that FreeLing is a C++ library, so you have to adapt to the C++ methods; not all of Python's flexibility translates easily to C++.

To get the coreference groups, you need to run relaxcor. For relaxcor to run properly, you need to have previously run the document through the whole pipeline (tagger, NERC, WSD, parser, SRL, ...).

Also, just running relaxcor will add coreference info to your document, but not a semantic graph. For that, you need to run the semantic graph extractor after relaxcor.
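A minimal sketch of that ordering, assuming the Python bindings expose the C++ `relaxcor` and `semgraph_extract` classes under these names; the data paths are illustrative, and `doc` is assumed to be a document that has already been run through the full pipeline:

```python
import pyfreeling

lpath = "/usr/share/freeling/es"  # illustrative data path; adjust to your install

# pick the relaxcor model that matches your parser:
# relaxcor_constit for constituency parses, relaxcor_dep for dependency parses
coref = pyfreeling.relaxcor(lpath + "/coref/relaxcor_dep/relaxcor.dat")
coref.analyze(doc)  # adds coreference groups to the already-analyzed document

# the semantic graph is built on top of the SRL and coreference results
sge = pyfreeling.semgraph_extract(lpath + "/semgraph/semgraph-SRL.dat")
sge.analyze(doc)

print(doc.get_semantic_graph().get_num_groups())  # should no longer be 0
```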

Problem 2&3

There are 3 different NERC modules in FreeLing: one rule-based and two ML-based. However, all of them use capitalization as a feature, and since both ML models are trained on standard text, all NEs seen in training are capitalized. So it is very unlikely that the CRF or the AdaBoost model will tag something in lowercase as a NE.

Entities are identified by their PoS tag. If a word (or multiword) has a PoS tag starting with "NP", then it was recognized by the NERC module. No other module assigns this tag.
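In code, that tag test can be applied to any analyzed sentence list. This helper is a sketch; the `get_words`/`get_form`/`get_tag` accessors follow the usual pyfreeling sample code:

```python
def named_entities(sentences):
    """Collect (form, tag) pairs for words whose PoS tag starts with "NP",
    i.e. words recognized as named entities by the NERC module."""
    ents = []
    for s in sentences:
        for w in s.get_words():
            if w.get_tag().startswith("NP"):
                ents.append((w.get_form(), w.get_tag()))
    return ents
```

Calling `named_entities(ls)` after `neclass.analyze_sentence_list(ls)` on the example sentence should return something like `[("María", "NP00SP0")]`.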


You mentioned that in order to run relaxcor I need to go through the whole pipeline. Well, I am calling the analyzer like this:

doc = pyfreeling.document()
analyzer.analyze(text, doc)

output = pyfreeling.output_freeling()
output.output_senses(False)
output.output_dep_tree(False)
output.output_corefs(False)
output.output_semgraph(True)

and configuring the analyzer like this, but I get no coreference output:

# define creation options for morphological analyzer modules
opt.config_opt.Lang = lang
opt.config_opt.TOK_TokenizerFile = f"{data}/{lang}" + "/tokenizer.dat"
opt.config_opt.SPLIT_SplitterFile = f"{data}/{lang}" + "/splitter.dat"
opt.config_opt.MACO_Decimal = "."
opt.config_opt.MACO_Thousand = ","
opt.config_opt.MACO_LocutionsFile = f"{data}/{lang}" + "/locucions.dat"
opt.config_opt.MACO_QuantitiesFile = f"{data}/{lang}" + "/quantities.dat"
opt.config_opt.MACO_AffixFile = f"{data}/{lang}" + "/afixos.dat"
opt.config_opt.MACO_ProbabilityFile = f"{data}/{lang}" + "/probabilitats.dat"
opt.config_opt.MACO_DictionaryFile = f"{data}/{lang}" + "/dicc.src"
opt.config_opt.MACO_NPDataFile = f"{data}/{lang}" + "/np.dat"
opt.config_opt.MACO_PunctuationFile = data + "/common/punct.dat"
opt.config_opt.MACO_ProbabilityThreshold = 0.001

opt.config_opt.NEC_NECFile = f"{data}/{lang}" + "/nerc/nec/nec-ab-poor1.dat"
opt.config_opt.SENSE_ConfigFile = f"{data}/{lang}" + "/senses.dat"
opt.config_opt.UKB_ConfigFile = f"{data}/{lang}" + "/ukb.dat"
opt.config_opt.TAGGER_HMMFile = f"{data}/{lang}" + "/tagger.dat"
opt.config_opt.PARSER_GrammarFile = f"{data}/{lang}" + "/chunker/grammar-chunk.dat"
opt.config_opt.DEP_TxalaFile = f"{data}/{lang}" + "/dep_txala/dependences.dat"
opt.config_opt.DEP_TreelerFile = f"{data}/{lang}" + "/treeler/dependences.dat"

opt.config_opt.MACO_CompoundFile = f"{data}/{lang}" + "/compounds.dat"
opt.config_opt.COREF_CorefFile = f"{data}/{lang}" + "/coref/relaxcor_constit/relaxcor.dat"

opt.invoke_opt.TAGGER_ForceSelect = pyfreeling.RETOK
opt.invoke_opt.SENSE_WSD_which = pyfreeling.UKB
opt.invoke_opt.TAGGER_which = pyfreeling.HMM
opt.invoke_opt.DEP_which = pyfreeling.TXALA
opt.invoke_opt.SRL_which = pyfreeling.TREELER
opt.invoke_opt.InputLevel = pyfreeling.TEXT
opt.invoke_opt.OutputLevel = pyfreeling.SEMGRAPH

# choose which modules among those available will be used by default
opt.invoke_opt.MACO_UserMap=False
opt.invoke_opt.MACO_AffixAnalysis = True
opt.invoke_opt.MACO_MultiwordsDetection = True
opt.invoke_opt.MACO_NumbersDetection = True
opt.invoke_opt.MACO_PunctuationDetection = True
opt.invoke_opt.MACO_DatesDetection = True
opt.invoke_opt.MACO_QuantitiesDetection = True
opt.invoke_opt.MACO_DictionarySearch = True
opt.invoke_opt.MACO_ProbabilityAssignment = True
opt.invoke_opt.MACO_CompoundAnalysis = False
opt.invoke_opt.MACO_NERecognition = True
opt.invoke_opt.MACO_RetokContractions = False

opt.invoke_opt.NEC_NEClassification = True
opt.invoke_opt.PHON_Phonetics = False

Relaxcor works better on the output of the LSTM parser than the Txala parser (the LSTM parser is much more accurate).

So, you need to set:
opt.config_opt.DEP_LSTMFile = f"{data}/{lang}" + f"/dep_lstm/params-{lang}.dat"
opt.invoke_opt.DEP_which = pyfreeling.LSTM

And since you are using the LSTM parser, and neither the chart parser nor the Txala parser, the following options are not necessary:
opt.config_opt.PARSER_GrammarFile = f"{data}/{lang}"+ "/chunker/grammar-chunk.dat"
opt.config_opt.DEP_TxalaFile =  f"{data}/{lang}" + "/dep_txala/dependences.dat"

You also need to provide a config file for the SRL:
opt.config_opt.SRL_TreelerFile = f"{data}/{lang}" + "/treeler/srl.dat"

as well as for the semgraph extractor:
opt.config_opt.SEMGRAPH_SemGraphFile = f"{data}/{lang}" + "/semgraph/semgraph-SRL.dat"


You can check file /usr/share/freeling/config/es.cfg to see which options are available for each module, and which config files may be set for each option.

Thank you for your quick response.
When I use relaxcor_dep, I get the following messages:
RELAXCOR_FEX: Requested feature RCF_DEMONYM not implemented. It will be ignored.
RELAXCOR_FEX: Requested feature RCF_LOC_MATCH not implemented. It will be ignored.

but if I change it to relaxcor_constit, I get: MENTION_DETECTOR: Error: document is not parsed.

The first two are warnings; you can ignore them.

relaxcor_constit is to be used if the parser was a constituency parser (i.e. the "PARSER" module in FreeLing, which I don't recommend since it is rule-based and rather obsolete).

Since you are using the LSTM parser, which is a dependency parser, you need to use relaxcor_dep.
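Concretely, in your analyzer configuration that means swapping the coreference config file to the dependency-based model (same COREF_CorefFile option; adjust the `data`/`lang` variables as in your script):

```python
# replace the relaxcor_constit path with the dependency-based model,
# which matches the LSTM dependency parser
opt.config_opt.COREF_CorefFile = f"{data}/{lang}" + "/coref/relaxcor_dep/relaxcor.dat"
```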