Using the list of sentences provided by Dependencies as input to output_conll

Submitted by break on Wed, 02/08/2017 - 18:43

Hello, I'm looking for help getting CoNLL format for my dependencies output.
I'm running a pipeline whose input passes through Morph → Tagger → Chunker → Dependencies. In Dependencies I get the output in the usual dependencies format, but I need it in CoNLL format. So I created a method that takes the list of sentences (ls) produced by Dependencies and printed there, and uses output_conll.cc to create an instance and call the output_conll::PrintResults (wostream &sout, const list<sentence> &ls) function, but it's not working. Can I use the same 'ls' present in Dependencies and call conll->PrintResults(ls); right after PrintResults (ls);? Or should I pass the whole dependencies output file to output_conll::PrintResults(wostream &sout, const document &doc)? Or am I using it wrongly, and the method needs a certain input format?

Thanks.

If you are using "analyzer", you just need to set the option "--output conll".

If you are using your own main program to call FreeLing, you don't need to print the results in one format and then convert them. You can print them directly in conll using the output_conll class.

The basic working of output_conll is very simple:
// create an output_conll handler
freeling::io::output_conll myout;
 
// assuming that "ls" contains a list of sentences that have
// been properly processed and parsed, you can print them
// in conll format simply by calling PrintResults:
myout.PrintResults(wcout, ls);
 
// if you want to print results to a file instead of stdout,
// you can replace "wcout" with any other wostream
// (e.g. an open file)

That will print the sentences in a default format.
You can choose which columns you want and in which order they appear by providing a configuration file to the output_conll constructor.
Details on the config file are provided in the user manual:
https://talp-upc.gitbooks.io/freeling-user-manual/content/modules/io.ht…

First of all, I want to thank you for the help and the fast response; it really helped me.
I'm working with the fl files (fl1, ..., fl4_dependences), which make up my pipeline.
The corpus is already trained for my language and the input file is encoded in UTF-8, so I just feed each fl# program with the previous one's output, following the pipeline described in my first post.
However, the input file to fl4_dependences is the output of fl1_tagger.

I edited fl4_dependences to output ls in CoNLL format inside the while (std::getline (std::wcin, text)) loop, and again after that loop, to process the last sentence in the buffer if there is one.

I think I got the correct output, but I get some "modnorule" labels, like:

1 ... VMIF3S0 - - - - (grup-verb:2(verb:2 2 modnorule - -
2 ... VMN0000 - - - - (infinitiu:2(inf:2))) 0 top - -
... modnorule --

Is that normal? Is it because of the format of the sentence list?
Many thanks for the help!

The default behaviour of output_conll includes a column with chunking output (if you used the dep_txala parser). That is the column with "grup-verb" etc.

The dependency parsing is in the two columns right before the trailing "- -" (head and function respectively).
"modnorule" means that the parser didn't have any rule to label that dependency (dep_txala is a rule-based parser). The reason may be a strange or ungrammatical sentence.

In fact, if you use the "analyze" command with the same input, you'll see that the output includes "modnorule" too.

You can get more robust results if you use dep_treeler parser instead of dep_txala.

Hello lluisp, thank you once again.
I think I'm already using dep_treeler, because I pass its file when I run fl4_dependences with the output of fl1_tagger:

./fl4_dependences /usr/local/share/freeling/pt/chunker/grammar-chunk.dat /usr/local/share/freeling/pt/dep_treeler/dependences.dat TryOutCONLL

I'm not using the analyze command at all. The problem is that I get a great many "modnorule" labels; the ones that aren't "modnorule" are "top".

Here is one simple output that I got (in dependences and CoNLL format):


espec-ms/top/(este este DD0MS0) [
grup-verb/modnorule/(é ser VMIP3S0)
sn/modnorule/(ficheiro ficheiro NCMS000) [
espec-ms/modnorule/(um um DI0MS0)
]
sp-de/modnorule/(de de SP) [
sn/modnorule/(teste teste NCMS000)
]
sadv/modnorule/(aqui aqui RG)
grup-verb/modnorule/(temos ter VMIP1P0)
sn/modnorule/(texto texto NCMS000) [
espec-ms/modnorule/(um um DI0MS0)
]
prel/modnorule/(que que PR0CN00)
grup-verb/modnorule/(servir servir VMN0000) [
VMIP3S0/modnorule/(vai ir VMIP3S0)
]
sp-de/modnorule/(de de SP) [
sn/modnorule/(teste teste NCMS000)
]
grup-sp/modnorule/(para para SP) [
sn/modnorule/(programa programa NCMS000) [
espec-ms/modnorule/(este este DD0MS0)
n-fs/modnorule/(. . NCFS000)
]
]
]

1 este este DD0MS0 - - - - (espec-ms:1(dem-ms:1) 0 top - -
2 é ser VMIP3S0 - - - - (grup-verb:2(verb:2)) 1 modnorule - -
3 um um DI0MS0 - - - - (sn:4(espec-ms:3(indef-ms:3)) 4 modnorule - -
4 ficheiro ficheiro NCMS000 - - - - (grup-nom-ms:4(n-ms:4))) 1 modnorule - -
5 de de SP - - - - (sp-de:5 1 modnorule - -
6 teste teste NCMS000 - - - - (sn:6(grup-nom-ms:6(n-ms:6)))) 5 modnorule - -
7 aqui aqui RG - - - - (sadv:7) 1 modnorule - -
8 temos ter VMIP1P0 - - - - (grup-verb:8(verb:8)) 1 modnorule - -
9 um um DI0MS0 - - - - (sn:10(espec-ms:9(indef-ms:9)) 10 modnorule - -
10 texto texto NCMS000 - - - - (grup-nom-ms:10(n-ms:10))) 1 modnorule - -
11 que que PR0CN00 - - - - (prel:11) 1 modnorule - -
12 vai ir VMIP3S0 - - - - (grup-verb:13(verb:13 13 modnorule - -
13 servir servir VMN0000 - - - - (infinitiu:13(inf:13)))) 1 modnorule - -
14 de de SP - - - - (sp-de:14 1 modnorule - -
15 teste teste NCMS000 - - - - (sn:15(grup-nom-ms:15(n-ms:15)))) 14 modnorule - -
16 para para SP - - - - (grup-sp:16(prep:16) 1 modnorule - -
17 este este DD0MS0 - - - - (sn:18(espec-ms:17(dem-ms:17)) 18 modnorule - -
18 programa programa NCMS000 - - - - (grup-nom-ms:18(n-ms:18) 16 modnorule - -
19 . . NCFS000 - - - - (n-fs:19))))) 18 modnorule - -

Thank you once again for the help! It means a lot to me!

There are two dependency parsers in FreeLing:

  • dep_txala: rule-based parser. Requires the use of the chunker (class chart_parser), the same way fl4_dependences.cc does it.
  • dep_treeler: machine learning parser. It does not require the use of chunker nor dep_txala.

There is no dep_txala grammar for Portuguese, so you must use dep_treeler for this language.
This means that you have to change the code in fl4_dependences to create an instance of dep_treeler class, instead of dep_txala.
Since dep_treeler does not require the chunker, you don't need to instantiate nor call the chart_parser class.

I told you about "analyzer" because it is the default program for FreeLing and the easiest way to find out which results it produces in a given scenario.
You can use it to check whether your results are what they should be.
If your program gets the same results as "analyzer", then you are doing well. If you get different results, your program has a bug.

For instance, the command:
echo "este é um ficheiro de teste." | analyze -f pt.cfg --outlv dep
produces
root/(ficheiro ficheiro NCMS000 -) [
nsubj/(este este PD0MS00 -)
cop/(é ser VMIP3S0 -)
nsubj/(um um DI0MS0 -)
nmod/(teste teste NCMS000 -) [
case/(de de SP -)
]
nmod/(. . Fp -)
]

And the command
echo "este é um ficheiro de teste." | analyze -f pt.cfg --outlv dep --output conll
produces
1 este este PD0MS00 - - - - - 4 nsubj - -
2 é ser VMIP3S0 - - - - - 4 cop - -
3 um um DI0MS0 - - - - - 4 nsubj - -
4 ficheiro ficheiro NCMS000 - - - - - 0 root - -
5 de de SP - - - - - 6 case - -
6 teste teste NCMS000 - - - - - 4 nmod - -
7 . . Fp - - - - - 4 nmod - -
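If you later need to read these rows back in your own code, a minimal C++ sketch could look like the one below. It assumes the default 13-column layout seen in the output above (a custom output_conll config file would change the columns), and the names ConllToken, parse_row, parse_sentence and root_form are invented for this illustration; they are not part of FreeLing.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One row of the 13-column output shown above (only the fields used here).
// Hypothetical helper type, not a FreeLing class.
struct ConllToken {
    int id = 0;          // column 1: token position in the sentence
    std::string form;    // column 2: word form
    std::string lemma;   // column 3: lemma
    std::string tag;     // column 4: PoS tag
    int head = 0;        // column 10: id of the head token, 0 for the root
    std::string deprel;  // column 11: dependency function
};

// Split a line into whitespace-separated fields.
std::vector<std::string> split_fields(const std::string &line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    for (std::string f; in >> f;) fields.push_back(f);
    return fields;
}

// Parse one row; returns false for blank or malformed lines.
bool parse_row(const std::string &line, ConllToken &tok) {
    const std::vector<std::string> f = split_fields(line);
    if (f.size() != 13) return false;  // assumption: default column layout
    try {
        tok.id = std::stoi(f[0]);
        tok.head = std::stoi(f[9]);
    } catch (const std::logic_error &) {  // id or head is not a number
        return false;
    }
    tok.form = f[1];
    tok.lemma = f[2];
    tok.tag = f[3];
    tok.deprel = f[10];
    return true;
}

// Parse a block of text with one row per line, skipping unusable lines.
std::vector<ConllToken> parse_sentence(const std::string &text) {
    std::vector<ConllToken> toks;
    std::istringstream in(text);
    for (std::string line; std::getline(in, line);) {
        ConllToken t;
        if (parse_row(line, t)) toks.push_back(t);
    }
    return toks;
}

// Form of the first token whose head is 0, or "" if there is none.
std::string root_form(const std::vector<ConllToken> &toks) {
    for (const ConllToken &t : toks)
        if (t.head == 0) return t.form;
    return "";
}
```

Calling parse_sentence on the seven rows above and then root_form on the result should give "ficheiro", the token whose head column is 0.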

Finally, note that the programs fl1_, fl2_, etc. do the same work as "analyzer". They are only simple examples to illustrate the use of the modules individually.

Unless you have very specific needs that require you to build your own program, you might be better off simply using "analyzer". It has lots of options that will allow you to customize which modules are used, with which configuration files, which output format is produced, etc.

Hello lluisp, thank you for the reply. The analyzer does give good output. Now I'm trying to do what you suggested: create an instance of dep_treeler and remove the chart parser.
But I may be doing something wrong, because I get an empty output.
What I did was comment out all the chart parser code and create the new dep from argv[1].


dependency_parser *dep = NULL;
//chart_parser *parser = NULL;
...
//parser = new chart_parser(util::string2wstring(argv[1]));
//dep = new dep_txala (util::string2wstring(argv[2]), parser->get_start_symbol ());

dep = new dep_treeler (util::string2wstring(argv[1]));
...
ls.push_back (av);
//parser->analyze (ls);
dep->analyze (ls);
PrintResults (ls);
...

Analyzer really gives great output, but it would be great if I could do the same in fl4_dependences.

Once again, thank you for the help!
Bruno.

Your program looks good.

If it is not printing what you want, it is probably because the function PrintResults is printing the tree in the parenthesized FreeLing format.
You just need to replace the call to PrintResults with a call to output_conll::PrintResults as I described in my first reply.

If you are having a different problem, please be more specific. Post examples of which command you run, which is the input sentence, which output you get, and which output you expected.

Hello lluisp, I tried what you said. I feed a file that is the output of the Tagger to fl4_dependences like this:
./fl4_dependences /usr/local/share/freeling/pt/dep_treeler/dependences.dat TryOutCONLLV3

Eureka!
the output:

1 este este DD0MS0 - - - - - 4 nsubj - -
2 é ser VMIP3S0 - - - - - 4 cop - -
3 um um DI0MS0 - - - - - 4 nsubj - -
4 ficheiro ficheiro NCMS000 - - - - - 0 root - -
5 de de SP - - - - - 6 case - -
6 teste teste NCMS000 - - - - - 4 nmod - -
7 aqui aqui RG - - - - - 8 advmod - -
8 temos ter VMIP1P0 - - - - - 4 acl - -
9 um um DI0MS0 - - - - - 8 xcomp - -
10 texto texto NCMS000 - - - - - 8 dobj - -
11 que que PR0CN00 - - - - - 12 nsubj - -
12 vai ir VMIP3S0 - - - - - 10 acl:relcl - -
13 servir servir VMN0000 - - - - - 12 xcomp - -
14 de de SP - - - - - 15 case - -
15 teste teste NCMS000 - - - - - 13 nmod - -
16 para para SP - - - - - 13 nmod - -
17 este este DD0MS0 - - - - - 13 nmod - -
18 programa programa NCMS000 - - - - - 13 nmod - -
19 . . NCFS000 - - - - - 18 nmod:npmod - -
.....

Now that I have the CoNLL format, I just need to change the tagset to Penn Treebank (to feed it to the Ollie information extraction system).
What's the best way to do it? Maybe a separate program that converts it? What is the correspondence between this tagset and the Penn Treebank tagset?

Thank you, you're being a great friend to me!
Bruno

Penn Treebank tagset is for English. There is no PTB tagset for Portuguese, neither in FreeLing nor in any other suite. If you are analyzing Portuguese text, you will get EAGLES tags.

If input is English and you use English data, FreeLing will produce PTB tags.

In any case, I think Ollie only works in English, not in Portuguese or other languages.

I'm trying to get Ollie to work with Portuguese by compiling a Portuguese .conll into an .mco file with MaltParser, which Ollie will consume to get the extractions. I was told that my Portuguese tagger uses different part-of-speech tags than the ones Ollie is trained on. For example, I use "DET" instead of "DT" and "NOUN" instead of "NN". Using different part-of-speech tags will cause Ollie's patterns not to match, so I was trying to get Ollie to work with an .mco built from the dependences .conll output from FreeLing. Can it be done?

My .conll
1 Um um DET ART PronType=Art|Number=Sing|Gender=Masc 2 det _ _
2 revivalismo revivalismo NOUN N Number=Sing|Gender=Masc 0 root _ _
3 refrescante refrescante ADJ ADJ _ 2 amod _ _

Output .conll from FreeLing
1 este este DD0MS0 - - - - - 4 nsubj - -
2 é ser VMIP3S0 - - - - - 4 cop - -
3 um um DI0MS0 - - - - - 4 nsubj - -
4 ficheiro ficheiro NCMS000 - - - - - 0 root - -

So the tagset in my FreeLing output is EAGLES, and it has to be like that?

Thank you for all the help!
Bruno

You can do it, but not straightforwardly.

First, you need to fix the config file to be aware of the tagset.
Add the following line to pt.cfg
TagsetFile=$FREELINGSHARE/pt/tagset.dat

(or download the fixed file from github master)

Then, analyzer will produce:
1 este este PD0MS00 PD pos=pronoun|type=demonstrative|gen=masculine|num=singular - - - 4 nsubj - -
2 é ser VMIP3S0 VMI pos=verb|type=main|mood=indicative|tense=present|person=3|num=singular - - - 4 cop - -
3 um um DI0MS0 DI pos=determiner|type=indefinite|gen=masculine|num=singular - - - 4 nsubj - -
4 ficheiro ficheiro NCMS000 NC pos=noun|type=common|gen=masculine|num=singular - - - 0 root - -
5 de de SP SP pos=adposition|type=preposition - - - 6 case - -
6 teste teste NCMS000 NC pos=noun|type=common|gen=masculine|num=singular - - - 4 nmod - -
7 . . Fp Fp pos=punctuation|type=period - - - 4 nmod - -

After that, you need to map the PoS tags to the tagset that Ollie expects.
You need to write a script that does this. You will need to decide which tag is equivalent to which (check the meaning of the tags in the FreeLing user manual).
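As a starting point, such a script could be sketched in C++ as follows. The mapping table is only a rough, lossy guess made up for this illustration, not an official EAGLES/PTB correspondence: it relies on the tag positions visible in the output above (for verbs: type, mood, tense, person, number), leaves pronouns and several verb forms unmapped, and does not touch the dependency labels. The function names eagles_to_ptb and retag_row are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Illustrative, lossy EAGLES -> PTB-style mapping (an assumption of this
// sketch). Returns "" when the sketch has no counterpart for the tag.
std::string eagles_to_ptb(const std::string &tag) {
    if (tag.empty()) return "";
    // Character at position i, or '0' if the tag is shorter than that.
    auto at = [&tag](std::size_t i) { return i < tag.size() ? tag[i] : '0'; };

    switch (tag[0]) {
    case 'N': {  // noun: type at [1] (C common, P proper), number at [3]
        const bool plural = at(3) == 'P';
        if (at(1) == 'P') return plural ? "NNPS" : "NNP";
        return plural ? "NNS" : "NN";
    }
    case 'V': {  // verb: mood at [2], tense at [3], person at [4], number at [5]
        if (at(2) == 'N') return "VB";   // infinitive
        if (at(2) == 'G') return "VBG";  // gerund
        if (at(2) == 'P') return "VBN";  // participle
        if (at(3) == 'P') return (at(4) == '3' && at(5) == 'S') ? "VBZ" : "VBP";
        if (at(3) == 'I' || at(3) == 'S') return "VBD";  // imperfect or past
        return "";  // e.g. future or conditional: no single PTB tag
    }
    case 'D': return "DT";                        // determiner
    case 'A': return "JJ";                        // adjective
    case 'R': return "RB";                        // adverb
    case 'S': return "IN";                        // adposition
    case 'Z': return "CD";                        // number
    case 'C': return at(1) == 'S' ? "IN" : "CC";  // conjunction
    case 'F': return tag == "Fp" ? "." : "";      // only the period is handled
    default:  return "";                          // pronouns etc. left unmapped
    }
}

// Replace the tag column (4th field) of one whitespace-separated row,
// keeping the original tag when no mapping is known.
std::string retag_row(const std::string &line) {
    std::istringstream in(line);
    std::vector<std::string> f;
    for (std::string s; in >> s;) f.push_back(s);
    if (f.size() < 4) return line;  // not a token row: leave untouched
    const std::string mapped = eagles_to_ptb(f[3]);
    if (!mapped.empty()) f[3] = mapped;
    std::string out;
    for (std::size_t i = 0; i < f.size(); ++i) {
        if (i > 0) out += ' ';
        out += f[i];
    }
    return out;
}
```

Returning an empty string for unmapped tags keeps the gaps visible, so you can list which EAGLES tags still need a decision instead of silently guessing.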

In any case, tagset conversion is a tricky process (not all tagsets have the same granularity or the same criteria to establish the list of categories, or to decide which category a word belongs to). Moreover, PTB is designed for English, so some Portuguese categories may not be represented and vice versa.

Finally, Ollie is trained for English, and it probably needs more than just the same PoS tags to work: It probably uses word forms and lemmas (which it will expect to be English) and dependency labels (which are not the same in English and Portuguese, so you will need to map those too).

Thank you once again lluisp. It's tricky, and PTB is really hard to match with the EAGLES tagset; it's almost impossible to do so.
I just wanted to extract some relations from an input text for my thesis, but it's proving really hard, since after all that work I might not be able to do it... and that's heartbreaking.
I've mapped just some tags to PTB to check whether there is any output, but no extractions were found. I don't know what to do now. Maybe try the new OpenIE 4.x, because even if I map some tags, which seems to be impossible, I think the word forms and lemmas will break the program, and mapping all those things, I don't know... I really don't know...

Thank you lluisp.
Bruno Marques

You can try to extract some relations by other means.

Most of Ollie relations are based on the syntactic structure of the tree (e.g. someone says something, an event is conditioned to another with an "if" clause, etc).

An easy way can be feeding some of these sentences in Portuguese to FreeLing, see which tree structures it produces, and then, write a small program that looks for those structures in any given tree.

In this way, you would have your own mini-extractor, customized to the relations you need (it will probably detect fewer things than Ollie does, but it may be enough for your purposes).

Hello lluisp.
So, I have outputs from Morph, Tagger, Chunker and Dependencies.
First of all I have to focus on a scenario, right? That is, do Named Entity Recognition for the kind of information that I want to extract. Am I thinking right?
Then I have to feed FreeLing with the input text on which I'm going to perform the IE, but which output do I have to look at? Chunker or Dependencies?

Thank you for all the help lluisp.
Bruno Marques

You can use either chunker or dependencies. That would depend on your goal.
I think dependencies is better because you can look for patterns that include the syntactic function (e.g. find a verb "say" with a subject that is a person and a direct object, and extract the triplet)
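A pattern search of that kind could be sketched in C++ like this. It works on already-parsed rows rather than on FreeLing objects, so the Token and Triple structs and the functions find_child and extract_triples are invented for this example. It assumes verbs carry EAGLES tags starting with "V" and that the parser uses the "nsubj" and "dobj" labels seen in the dep_treeler output earlier in this thread; checking that the subject is a person is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal token record (hypothetical): the fields a dependency pattern needs.
struct Token {
    int id;              // position in the sentence, starting at 1
    std::string form;    // word form
    std::string lemma;   // lemma
    std::string tag;     // PoS tag (EAGLES assumed)
    int head;            // id of the head token, 0 for the root
    std::string deprel;  // dependency function
};

// One extracted relation: subject form, verb lemma, object form.
struct Triple {
    std::string subject;
    std::string verb;
    std::string object;
};

// First dependent of head_id carrying the given label, or nullptr if none.
const Token *find_child(const std::vector<Token> &sent, int head_id,
                        const std::string &label) {
    for (const Token &t : sent)
        if (t.head == head_id && t.deprel == label) return &t;
    return nullptr;
}

// Collect (subject, verb, object) for every verb that has both an "nsubj"
// and a "dobj" dependent. If verb_lemma is non-empty, only verbs with that
// lemma are considered.
std::vector<Triple> extract_triples(const std::vector<Token> &sent,
                                    const std::string &verb_lemma) {
    std::vector<Triple> out;
    for (const Token &v : sent) {
        if (v.tag.empty() || v.tag[0] != 'V') continue;  // not a verb
        if (!verb_lemma.empty() && v.lemma != verb_lemma) continue;
        const Token *subj = find_child(sent, v.id, "nsubj");
        const Token *obj = find_child(sent, v.id, "dobj");
        if (subj && obj) out.push_back({subj->form, v.lemma, obj->form});
    }
    return out;
}
```

Only the heads of the subject and object are returned here; a fuller version would probably collect the whole subtree under each of them to recover complete phrases.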

Hello lluisp. Sorry for not saying anything sooner.
I didn't tell you before, but I think it's relevant: the goal of this work is to analyze comments from some source, such as reviews of hotels, restaurants, or any other service I might be interested in, and then extract information about the positive and negative aspects.
Given what I have, what's the best option? Look at the dependences file, or the chunker output, and perform rule-based information extraction? For example, extract everything with positive aspects (liked, good, enjoyable, love...) and negative ones (didn't like, rude, bad, smelly...) and what they refer to? I'm a little bit lost at this stage.

Thank you once again for all the help.
Bruno Marques

This forum is for FreeLing issues, and I think I have solved yours.

If you want information on how to build a sentiment analysis system, that is know-how beyond how to use FreeLing.

If you need this for academic purposes, you should ask your professor or advisor.
If you want it for business purposes, you are welcome to hire our consulting services. Please contact me by email.

Hello lluisp.
No, you misunderstood me. I just wanted to share the core of the work and what I need all this work for, and to thank you for helping me. You told me I could perform IE on the chunker or dependences output file; that's what my question was about. But I already have a solution.
If I have any other questions about FreeLing, it will be my pleasure to ask here!

Thank you once again for the precious help!