Dependency parser generated info after analysis

Submitted by Josenp1311 on Fri, 01/21/2022 - 10:23

Hello everyone.

I'm trying to replicate the dependency parsing results from the 4.2 demo with the Java 4.0 version, but I'm having some trouble doing so.

In the 4.2 demo, the CoNLL output has two new columns once dependency parsing is selected. That's the information I'm looking for.

First, I can't find where that information is stored after running the analysis with the rule-based dependency parser. I assume it is stored in the DepTree generated for the sentence after analyzing it, but I don't know exactly where. The closest thing I found is the label on the nodes, but those labels don't match the ones in the 4.2 demo.

I don't know whether I'm looking in the wrong place, the labels are different in the 4.0 version, I'm using the wrong dependency parsing method, or what I'm trying to do is impossible in the 4.0 version.

Thank you for your time!

Your question is a bit confusing... "two new columns"... which two exactly? The columns are always there, but depending on the analyzers used they contain just dashes...

I'm going to assume that you mean columns 10 and 11 as shown in the demo, which contain the dependency head and function for each token.

If you use the FreeLing API, you'll get a dep_tree with the same information. The only difference is that the head is not stored as a number: it is simply the parent of each node, so you don't need the numbers, and accessing the parent node does the trick. The function is the 'label' attribute of the node, which can be obtained with get_label().
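
To make the "head = parent node" idea concrete, here is a minimal sketch in Java. It does not call FreeLing: the Node class, its fields, the example sentence and the labels are all hypothetical stand-ins for whatever your dep_tree traversal gives you. It also assumes each node knows its token position, and it prints 0 as the head of the root, following the usual CoNLL convention.

```java
import java.util.ArrayList;
import java.util.List;

class DepColumnsSketch {

    // Hypothetical stand-in for a dependency tree node. This is NOT a FreeLing class.
    static final class Node {
        final int position;   // 1-based token position in the sentence (assumed available)
        final String form;    // word form
        final String label;   // dependency function, i.e. what get_label() would return
        final List<Node> children = new ArrayList<>();
        Node parent;          // null for the root

        Node(int position, String form, String label) {
            this.position = position;
            this.form = form;
            this.label = label;
        }

        // Attaches a child and records this node as its parent.
        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    // Builds one "position, form, head, label" row per token, in sentence order.
    static List<String> toRows(Node root) {
        List<Node> nodes = new ArrayList<>();
        collect(root, nodes);
        nodes.sort((a, b) -> Integer.compare(a.position, b.position));

        List<String> rows = new ArrayList<>();
        for (Node n : nodes) {
            // The head column is just the position of the parent (0 for the root).
            int head = (n.parent == null) ? 0 : n.parent.position;
            rows.add(n.position + "\t" + n.form + "\t" + head + "\t" + n.label);
        }
        return rows;
    }

    // Depth-first walk that gathers every node of the tree.
    private static void collect(Node n, List<Node> out) {
        out.add(n);
        for (Node c : n.children) {
            collect(c, out);
        }
    }

    // Illustrative tree for "The cat sleeps"; the labels are made up for the example.
    static Node example() {
        Node the = new Node(1, "The", "det");
        Node cat = new Node(2, "cat", "subj");
        Node sleeps = new Node(3, "sleeps", "root");
        cat.add(the);
        sleeps.add(cat);
        return sleeps;
    }

    public static void main(String[] args) {
        for (String row : toRows(example())) {
            System.out.println(row);
        }
    }
}
```

With the real API you would fill the same two columns by walking the dep_tree, reading the label of each node and looking up the position of the word held by its parent.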

As for labels matching between versions, that depends on which parser you use. The rule-based parser uses a custom set of labels. Statistical parsers (treeler and lstm) use the label set from the training corpus (which is the CoNLL-X shared task corpus).
If you use the same parser in both versions, labels should be the same.

Notice that the web demo uses the lstm parser. If you use the rule-based parser, labels will not be the same.

Finally, note also that the rule-based parser is rather obsolete and its performance is below the state of the art. You'll get better results with the lstm parser in 4.2.
If you want to stick with 4.0 (not recommended, it is over 5 years old), you should use the dep_treeler parser, not the rule-based one.

If all this does not answer your question, please be clear about what the problem is. An example may help.