RelaxCor: Coreference Resolution system.
Emili Sapena :: Natural Language Processing Group :: LSI - UPC
NAME
RelaxCor: A constraint-based hypergraph partitioning approach to coreference resolution solved by relaxation labeling. An open source software to resolve coreferences in text documents.
VERSION
1.1
AUTHOR
Emili Sapena Masip, Universitat Politècnica de Catalunya
Natural language processing group
COPYRIGHT AND LICENSE
Copyright (C) 2011-2012, Emili Sapena Masip,
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
INSTALLATION
Requirements:
- 1. Perl.
- 2. Algorithm-Munkres: included in this package and downloadable from CPAN.
- 3. C4.5 decision trees and rules induction (C4.5: Programs for Machine Learning, Quinlan 1993). Included in this package. The included version has modifications to accept larger data sets.
- 4. WordNet (WordNet: a lexical database for English, Miller 1995).
- 5. Other Perl libraries downloadable from CPAN.
- 6. gcc or any standard C++ compiler.
Instructions:
- 1. Extract the package into a folder.
- 2. Edit the file lib/Globals.pm and set the paths to the corresponding values for your system.
- 3. Edit the file lib/Document.inc and assign the corresponding values for the columns of the input files.
- 4. Compile AttributesC-src:
$ cd /path/to/relaxcor/lib/AttributesC-src
$ perl Makefile.PL
$ make
- 5. Compile RelaxGraph-src:
$ cd /path/to/relaxcor/lib/RelaxGraph-src
$ perl Makefile.PL
$ make
USE
This package is distributed with five Perl scripts to execute the system from the command line.
Training: run-train.pl
Development: run-devel.pl
Resolution: run-relaxcor.pl
Scorer: scorer.pl
HTML output generator: gen-html.pl
TRAINING
The training process uses the feature vectors of each pair of mentions of the training documents to generate a set of weighted constraints.
- 1. The system (d)etects the boundaries of the mentions. This step is
optional and true mentions can be used instead.
$ ./run-train.pl d path/to/train/*.txt
- 2. A feature vector is (g)enerated for each pair of mentions in the training
files.
$ ./run-train.pl g path/to/train/*.txt
You can execute both steps in the same call:
$ ./run-train.pl dg path/to/train/*.txt
- 3. Training process. A set of constraints is learned from the training
examples. You need to specify the name of the model and the section. The
section should be the name of the corpus, the language, or anything else that
specializes the model for a specific data set.
$ ./run-train.pl --model=model_name --section=corpus_name t path/to/train/*.txt
This process can be divided into 10 steps. The first 9 steps (0..8) are independent and can be executed in parallel.
$ ./run-train.pl --model=model_name --section=corpus_name --step=0 t path/to/train/*.txt
$ ./run-train.pl --model=model_name --section=corpus_name --step=1 t path/to/train/*.txt
$ ./run-train.pl --model=model_name --section=corpus_name --step=2 t path/to/train/*.txt
...
The last step (9) must be executed after all the previous steps have finished.
$ ./run-train.pl --model=model_name --section=corpus_name --step=9 t path/to/train/*.txt
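The split of the training action into steps 0..8 (independent) and step 9 (last) can be driven from a small script. The following is only a sketch in Python, not part of RelaxCor: the helper names train_commands and run_training, the workers parameter and the injectable runner are assumptions made for illustration. It only builds and launches the same run-train.pl command lines documented for the stepwise training.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def train_commands(model, section, files, script="./run-train.pl"):
    """Build the documented command lines: steps 0..8 and the final step 9."""
    def command(step):
        return [script, f"--model={model}", f"--section={section}",
                f"--step={step}", "t", *files]

    independent = [command(step) for step in range(9)]  # steps 0..8
    final = command(9)                                   # must run last
    return independent, final


def run_training(model, section, files, workers=4, runner=subprocess.run):
    """Run steps 0..8 concurrently, then step 9 once all of them are done.

    `runner` is a hypothetical hook so the launcher can be replaced in a dry run.
    """
    independent, final = train_commands(model, section, files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every step and re-raises the first failure, if any.
        list(pool.map(lambda cmd: runner(cmd, check=True), independent))
    runner(final, check=True)
```

A call such as run_training("model_name", "corpus_name", sorted(glob.glob("path/to/train/*.txt"))) would reproduce the command sequence; the file list has to be expanded explicitly (for example with glob.glob) because no shell is involved.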
DEVELOPMENT
The development process solves and evaluates the development set using all the combinations for two parameters: Nprune and Balance. First of all, you need to detect mentions, generate attributes and apply constraints to the development files:
$ ./run-train.pl --model=model_name --section=corpus_name dga path/to/devel/*.txt
Then, execute the development process to find the optimal parameters given a measure (or metric): muc, ceafm, ceafe, bcub, or m (i.e. (muc+ceafe+bcub)/3)
$ ./run-devel.pl --measure=m <model_name> <corpus_name> path/to/devel/*.txt
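As an illustration of what this search amounts to, the sketch below (Python, not RelaxCor code) computes the combined measure m and picks the best pair from a grid of Nprune and Balance values. The candidate values and the evaluate callback, which stands for solving and scoring the development set with one parameter pair, are assumptions; how run-devel.pl actually enumerates the combinations is not described here.

```python
from itertools import product


def combined_measure(muc, ceafe, bcub):
    """The measure 'm': the plain average of MUC, CEAF-e and B-cubed."""
    return (muc + ceafe + bcub) / 3


def best_parameters(nprune_values, balance_values, evaluate):
    """Try every (Nprune, Balance) combination and keep the best one under 'm'.

    `evaluate(nprune, balance)` is a hypothetical callback that returns the
    (muc, ceafe, bcub) scores obtained on the development set.
    """
    return max(
        product(nprune_values, balance_values),
        key=lambda pair: combined_measure(*evaluate(*pair)),
    )
```

With scores of 60.0 (muc), 45.0 (ceafe) and 75.0 (bcub), combined_measure returns 60.0.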
RESOLUTION
The resolution process has 4 possible actions: (d)etect mentions, (g)enerate features, (a)pply constraints and (s)olve. All four actions must be executed on the input files to resolve their coreferences.
$ ./run-relaxcor.pl --model=model_name --section=corpus_name dgas path/to/test/*.txt
INPUT
Input files must be in a column-based format with one token per line, as in SemEval-2010 task 1 and the CoNLL-2011 shared task.
HTML GENERATION
A script to generate HTML from the outputs is also included. It is useful for visualizing the results of a run.
Generate an HTML file for a solved document:
$ ./gen-html.pl path/to/devel/*.txt
Generate an HTML file for each document, plus an index:
$ ./gen-html.pl --index path/to/test/*.txt
An index file is created at path/to/test/index.html linking all the other html files.
SCORER
The scorer evaluates the outputs using several measures.
use:
$ ./scorer.pl <metric> <keys_file> <response_file> [name]
metric: the metric desired to score the results:
  muc: MUCScorer (Vilain et al, 1995)
  bcub: B-Cubed (Bagga and Baldwin, 1998)
  ceafm: CEAF (Luo et al, 2005) using mention-based similarity
  ceafe: CEAF (Luo et al, 2005) using entity-based similarity
  blanc: BLANC (Recasens and Hovy, doi: 10.1017/S135132491000029X)
  <metric>rc: Variation of the metric using Resolution Class annotation (Stoyanov et al, 2009)
  all: uses all the metrics to score
keys_file: file with expected coreference chains in SemEval format
response_file: file with output of coreference system (SemEval format)
name: [optional] the name of the document to score. If name is not given, all the documents in the dataset will be scored. If given name is "none" then all the documents are scored but only total results are shown. Use name "__single" to evaluate a single file when .out and .gold are available.
GROUP CONSTRAINTS AND ENTITY-MENTION MODEL
In order to use a model that evaluates more than two mentions at once, a set of special instructions should be followed.
= Creating a model =
Entity-mention models are not automatically learned. Instead, a set of constraints must be written manually. To create a model, follow these steps:
- 1. Create a directory in models/ with the name of your model
$ mkdir models/myN4model
- 2. Create a constraints file with the specification of each constraint.
Name the file following this format: $STEM-$section.constraints
For example: models/myN4model/exp0010-semeval-en.constraints
- 3. Edit the constraints file and add your constraints following this
format: <active mention>#<influence condition>#constraint
For example:
2#0:1,3#$a{DIST_SEN_0_01} && $a{DIST_SEN_L3_02} && $a{DIST_SEN_L3_03} && $a{ALIAS_YES_13} && $a{NESTED_01} && $a{NESTED_23} && $a{SEMCLASS_YES_02} && $a{TYPE_E_1}
Note that the constraint is executed as is in a Perl script. If the result is true, the constraint applies to this group.
active mention: the number, from 0 to N, of the mention that receives the influence.
influence condition: the condition that the mentions involved must satisfy in order to influence the active mention. Entities (groups of mentions) are separated by ':'. The example 0:1,3 means that mentions 1 and 3 belong to the same entity, which is different from the entity of mention 0. The active mention receives positive influence from the first group (0 in the example) and negative influence from the rest (1,3 in this case).
constraint: a condition that the features must satisfy. Any condition can be included here, for example: ($a{feature1} && $a{feature2} == 0) || $a{feature3} == $a{feature4}
Features are the same as in the mention-pair model, with one or two numbers added at the end to indicate which mentions of the group are evaluated:
mention-pair | entity-mention
DIST_SEN_0 | DIST_SEN_0_01
J_POSSESSIVE | POSSESSIVE_1
ALIAS_YES | ALIAS_YES_01
- 4. Assign a weight to each group constraint. Create a file with this
format: $STEM-$section.weights with comma-separated values, where each
value corresponds to the weight of the constraint following the order
of the constraints file.
For example: models/myN4model/exp0010-semeval-en.weights
Content: 10,1,0.25
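To make the two file formats concrete, the sketch below (Python, not part of RelaxCor) splits a group constraint into its three '#'-separated fields and pairs each constraint with its weight by position. It assumes one constraint per line and a single comma-separated line of weights; the function names are hypothetical, and the Perl expression is kept as an opaque string rather than evaluated.

```python
def parse_group_constraint(line):
    """Split '<active mention>#<influence condition>#constraint' into parts."""
    # Split on the first two '#' only, so the Perl expression stays intact.
    active, condition, expression = line.strip().split("#", 2)
    # Entities are separated by ':' and their mentions by ','.
    groups = [[int(m) for m in group.split(",")] for group in condition.split(":")]
    return {
        "active": int(active),
        "positive": groups[0],    # first group: positive influence
        "negative": groups[1:],   # remaining groups: negative influence
        "expression": expression,
    }


def pair_constraints(constraint_lines, weights_text):
    """Match each constraint with the weight at the same position."""
    constraints = [parse_group_constraint(l) for l in constraint_lines if l.strip()]
    weights = [float(value) for value in weights_text.strip().split(",")]
    if len(constraints) != len(weights):
        raise ValueError("the number of weights must match the number of constraints")
    return list(zip(constraints, weights))
```

A check like the length comparison in pair_constraints is a cheap way to notice a constraints file and a weights file that have drifted out of sync after manual editing.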
= Applying the model to documents =
To apply the constraints of an entity-mention model to a document, the option --Ngroup must be used:
$ ./run-train.pl --model=myN4model --section=corpus_name --Ngroup=N a path/to/documents/*.txt
where N is the order of the constraints. Only N=3 and N=4 are accepted in this version.
= Executing RelaxCor using an entity-mention model =
Use the same command as for running RelaxCor, adding the option --emodel=<model name>
For example:
$ ./run-relaxcor.pl --model=model_name --section=corpus_name --emodel=myN4model s path/to/test/*.txt
The execution has a mention-pair model as a base model, and an entity-mention model.
= Development =
When using an entity-mention model in combination with a mention-pair model, a development process may improve overall performance. The learned parameters modify the weights of the mention-pair model (not the entity-mention one).
$ ./run-devel.pl --measure=m --emodel=myN4model path/to/devel/*.txt
SEE ALSO
Emili Sapena and Lluís Padró and Jordi Turmo.
RelaxCor: An Open Source Coreference Resolution System
Report.
Emili Sapena and Lluís Padró and Jordi Turmo.
A Global Relaxation Labeling Approach to Coreference Resolution
Proceedings of 23rd International Conference on Computational Linguistics, COLING,
Beijing, China. August, 2010.
Emili Sapena and Lluís Padró and Jordi Turmo.
RelaxCor Participation in CoNLL Shared Task on Coreference Resolution
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pg. 35--39.
Association for Computational Linguistics.
Portland, Oregon, USA. June, 2011.
Emili Sapena and Lluís Padró and Jordi Turmo.
RelaxCor: A Global Relaxation Labeling Approach to Coreference Resolution
Proceedings of the ACL Workshop on Semantic Evaluations (SemEval-2010),
Uppsala, Sweden. July, 2010.
Emili Sapena.
A constraint-based hypergraph partitioning approach to coreference resolution.
PhD thesis. Universitat Politècnica de Catalunya, 2012.
DOWNLOAD
Download RelaxCor v1.1