RelaxCor: Open source coreference resolution system.

NAME

RelaxCor: A constraint-based hypergraph partitioning approach to coreference resolution solved by relaxation labeling. An open source software to resolve coreferences in text documents.

AUTHOR

Emili Sapena Masip, Universitat Politècnica de Catalunya

Natural language processing group

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

INSTALLATION

Requirements:

1. Perl.
2. Algorithm-Munkres: included in this package and downloadable from CPAN
3. C4.5 decision trees and rules induction. C4.5: Programs for Machine Learning, Quinlan 1993. Included in this package. The included version has modifications to accept larger data sets
4. WordNet. WordNet: a lexical database for English, Miller 1995
5. Other Perl libraries downloadable from CPAN
6. gcc or any standard C++ compiler.

Instructions:

1. Extract the package in a folder.
2. Edit file lib/Globals.pm and change the paths with the corresponding values for your system.
3. Edit file lib/Document.inc and assign the corresponding values for the columns of the input files.

4. Compile AttributesC-src:

$ cd /path/to/relaxcor/lib/AttributesC-src
$ perl Makefile.PL
$ make

5. Compile RelaxGraph-src:

      
$ cd /path/to/relaxcor/lib/RelaxGraph-src
$ perl Makefile.PL
$ make
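
The two compile steps can also be scripted together. The following is a minimal sketch, assuming the package was extracted to the directory given as the first argument. The function name build_modules is made up for this example and is not part of RelaxCor.

```sh
#!/bin/sh
# build_modules is a hypothetical helper, not shipped with RelaxCor.
# $1: directory where the package was extracted (/path/to/relaxcor above).
build_modules() {
    for bm_module in AttributesC-src RelaxGraph-src; do
        # The subshell keeps the caller's working directory unchanged.
        ( cd "$1/lib/$bm_module" && perl Makefile.PL && make ) || {
            echo "build failed in lib/$bm_module" >&2
            return 1
        }
    done
}

# Example call:
# build_modules /path/to/relaxcor
```

The function stops at the first module that fails to build, so a broken AttributesC-src build is not hidden by a later successful one.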

USE

This package is distributed with five Perl scripts to execute the system from the command line.

Training: run-train.pl
Development: run-devel.pl
Resolution: run-relaxcor.pl
Scorer: scorer.pl
HTML output generator: gen-html.pl

TRAINING

The training process uses the feature vectors of each pair of mentions of the training documents to generate a set of weighted constraints.

1. The system (d)etects the boundaries of the mentions. This step is optional and true mentions can be used instead.
```
$ ./run-train.pl d path/to/train/*.txt
```
2. A feature vector is (g)enerated for each pair of mentions in the training files.
```
$ ./run-train.pl g path/to/train/*.txt
```
You can execute both steps in the same call:
```
$ ./run-train.pl dg path/to/train/*.txt
```

3. Training process. A set of constraints is learned from the training examples. You need to specify the name of the model and the section. The section should be the name of the corpus, the language, or anything else that specializes the model for a specific data set.

   
$ ./run-train.pl --model=model_name --section=corpus_name t path/to/train/*.txt

This process can be divided into 10 steps. The first 9 steps (0..8) are independent and can be executed in parallel.

   
$ ./run-train.pl --model=model_name --section=corpus_name --step=0 t path/to/train/*.txt
$ ./run-train.pl --model=model_name --section=corpus_name --step=1 t path/to/train/*.txt
$ ./run-train.pl --model=model_name --section=corpus_name --step=2 t path/to/train/*.txt
...

The last step (9) must be executed after the previous steps.

   
$ ./run-train.pl --model=model_name --section=corpus_name --step=9 t path/to/train/*.txt
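
Since steps 0..8 are independent and step 9 depends on all of them, the ten calls can be driven from one small shell function. This is only a sketch: train_all_steps is a hypothetical wrapper, not a script in the package, and it assumes run-train.pl returns a non-zero exit status when a step fails.

```sh
#!/bin/sh
# train_all_steps is a hypothetical wrapper around the documented calls.
# $1: command to run (e.g. ./run-train.pl)   $2: model name   $3: section
# Remaining arguments: the training files.
train_all_steps() {
    ta_cmd=$1; ta_model=$2; ta_section=$3
    shift 3
    ta_pids=""
    ta_failed=0

    # Steps 0..8 are independent: start each one as a background job.
    for ta_step in 0 1 2 3 4 5 6 7 8; do
        "$ta_cmd" --model="$ta_model" --section="$ta_section" \
                  --step="$ta_step" t "$@" &
        ta_pids="$ta_pids $!"
    done

    # Wait for every job and remember whether any of them failed.
    for ta_pid in $ta_pids; do
        wait "$ta_pid" || ta_failed=1
    done
    if [ "$ta_failed" -ne 0 ]; then
        echo "a step in 0..8 failed; step 9 was not run" >&2
        return 1
    fi

    # Step 9 runs only after steps 0..8 have finished.
    "$ta_cmd" --model="$ta_model" --section="$ta_section" --step=9 t "$@"
}

# Example call:
# train_all_steps ./run-train.pl model_name corpus_name path/to/train/*.txt
```

Running nine jobs at once may need a lot of memory on large corpora; on a small machine the loop can be split into batches.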

DEVELOPMENT

The development process solves and evaluates the development set using all the combinations for two parameters: Nprune and Balance. First of all, you need to detect mentions, generate attributes and apply constraints to the development files:

   
$ ./run-train.pl --model=model_name --section=corpus_name dga path/to/devel/*.txt

Then, execute the development process to find the optimal parameters given a measure (or metric): muc, ceafm, ceafe, bcub, or m (i.e. (muc+ceafe+bcub)/3)

$ ./run-devel.pl --measure=m <model_name> <corpus_name> path/to/devel/*.txt
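
As a worked example of the combined measure m, the sketch below averages three scores exactly as in the formula (muc+ceafe+bcub)/3. The function name combined_measure and the input values are invented for illustration; they are not results produced by RelaxCor.

```sh
#!/bin/sh
# combined_measure is a hypothetical helper for illustration only.
# $1: MUC score   $2: CEAF-e score   $3: B-Cubed score
combined_measure() {
    LC_ALL=C awk -v muc="$1" -v ceafe="$2" -v bcub="$3" \
        'BEGIN { printf "%.2f\n", (muc + ceafe + bcub) / 3 }'
}

# Made-up scores: (60.0 + 45.0 + 75.0) / 3
combined_measure 60.0 45.0 75.0   # prints 60.00
```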

RESOLUTION

The resolution process has 4 possible actions: (d)etect mentions, (g)enerate features, (a)pply constraints and (s)olve. You need to execute all of these actions on the input files to resolve their coreferences.

   
$ ./run-relaxcor.pl --model=model_name --section=corpus_name dgas path/to/test/*.txt

INPUT

The input files must be in a column format, with one token per line, as in SemEval-2010 Task 1 and the CoNLL-2011 shared task.
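
A quick sanity check of an input file can catch rows with missing columns before a long run. The sketch below rests on assumptions that should be checked against your data: columns are separated by whitespace, blank lines separate sentences, and lines starting with '#' are document delimiters. check_columns is a hypothetical helper; the meaning of each column is whatever you configured in lib/Document.inc.

```sh
#!/bin/sh
# check_columns is a hypothetical helper, not part of RelaxCor.
# $1: input file. Reports the token and column counts, and returns a
# non-zero status if token lines do not all have the same number of columns.
check_columns() {
    awk '
        /^#/ || NF == 0 { next }                 # delimiter or blank line
        ncols == 0      { ncols = NF }           # first token line sets the width
        NF != ncols     { printf "line %d: %d columns, expected %d\n", NR, NF, ncols
                          bad = 1 }
                        { tokens++ }
        END             { printf "%d tokens, %d columns\n", tokens, ncols
                          exit (bad ? 1 : 0) }
    ' "$1"
}

# Example call:
# check_columns path/to/test/document.txt
```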

HTML GENERATION

A script that generates HTML from the outputs is also included. It is useful for visualizing the results of a run.

Generate an HTML file for a solved document:

   
$ ./gen-html.pl path/to/devel/*.txt

Generate an HTML file for each document, plus an index:

   
$ ./gen-html.pl --index path/to/test/*.txt

An index file is created at path/to/test/index.html linking all the other html files.

SCORER

The scorer evaluates the outputs using several measures.

Usage:

   
$ ./scorer.pl <metric> <keys_file> <response_file> [name]

   metric: the metric desired to score the results:
      muc:        MUCScorer (Vilain et al, 1995)
      bcub:       B-Cubed (Bagga and Baldwin, 1998)
      ceafm:      CEAF (Luo et al, 2005) using mention-based similarity
      ceafe:      CEAF (Luo et al, 2005) using entity-based similarity
      blanc:      BLANC (Recasens and Hovy, doi: 10.1017/S135132491000029X)
      <metric>rc: Variation of the metric using Resolution Class
                  annotation (Stoyanov et al, 2009)
      all:        uses all the metrics to score

   keys_file: file with expected coreference chains in SemEval format

   response_file: file with output of coreference system (SemEval format)

   name: [optional] the name of the document to score. If name is not given,
      all the documents in the dataset will be scored. If given name is "none"
      then all the documents are scored but only total results are shown. Use
      name "__single" to evaluate a single file when .out and .gold are
      available.

GROUP CONSTRAINTS AND ENTITY-MENTION MODEL

In order to use a model that evaluates more than two mentions at once, a set of special instructions should be followed.

= Creating a model =

Entity-mention models are not automatically learned. Instead, a set of constraints must be manually written. To create a model, follow these steps:

1. Create a directory in models/ with the name of your model
```
$ mkdir models/myN4model
```
2. Create a constraints file with the specification of each constraint with this format: $STEM-$section.constraints
For example: models/myN4model/exp0010-semeval-en.constraints

3. Edit the constraints file and add your constraints following this format: <active mention>#<influence condition>#constraint
For example:

2#0:1,3#$a{DIST_SEN_0_01} && $a{DIST_SEN_L3_02} && $a{DIST_SEN_L3_03} && $a{ALIAS_YES_13}
&& $a{NESTED_01} && $a{NESTED_23} && $a{SEMCLASS_YES_02} && $a{TYPE_E_1}

Note that the constraint is executed in a Perl script as is. If the result is true, the constraint applies to this group.

        
        active mention: the number from 0 to N of the mention that gets the
                        influence.

        influence condition: the condition that the involved mentions should
                             satisfy in order to influence the active
                             mention. Each entity (group of mentions) is
                             separated by ':'. The example 0:1,3 means that
                             mentions 1 and 3 belong to the same entity, which
                             is different from the entity of mention 0.
                             The active mention gets positive influence from
                             the first group (0 in the example) and negative
                             influence from the rest (1,3 in this case).

        constraint: A condition that features must satisfy. Any condition can
                    be included here
                    
                    ($a{feature1} && $a{feature2} == 0) || $a{feature3} == $a{feature4} ...
                    
                    Features are the same as in the mention-pair model, with
                    one or two numbers appended to indicate which mentions of
                    the group are evaluated.
                    
                        mention-pair    |    entity-mention
                                        |
                        DIST_SEN_0      |      DIST_SEN_0_01
                        J_POSSESSIVE    |      POSSESSIVE_1
                        ALIAS_YES       |      ALIAS_YES_01

4. Assign a weight to each group constraint. Create a file with this format: $STEM-$section.weights with comma-separated values, where each value corresponds to the weight of the constraint following the order of the constraints file.
For example:
```
models/myN4model/exp0010-semeval-en.weights
Content: 10,1,0.25
```
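
The constraint format and the weights file can be checked with a few lines of shell before applying the model. Both functions below are hypothetical helpers written for this example. They assume that each constraint occupies a single line of the constraints file and that the weights file contains only comma-separated numbers; verify both assumptions against your own files.

```sh
#!/bin/sh
# parse_constraint and check_weights are hypothetical helpers.

# $1: one constraint line, <active mention>#<influence condition>#constraint.
# Only the first two '#' are treated as separators.
parse_constraint() {
    pc_active=${1%%#*}
    pc_rest=${1#*#}
    pc_influence=${pc_rest%%#*}
    pc_condition=${pc_rest#*#}
    printf '%s\n' "active mention: $pc_active"
    printf '%s\n' "positive group: ${pc_influence%%:*}"
    printf '%s\n' "negative groups: ${pc_influence#*:}"
    printf '%s\n' "condition: $pc_condition"
}

# $1: constraints file   $2: weights file
# Succeeds only if there is one weight per constraint.
check_weights() {
    cw_constraints=$(awk 'NF { n++ } END { print n + 0 }' "$1")
    cw_weights=$(awk -F, '/[^ \t]/ { n += NF } END { print n + 0 }' "$2")
    if [ "$cw_constraints" -ne "$cw_weights" ]; then
        echo "$cw_constraints constraints but $cw_weights weights" >&2
        return 1
    fi
    echo "$cw_constraints constraints, $cw_weights weights"
}

# A shortened version of the example constraint:
parse_constraint '2#0:1,3#$a{ALIAS_YES_13} && $a{NESTED_01}'
# active mention: 2
# positive group: 0
# negative groups: 1,3
# condition: $a{ALIAS_YES_13} && $a{NESTED_01}
```

The single quotes around the constraint matter: they stop the shell from expanding $a before the line reaches the function.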

= Applying the model to documents =

To apply the constraints of an entity-mention model to a document, use the option --Ngroup:

     
$ ./run-train.pl --model=myN4model --section=corpus_name --Ngroup=N a path/to/documents/*.txt

where N is the order of the constraints. Only N=3 and N=4 are accepted in this version.

= Executing RelaxCor using an entity-mention model =

Use the same command to run RelaxCor, adding the option --emodel=<model name>. For example:

$ ./run-relaxcor.pl --model=model_name --section=corpus_name --emodel=myN4model s path/to/test/*.txt

This run combines a mention-pair model, used as the base model, with an entity-mention model.

= Development =

When using an entity-mention model in combination with a mention-pair model, a development process may improve overall performance. The learned parameters modify the weights of the mention-pair model (not the entity-mention one).

     
$ ./run-devel.pl --measure=m --emodel=myN4model   path/to/devel/*.txt

DOWNLOAD

Download RelaxCor v1.1


Emili Sapena :: Natural Language Processing Group :: LSI - UPC