RelaxCor: Coreference Resolution system.


Emili Sapena :: Natural Language Processing Group :: LSI - UPC

NAME

RelaxCor: A constraint-based hypergraph partitioning approach to coreference resolution solved by relaxation labeling. Open source software for resolving coreferences in text documents.

VERSION

1.1

AUTHOR

Emili Sapena Masip, Universitat Politècnica de Catalunya

Natural language processing group

COPYRIGHT AND LICENSE

Copyright (C) 2011-2012, Emili Sapena Masip

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

INSTALLATION

Requirements:

  • 1. Perl.
  • 2. Algorithm-Munkres: included in this package and also downloadable from CPAN.
  • 3. C4.5 decision tree and rule induction (Quinlan, C4.5: Programs for Machine Learning, 1993). Included in this package; the included version is modified to accept larger data sets.
  • 4. WordNet (Miller, WordNet: a lexical database for English, 1995).
  • 5. Other Perl libraries downloadable from CPAN
  • 6. gcc or any standard C++ compiler.
Instructions:

  • 1. Extract the package in a folder.
  • 2. Edit the file lib/Globals.pm and set the paths to the correct values for your system.
  • 3. Edit the file lib/Document.inc and assign the values that correspond to the columns of your input files.
  • 4. Compile AttributesC-src:
    $ cd /path/to/relaxcor/lib/AttributesC-src
    $ perl Makefile.PL
    $ make
    
  • 5. Compile RelaxGraph-src:
          
    $ cd /path/to/relaxcor/lib/RelaxGraph-src
    $ perl Makefile.PL
    $ make
    
USE

    This package is distributed with five Perl scripts to execute the system from the command line.

    Training: run-train.pl
    Development: run-devel.pl
    Resolution: run-relaxcor.pl
    Scorer: scorer.pl
    HTML output generator: gen-html.pl

    TRAINING

    The training process uses the feature vectors of each pair of mentions of the training documents to generate a set of weighted constraints.

    • 1. The system (d)etects the boundaries of the mentions. This step is optional and true mentions can be used instead.
      $ ./run-train.pl d path/to/train/*.txt
      
    • 2. A feature vector is (g)enerated for each pair of mentions in the training files.
         
      $ ./run-train.pl g path/to/train/*.txt
      
      You can execute both steps in the same call:
         
      $ ./run-train.pl dg path/to/train/*.txt
      
    • 3. Training process. A set of constraints is learned from the training examples. You need to specify the name of the model and the section. The section should be the name of the corpus, the language, or anything else that specializes the model for a specific data set.
         
      $ ./run-train.pl --model=model_name --section=corpus_name t path/to/train/*.txt
      
      This process can be divided into 10 steps. The first 9 steps (0..8) are independent and can be executed in parallel.
         
      $ ./run-train.pl --model=model_name --section=corpus_name --step=0 t path/to/train/*.txt
      $ ./run-train.pl --model=model_name --section=corpus_name --step=1 t path/to/train/*.txt
      $ ./run-train.pl --model=model_name --section=corpus_name --step=2 t path/to/train/*.txt
      ...
      
      The last step (9) must be executed after the previous steps.
         
      $ ./run-train.pl --model=model_name --section=corpus_name --step=9 t path/to/train/*.txt
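
Because steps 0 to 8 are independent, they can be launched as background jobs and joined before step 9 starts. The following is a minimal sketch for a POSIX shell, not part of RelaxCor: the function name train_parallel and the RUN_TRAIN variable are illustrative assumptions. RUN_TRAIN defaults to ./run-train.pl and exists only so the command can be swapped out for a dry run.

```shell
# Hypothetical helper, not part of RelaxCor: run training steps 0..8 in
# parallel, wait for all of them, then run step 9.
# Usage: train_parallel model_name corpus_name path/to/train/*.txt
RUN_TRAIN=${RUN_TRAIN:-./run-train.pl}

train_parallel() {
    model=$1
    section=$2
    shift 2
    pids=""
    for step in 0 1 2 3 4 5 6 7 8; do
        # Each independent step runs as a background job.
        $RUN_TRAIN --model="$model" --section="$section" --step="$step" t "$@" &
        pids="$pids $!"
    done
    # Stop if any of the independent steps failed.
    for pid in $pids; do
        wait "$pid" || return 1
    done
    # Step 9 needs the results of steps 0..8.
    $RUN_TRAIN --model="$model" --section="$section" --step=9 t "$@"
}
```

Setting RUN_TRAIN=echo before calling train_parallel prints the ten commands instead of running them, which is a cheap way to check the arguments first.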
      

    DEVELOPMENT

    The development process solves and evaluates the development set using all combinations of two parameters: Nprune and Balance. First, detect mentions, generate features and apply constraints on the development files:

       
    $ ./run-train.pl --model=model_name --section=corpus_name dga path/to/devel/*.txt
    

    Then run the development process to find the optimal parameters for a given measure (metric): muc, ceafm, ceafe, bcub, or m, where m = (muc+ceafe+bcub)/3.

    $ ./run-devel.pl --measure=m <model_name> <corpus_name> path/to/devel/*.txt
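
The combined measure m is a plain arithmetic mean of three scores. As a small worked example, the sketch below computes it with awk. The function name mean_m and the input values are made up for illustration; they are not results from any real run.

```shell
# Hypothetical helper: m = (muc + ceafe + bcub) / 3, as defined above.
# Arguments are the three scores, in that order.
mean_m() {
    awk -v muc="$1" -v ceafe="$2" -v bcub="$3" \
        'BEGIN { printf "%.2f\n", (muc + ceafe + bcub) / 3 }'
}

# Illustrative scores only.
mean_m 60 45 75   # prints 60.00
```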
    

    RESOLUTION

    The resolution process has 4 possible actions: (d)etect mentions, (g)enerate features, (a)pply constraints and (s)olve. All four actions must be executed on the input files to resolve their coreferences.

       
    $ ./run-relaxcor.pl --model=model_name --section=corpus_name dgas path/to/test/*.txt
    

    INPUT

    Input files must be in a column format with one token per line, as in SemEval-2010 Task 1 and the CoNLL-2011 shared task.
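
A quick way to sanity-check an input file before running the system is to count its documents, sentences and tokens. The sketch below is not part of RelaxCor and rests on assumptions about the layout: that documents open with a "#begin document" line, that other lines starting with '#' carry no tokens, and that blank lines separate sentences. Check these against your own files before relying on the counts.

```shell
# Hypothetical sanity check, not part of RelaxCor.
# Assumptions: "#begin document" opens a document, other '#' lines are
# markers, blank lines separate sentences, every other line is one token.
summarize_input() {
    awk '
        /^#begin document/ { docs++; in_sentence = 0; next }
        /^#/               { in_sentence = 0; next }   # other marker lines
        NF == 0            { in_sentence = 0; next }   # blank line ends a sentence
        {
            tokens++
            if (!in_sentence) { sentences++; in_sentence = 1 }
        }
        END { printf "%d documents, %d sentences, %d tokens\n", docs, sentences, tokens }
    ' "$1"
}
```

For example, `summarize_input path/to/test/file.txt` prints one summary line for that file.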

    HTML GENERATION

    A script that generates HTML from the outputs is also included. It is useful for visualizing the results of a run.

    To generate an HTML file of a solved document:

       
    $ ./gen-html.pl path/to/devel/*.txt
    

    To generate an HTML file for each input file, plus an index:

       
    $ ./gen-html.pl --index path/to/test/*.txt
    

    An index file is created at path/to/test/index.html, linking to all the other HTML files.

    SCORER

    The scorer evaluates the outputs using several measures.

    Usage:

       
    $ ./scorer.pl <metric> <keys_file> <response_file> [name]
    
       metric: the metric desired to score the results:
          muc:        MUCScorer (Vilain et al, 1995)
          bcub:       B-Cubed (Bagga and Baldwin, 1998)
          ceafm:      CEAF (Luo et al, 2005) using mention-based similarity
          ceafe:      CEAF (Luo et al, 2005) using entity-based similarity
          blanc:      BLANC (Recasens and Hovy, doi: 10.1017/S135132491000029X)
          <metric>rc: Variation of the metric using Resolution Class
                      annotation (Stoyanov et al, 2009)
          all:        uses all the metrics to score
    
       keys_file: file with expected coreference chains in SemEval format
    
       response_file: file with output of coreference system (SemEval format)
    
       name: [optional] the name of the document to score. If name is not given,
          all the documents in the dataset will be scored. If given name is "none"
          then all the documents are scored but only total results are shown. Use
          name "__single" to evaluate a single file when .out and .gold are
          available.
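
The metric "all" already runs every metric in one call. When only a subset is wanted, a small loop over the scorer does the job. The sketch below is an illustration, not part of RelaxCor: the function name score_metrics and the SCORER variable are assumptions, and the file names in the example comment are placeholders. It passes the name "none" so that only total results are shown, as described above.

```shell
# Hypothetical wrapper, not part of RelaxCor: score one key/response pair
# with each metric given on the command line.
SCORER=${SCORER:-./scorer.pl}

score_metrics() {
    keys=$1
    response=$2
    shift 2
    for metric in "$@"; do
        echo "== $metric =="
        # "none": score all documents but show only the totals.
        $SCORER "$metric" "$keys" "$response" none || return 1
    done
}

# Example (placeholder file names):
#   score_metrics keys_file response_file muc bcub ceafe
```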
    

    GROUP CONSTRAINTS AND ENTITY-MENTION MODEL

    In order to use a model that evaluates more than two mentions at once, a set of special instructions should be followed.

    = Creating a model =

    Entity-mention models are not learned automatically. Instead, a set of constraints must be written manually. To create a model, follow these steps:

    • 1. Create a directory under models/ with the name of your model
      $ mkdir models/myN4model
      
    • 2. Create a constraints file, named $STEM-$section.constraints, containing the specification of each constraint.
      For example: models/myN4model/exp0010-semeval-en.constraints
    • 3. Edit the constraints file and add your constraints following this format: <active mention>#<influence condition>#<constraint>
      For example:
      2#0:1,3#$a{DIST_SEN_0_01} && $a{DIST_SEN_L3_02} && $a{DIST_SEN_L3_03} && $a{ALIAS_YES_13}
      && $a{NESTED_01} && $a{NESTED_23} && $a{SEMCLASS_YES_02} && $a{TYPE_E_1}
      
      Note that the constraint is executed as is in a Perl script. If the result is true, the constraint applies to this group.
              
              active mention: the number, from 0 to N, of the mention that
                              receives the influence.
      
              influence condition: the condition that the other mentions must
                                   satisfy in order to influence the active
                                   mention. Entities (groups of mentions) are
                                   separated by ':'. The example 0:1,3 means that
                                   mentions 1 and 3 belong to the same entity, which
                                   is different from the entity of mention 0.
                                   The active mention gets positive influence from
                                   the first group (0 in the example) and negative
                                   influence from the rest (1,3 in this case).
      
              constraint: A condition that features must satisfy. Any condition can
                          be included here
                          
                          ($a{feature1} && $a{feature2} == 0) || $a{feature3} == $a{feature4} ...
                          
                          Features are the same as in the mention-pair model,
                          with one or two numbers appended to indicate which
                          mentions of the group are evaluated.
                          
                              mention-pair    |    entity-mention
                                              |
                              DIST_SEN_0      |      DIST_SEN_0_01
                              J_POSSESSIVE    |      POSSESSIVE_1
                              ALIAS_YES       |      ALIAS_YES_01
      
    • 4. Assign a weight to each group constraint. Create a file named $STEM-$section.weights containing comma-separated values, where each value is the weight of one constraint, in the same order as in the constraints file.
      For example:
      models/myN4model/exp0010-semeval-en.weights
      Content: 10,1,0.25
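
Since the constraints and weights files are written by hand, it helps to check them before use. The sketch below is not part of RelaxCor. The names split_constraint and check_weights are illustrative, and the check assumes one constraint per non-empty line and comma-separated weights; if your constraints file wraps a constraint over several lines, the count will be off.

```shell
# Hypothetical helpers, not part of RelaxCor.

# Split one constraint line into its three '#'-separated fields.
split_constraint() {
    printf '%s\n' "$1" | awk -F'#' '{
        cond = $3
        for (i = 4; i <= NF; i++) cond = cond "#" $i   # keep any later "#" in the condition
        printf "active=%s\ninfluence=%s\nconstraint=%s\n", $1, $2, cond
    }'
}

# Check that the weights file holds one weight per constraint.
# Assumption: one constraint per non-empty line of the constraints file.
# Usage: check_weights file.constraints file.weights
check_weights() {
    n_constraints=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$1")
    n_weights=$(awk -F',' 'NF > 0 { n += NF } END { print n + 0 }' "$2")
    if [ "$n_constraints" -eq "$n_weights" ]; then
        echo "ok: $n_constraints constraints, $n_weights weights"
    else
        echo "mismatch: $n_constraints constraints, $n_weights weights"
        return 1
    fi
}
```

Applied to the example files above, check_weights would be called with models/myN4model/exp0010-semeval-en.constraints and models/myN4model/exp0010-semeval-en.weights.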
      

    = Applying the model to documents =

    To apply the constraints of an entity-mention model to a document, use the --Ngroup option:

         
    $ ./run-train.pl --model=myN4model --section=corpus_name --Ngroup=N a path/to/documents/*.txt
    

    where N is the order of the constraints. Only N=3 and N=4 are accepted in this version.

    = Executing RelaxCor using an entity-mention model =

    Use the same command as for a regular RelaxCor run, adding the option --emodel=<model name>.
    For example:

    $ ./run-relaxcor.pl --model=model_name --section=corpus_name --emodel=myN4model s path/to/test/*.txt
    

    The run uses a mention-pair model as the base model, combined with the entity-mention model.

    = Development =

    When using an entity-mention model in combination with a mention-pair model, a development process may improve overall performance. The learned parameters modify the weights of the mention-pair model (not those of the entity-mention model).

         
    $ ./run-devel.pl --measure=m --emodel=myN4model <model_name> <corpus_name> path/to/devel/*.txt
    

    SEE ALSO

    Emili Sapena, Lluís Padró and Jordi Turmo.
    RelaxCor: An Open Source Coreference Resolution System.
    Report.

    Emili Sapena, Lluís Padró and Jordi Turmo.
    A Global Relaxation Labeling Approach to Coreference Resolution.
    Proceedings of the 23rd International Conference on Computational Linguistics (COLING),
    Beijing, China. August 2010.

    Emili Sapena, Lluís Padró and Jordi Turmo.
    RelaxCor Participation in CoNLL Shared Task on Coreference Resolution.
    Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pp. 35-39.
    Association for Computational Linguistics.
    Portland, Oregon, USA. June 2011.

    Emili Sapena, Lluís Padró and Jordi Turmo.
    RelaxCor: A Global Relaxation Labeling Approach to Coreference Resolution.
    Proceedings of the ACL Workshop on Semantic Evaluations (SemEval-2010),
    Uppsala, Sweden. July 2010.

    Emili Sapena.
    A constraint-based hypergraph partitioning approach to coreference resolution.
    PhD thesis. Universitat Politècnica de Catalunya, 2012.
