Correct misspelled words

Submitted by carlesg on Thu, 05/26/2016 - 19:11

Hello!

I am using FreeLing to parse sentences and obtain the dependency tree, the syntactic function of each node, and the lemma of each word.
I use FreeLing 3.1 from a Java program (via your native interface compiled with SWIG). The language I use is Spanish.

The problem I have concerns misspelled words (specifically, accented words written without the accent). In my case, the unaccented form also exists in the dictionary. Because the word is wrongly identified, it is assigned the wrong PoS, and the dependency tree and the syntactic function / constituent are wrong as well.

For example, in the following case, the words 'último' and 'cuánto' were written without the accent ('ultimo' and 'cuanto'). FreeLing identifies 'ultimo' as a verb (instead of an adjective, which would correspond to the accented word) and 'cuanto' as an adverb (rather than an interrogative pronoun). Each of these words appears in the analysis with a single sense.

Is there any way to make FreeLing analyze words while taking typing errors into account?
How can I configure FreeLing, or run the analysis, so that FreeLing uses 'último' instead of 'ultimo' to get the PoS, dependency tree, syntactic function, etc.?

Example with misspelled words ('¿Cuanto he gastado el ultimo año?'):

grup-verb/top/(gastado gastar VMP00SM -) [
vaux/aux/(he haber VAIP1S0 -)
F-no-c/term/(¿ ¿ Fia -)
sadv/cc/(Cuanto cuanto RG -)
espec-ms/modnomatch/(el el DA0MS0 -)
grup-verb/modnomatch/(ultimo ultimar VMIP1S0 -) [
sn/cc/(año año NCMS000 -)
F-term/term/(? ? Fit -)
]
]

Example with correctly spelled words ('¿Cuánto he gastado el último año?'):

grup-verb/top/(gastado gastar VMP00SM -) [
vaux/aux/(he haber VAIP1S0 -)
F-no-c/term/(¿ ¿ Fia -)
sn/subj/(Cuánto cuánto PT0MS000 -)
sn/cc/(año año NCMS000 -) [
espec-ms/espec/(el el DA0MS0 -)
s-a-ms/adj-mod/(último último AO0MS0 -)
]
F-term/term/(? ? Fit -)
]

Thanks

"analyzer" is just an example program, and it does not offer access to all FreeLing possibilities. As it comes out of the box, "analyzer" is designed to process properly written text.

However, FreeLing contains some modules and tricks that can help you process misspelled texts, but you'll need to combine them yourself in your own main program (or alter analyzer to do it).

Some possibilities:

  • Add misspelled words to the dictionary (e.g. add "ultimo" without accent as an adjective).
    Then the tagger will have the chance to select a different tag for that word.
    Nevertheless, this will add a lot of ambiguity to the dictionary, so the tagger's performance (and thus the parser's) will suffer.
  • To mitigate the ambiguity introduced by adding misspelled words to the dictionary, you can configure the tagger to select the k-best tag sequences instead of only the best one (see the technical documentation on the hmm_tagger class). Then you can parse the different options and somehow select the most plausible one.
  • FreeLing has a module called "alternatives" that suggests words that are written or pronounced similarly to a given word. You can use this module to find words that are very similar to "ultimo" (you will get "último"), and then you can, for example, add the tags of the obtained alternatives to the word before running the sentence through the tagger.
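
To make the last idea more concrete, here is a rough Java sketch of the "collect tags from similar words" step. It does not call FreeLing at all: the class AccentAlternatives, its methods, and the toy dictionary are hypothetical stand-ins, and the "similarity" is reduced to accenting a single vowel, which is far cruder than what the alternatives module does. In a real program you would get the candidate forms from FreeLing and attach the extra analyses to the word before tagging.

```java
import java.util.*;

// Hypothetical helper, NOT part of FreeLing: it only illustrates the idea of
// enlarging a word's tag set with the tags of its accented look-alikes.
class AccentAlternatives {

    // Unaccented Spanish vowel -> accented counterpart.
    private static final Map<Character, Character> ACCENTED = Map.of(
            'a', 'á', 'e', 'é', 'i', 'í', 'o', 'ó', 'u', 'ú');

    // Every form obtained by accenting exactly one vowel of `word`.
    static List<String> singleAccentVariants(String word) {
        List<String> variants = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            Character accented = ACCENTED.get(word.charAt(i));
            if (accented != null) {
                variants.add(word.substring(0, i) + accented + word.substring(i + 1));
            }
        }
        return variants;
    }

    // Tags of the word as written, followed by the tags of any accent
    // variant that the (toy) dictionary knows about.
    static Set<String> candidateTags(String word, Map<String, Set<String>> dictionary) {
        String form = word.toLowerCase();
        Set<String> tags = new LinkedHashSet<>(dictionary.getOrDefault(form, Set.of()));
        for (String variant : singleAccentVariants(form)) {
            tags.addAll(dictionary.getOrDefault(variant, Set.of()));
        }
        return tags;
    }

    public static void main(String[] args) {
        // Toy dictionary built from the tags shown in the two analyses above.
        Map<String, Set<String>> toy = new HashMap<>();
        toy.put("ultimo", Set.of("VMIP1S0"));
        toy.put("último", Set.of("AO0MS0"));
        toy.put("cuanto", Set.of("RG"));
        toy.put("cuánto", Set.of("PT0MS000"));

        System.out.println(candidateTags("ultimo", toy));
        System.out.println(candidateTags("Cuanto", toy));
    }
}
```

With the extra tags attached, the tagger at least has the chance to pick the adjective reading for "ultimo" and the interrogative reading for "Cuanto". Whether it actually does depends on the context and on the tagger model.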