I would like to remove entirely the use of "_" (underscore) from the tagging output and have each word in the group tagged separately.
Here are some examples of what I mean, along with the sections of the original text:
LaMuerte.txt:A_el_poco_rato al_poco_rato RG 1 [Al poco rato empezaron a sentirse mal]
abanderado.txt:por_medio_de por_medio_de SP 1 [le señalan las faltas del juego por medio de banderas]
abajo.txt:En_dirección_a en_dirección_a SP 1 [En dirección a la parte más baja]
calabazo cala_bazo NCFS000 0.575281 [un enorme calabazo lleno de piedras]
How can I do this?
There are several reasons why you get these underscores, depending on the words involved, and the solution is different in each case.
Named entities, dates, physical quantities, and multiword expressions are each detected by a module in FreeLing and glued together into a single token. The responsible modules can be selectively deactivated.
The solutions described below assume you have FreeLing installed on your computer, and that you have access to the user manual https://freeling-user-manual.readthedocs.io/en/v4.2/ for details about specific options and how to use them.
If you are using FreeLing via textserver, then the answer is shorter: Textserver offers pre-canned services based on FreeLing, but it does not expose all possible options or behaviours of the analyzer. If you want to customize FreeLing behaviour, you need to download and install it locally.
So, assuming you have FreeLing on your computer, if you input the sentence:
"Jose Perez vino el martes a las doce con treinta y dos naranjas, al poco rato se bebió dos litros de agua."
you get the following output:
Jose_Perez jose_perez NP00SP0 1
vino venir VMIS3S0 0.1875
el el DA0MS0 1
martes_a_las_doce [M:??/??/??:12.00:??] W 1
con con SP 1
treinta_y_dos 32 Z 1
naranjas naranja NCCP000 0.638706
, , Fc 1
a_el_poco_rato al_poco_rato RG 1
se se P00CN00 0.494509
bebió beber VMIS3S0 1
dos_litros VL_l:2 Zu 1
de de SP 0.999961
agua agua NCCS000 0.997446
. . Fp 1
"jose_perez" is detected by the named entity recognizer. You can deactivate it with the option --noner, but then proper nouns will no longer be marked as such. An alternative is to set the option <SplitMultiwords> to "yes" in the named entity recognizer configuration file (e.g. /usr/share/freeling/es/np.dat).
"martes_a_las_doce" is detected by the date/time recognizer. You can deactivate it with the option --nodate. Time expressions will then no longer be marked.
"treinta_y_dos" is detected by the numbers recognizer. You can deactivate it with the option --nonumb. Numerical expressions will then no longer be recognized as such (not even single-word numbers or numbers written in figures).
"al_poco_rato" is detected by the locutions (multiword) recognizer. You can deactivate it with the option --noloc.
"dos_litros" is detected by the quantities recognizer. You can deactivate it with the option --noquant.
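To find out which of these options a given text actually needs, you could scan the tagged output for glued tokens and look at their tags. The following is only a rough sketch, not part of FreeLing: it assumes the "form lemma tag probability" line layout shown above, the function names are made up for illustration, and the tag-to-module mapping is inferred from this one example (NP for named entities, W for dates, Z for numbers, Zu for quantities, anything else treated as a locution).

```python
# Tag prefixes seen in the sample output above, mapped to the option that
# deactivates the module responsible. "Zu" must be checked before "Z".
FLAG_BY_TAG_PREFIX = [
    ("NP", "--noner"),    # named entities, e.g. Jose_Perez
    ("W", "--nodate"),    # dates and times, e.g. martes_a_las_doce
    ("Zu", "--noquant"),  # physical quantities, e.g. dos_litros
    ("Z", "--nonumb"),    # numbers, e.g. treinta_y_dos
]


def flag_for_line(line):
    """Return the option suggested for one output line, or None."""
    fields = line.split()
    if len(fields) < 3:
        return None
    form, _lemma, tag = fields[:3]
    if "_" not in form:
        return None  # single-word token, nothing was glued
    for prefix, flag in FLAG_BY_TAG_PREFIX:
        if tag.startswith(prefix):
            return flag
    # Assumption: any other glued form comes from the locutions module.
    return "--noloc"


def flags_needed(tagged_output):
    """Collect the distinct options suggested by a block of tagged output."""
    flags = {flag_for_line(line) for line in tagged_output.splitlines()}
    flags.discard(None)
    return sorted(flags)


SAMPLE = """Jose_Perez jose_perez NP00SP0 1
vino venir VMIS3S0 0.1875
martes_a_las_doce [M:??/??/??:12.00:??] W 1
treinta_y_dos 32 Z 1
a_el_poco_rato al_poco_rato RG 1
dos_litros VL_l:2 Zu 1"""

if __name__ == "__main__":
    print(flags_needed(SAMPLE))
```

Each option switches off a whole module, so check the side effects described above before applying everything the script suggests.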
In your examples there is also "calabazo" -> "cala_bazo".
"calabazo" is not a Spanish word (or if it is, it is not in the FreeLing dictionary). So the analyzer tries to see if it is a compound word (like "aguafiestas", "verdeazulado", "tierra-aire", etc.). In this case, it finds a match with the two nouns "cala" and "bazo" and marks that in the lemma.
This is done by the compound detection module. You can deactivate it with the option --nocomp.
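Putting the options together, the command line can be assembled from the list of modules you want to switch off. This is a minimal sketch under a few assumptions: the six options are the ones named in this answer, while the `analyze` executable name, the `-f es.cfg` configuration argument, and the helper names are illustrative and may differ on your installation (check the user manual).

```python
import shlex

# Options named in this answer, keyed by an informal module name.
DISABLE_FLAGS = {
    "ner": "--noner",
    "dates": "--nodate",
    "numbers": "--nonumb",
    "locutions": "--noloc",
    "quantities": "--noquant",
    "compounds": "--nocomp",
}


def build_command(modules_to_disable, config="es.cfg"):
    """Build an argument list that deactivates the given modules.

    The program name and the "-f <config>" argument are assumptions
    about the local installation, not something this sketch verifies.
    """
    cmd = ["analyze", "-f", config]
    for name in modules_to_disable:
        cmd.append(DISABLE_FLAGS[name])  # KeyError on an unknown module name
    return cmd


if __name__ == "__main__":
    # Keep named entities, dates, numbers and quantities; split the rest.
    print(shlex.join(build_command(["locutions", "compounds"])))
```

The resulting list can be passed to `subprocess.run` with the text on standard input. Deactivating only the modules you need keeps the useful side of the others, such as proper nouns being marked.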
Alternatively, you can add new words to the dictionary by editing the file /usr/share/freeling/es/dicc.src