I'm using Freeling in my own program and it works great so far for Spanish apart from suffixes, so for words like these:
Dibújamelo.
¿Dígame?
Táchala.
Véndemelo.
Déjale.
It is returning Proper noun tag NP00000 and the word itself as the lemma.
What I don't understand is that I am using exactly the same code as the tutorial example, and the same configuration and data files as the analyzer program, which does return the correct tags. The online demo also returns the correct tags. My code also returns no alternative analyses.
Snippet:
opt.UserMapFile = L"";
opt.LocutionsFile = path + L"locucions.dat";
opt.AffixFile = path + L"afixos.dat";
opt.ProbabilityFile = path + L"probabilitats.dat";
opt.DictionaryFile = path + L"dicc.src";
opt.NPdataFile = path + L"np.dat";
opt.PunctuationFile = path + L"../common/punct.dat";
morfo.set_active_options (false,  // UserMap
                          true,   // NumbersDetection,
                          true,   // PunctuationDetection,
                          true,   // DatesDetection,
                          true,   // DictionarySearch,
                          true,   // AffixAnalysis,
                          false,  // CompoundAnalysis,
                          true,   // RetokContractions,
                          true,   // MultiwordsDetection,
                          true,   // NERecognition,
                          false,  // QuantitiesDetection,
                          true);  // ProbabilityAssignment
What am I doing wrong? Thanks
Option management changed in 4.2
It seems you are following the tutorial, but those examples have not been updated to the latest version, 4.2.
In 4.2 there is no class "maco_options"; instead, a generic class "analyzer_config" holds the options for all modules.
There is also a class "analyzer" that creates all the required modules for you from a given configuration. Some of the options can also be activated or deactivated at run time.
You can still use each module separately, but using the "analyzer" class is much easier.
Here is a minimal program that will work with 4.2.
Check class analyzer_config for more details about the available options.
#include <iostream>
#include "freeling.h"
using namespace std;
//---------------------------------------------
// Do whatever is needed with analyzed sentences
//---------------------------------------------
void PrintResults(const freeling::document &doc) {
  // for each paragraph in document
  for (list<freeling::paragraph>::const_iterator ip=doc.begin(); ip!=doc.end(); ++ip) {
    // for each sentence in paragraph
    for (list<freeling::sentence>::const_iterator is=ip->begin(); is!=ip->end(); ++is) {
      // for each word in sentence
      for (freeling::sentence::const_iterator w=is->begin(); w!=is->end(); ++w) {
        // print word form
        wcout << L"word '" << w->get_form() << L"'" << endl;
        // print possible analysis in word, output lemma and tag
        wcout << L" Possible analysis: {";
        for (freeling::word::const_iterator a=w->analysis_begin(); a!=w->analysis_end(); ++a)
          wcout << L" (" << a->get_lemma() << L"," << a->get_tag() << L")";
        wcout << L" }" << endl;
        // print analysis selected by the tagger
        wcout << L" Selected analysis: ("<< w->get_lemma() << L", " << w->get_tag() << L")" << endl;
      }
      // sentence separator
      wcout << endl;
    }
  }
}
///////////// MAIN PROGRAM /////////////////////
int main (int argc, char **argv) {
  // set locale to an UTF8 compatible locale
  freeling::util::init_locale(L"default");

  // get requested language from arg1, or English if not provided
  wstring lang = L"en";
  if (argc > 1) lang = freeling::util::string2wstring(argv[1]);

  // get installation path to use from arg2, or use /usr/local if not provided
  wstring ipath = L"/usr/local";
  if (argc > 2) ipath = freeling::util::string2wstring(argv[2]);

  // path to language data
  wstring lpath = ipath+L"/share/freeling/"+lang+L"/";

  // create analyzer options
  freeling::analyzer_config cfg = freeling::analyzer_config();

  // set up creation-time options (which modules will be loaded)
  cfg.config_opt.Lang = lang;
  cfg.config_opt.TOK_TokenizerFile = lpath+L"tokenizer.dat";
  cfg.config_opt.SPLIT_SplitterFile = lpath+L"splitter.dat";
  cfg.config_opt.MACO_LocutionsFile = lpath + L"locucions.dat";
  cfg.config_opt.MACO_AffixFile = lpath + L"afixos.dat";
  cfg.config_opt.MACO_ProbabilityFile = lpath + L"probabilitats.dat";
  cfg.config_opt.MACO_DictionaryFile = lpath + L"dicc.src";
  cfg.config_opt.MACO_NPDataFile = lpath + L"np.dat";
  cfg.config_opt.MACO_PunctuationFile = lpath + L"../common/punct.dat";
  cfg.config_opt.TAGGER_HMMFile = lpath+L"tagger.dat";

  // set up run-time options (which modules/parameters will be used in
  // subsequent calls. May be changed before each call to the analyzers
  // e.g. if different sentences require different configurations)
  cfg.invoke_opt.InputLevel = freeling::TEXT;
  cfg.invoke_opt.OutputLevel = freeling::TAGGED;
  cfg.invoke_opt.TAGGER_which = freeling::HMM;
  cfg.invoke_opt.TAGGER_Retokenize = true;
  cfg.invoke_opt.TAGGER_ForceSelect = freeling::RETOK;
  cfg.invoke_opt.MACO_AffixAnalysis = true;
  cfg.invoke_opt.MACO_MultiwordsDetection = true;
  cfg.invoke_opt.MACO_NumbersDetection = true;
  cfg.invoke_opt.MACO_PunctuationDetection = true;
  cfg.invoke_opt.MACO_DatesDetection = true;
  cfg.invoke_opt.MACO_QuantitiesDetection = true;
  cfg.invoke_opt.MACO_DictionarySearch = true;
  cfg.invoke_opt.MACO_ProbabilityAssignment = true;
  cfg.invoke_opt.MACO_NERecognition = true;
  cfg.invoke_opt.MACO_RetokContractions = true;

  // Create analyzer modules with given config options
  // invoke options can be changed anytime calling anlz.set_invoke_options(ivk_opt)
  freeling::analyzer anlz = freeling::analyzer(cfg);

  // get all input text in a single string
  wstring text=L"";
  wstring line;
  while (getline(wcin,line))
    text = text + line + L"\n";

  // analyze given text, leave results in doc
  freeling::document doc;
  anlz.analyze(text, doc);

  // do whatever is needed with processed document
  PrintResults(doc);
}
Thanks for the quick response. I had to change some things for it to compile:
cfg.config_opt.TAGGER_Retokenize = true;
cfg.config_opt.TAGGER_ForceSelect = freeling::RETOK;
But the result is exactly the same:
word 'Dibújamelo'
Possible analysis: { (dibújamelo,NP00000) }
Selected analysis: (dibújamelo, NP00000)
What else could it be? Thanks
Ah, right, I gave you code for the latest version in the git master branch, which is slightly different from 4.2.
Apart from the change you made, you should replace the line that creates the analyzer:
freeling::analyzer anlz = freeling::analyzer(cfg);
with:
freeling::analyzer anlz = freeling::analyzer(cfg.config_opt);
anlz.set_current_invoke_options(cfg.invoke_opt);
Let me know how that works
I tried both 4.2 and master version and it works for me in both cases, either activating TaggerRetokenize or not.
Could it be an encoding issue? Are you on Windows?
Try "Dibujamelo" without an accent. If that works then most likely your input is not UTF8
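One way to check what the analyzer actually receives is to dump the code units of the wide string. The sketch below is only an illustration: dump_code_units is a hypothetical helper, not part of FreeLing. A correctly decoded "ú" shows up as the single value 00FA, while UTF-8 bytes that were misread one byte at a time show up as the pair 00C3 00BA.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not a FreeLing function): render every wchar_t of a
// wide string as a 4-digit hexadecimal value, separated by single spaces.
std::string dump_code_units(const std::wstring &s) {
  std::ostringstream out;
  out << std::hex << std::uppercase << std::setfill('0');
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (i > 0) out << ' ';
    // setw applies only to the next output, so it is set for each value
    out << std::setw(4) << static_cast<unsigned long>(s[i]);
  }
  return out.str();
}
```

Calling dump_code_units on the string right before it is passed to anlz.analyze shows whether the accent arrived as one character or as two.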
Yes sorry I did do that - but I'm still getting the same results.
The iterator for "is" is also returning just one word for "Dibújamelo." and only one possible analysis. w->get_analysis().size() is 1
When debugging, affixes::look_for_affixes_in_list seems to be working (it finds "...melo"), but I got a bit lost as to how that is fed back into the word iterator.
Help?
Here's what the debugger shows for w:
lemma L"dibújamelo" std::wstring
tag L"NP00000" std::wstring
prob 1.0000000000000000 double
distance -1.0000000000000000 double
senses { size=0 } std::list,std::allocator>>
retok { size=0 } std::list>
selected_kbest { size=1 } std::set,std::allocator>
user { size=0 } std::vector>
w->get_analyzed_by() returns 0x0900
The code I posted works for me, and I can't think of any reason why it should work differently on your computer.
0x0900 means that this word was analyzed by the NER module and the probabilities module.
So, it did not go through the dictionary and affixes module (or they did not find an analysis for it)
(module codes are in "language.h")
I am really confused... The only suggestion I have is to activate the tracing system and see what is going on...
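A value such as 0x0900 is a bitmask, so it can be split into its individual flags before looking them up in "language.h". The sketch below only does the splitting. It deliberately does not name the flags, since the names and values should be taken from the header itself, and set_bits is a hypothetical helper rather than a FreeLing function.

```cpp
#include <vector>

// Hypothetical helper: return the individual bits set in a mask, lowest first.
std::vector<unsigned int> set_bits(unsigned int mask) {
  std::vector<unsigned int> bits;
  // Unsigned shifts wrap to 0 after the top bit, which ends the loop.
  for (unsigned int bit = 1; bit != 0; bit <<= 1) {
    if (mask & bit) bits.push_back(bit);
  }
  return bits;
}
```

For 0x0900 this yields 0x0100 and 0x0800, the two module codes to look up. If w->get_analyzed_by() returns a different integer type, a cast to unsigned int is assumed here.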
Thanks lluisp. Sorry, I missed your comment earlier. Yes, I compiled both Freeling and my application on Windows.
When I tried Dibujamelo without the accent it worked. The thing is I'm not doing anything fancy, just
wstring strSpanish = L"Dibújamelo.";
anlz.analyze(strSpanish , doc);
and checking inside strSpanish shows the correct UTF8 encoding for the ú (U+00FA) so where to begin to remedy that?
Then it is definitely an encoding problem.
You should add this instruction at the beginning of your program:
freeling::util::init_locale(L"en_US.UTF-8");
(Make sure the locale is installed on your system and that this is the right name for it on Windows. You can set any other UTF-8 locale too, e.g. "es_ES.UTF-8" or the like.)
Thanks Luis I did add that but it made no difference.
I have done some more debugging and regular expressions in accents_es.cc are not working.
e.g in this line:
else if (not is_accentued_esp(s) and (not is_monosylabic(s) or is_diacritic_esp(s)) ) {
TRACE(3,L"No accents, not monosylabic (or diacritic). Add accent to last vowel.");
s = put_accent_esp(s);
Dibújamelo comes back as not accented! Then the function
accents_es::put_accent_esp
changes it to L"dibjújamé" for some reason.
I have compiled Freeling both as Multi-Byte Character Set and as Unicode and it makes no difference; the regular expressions always do this.
Thoughts?
I really don't use windows, and I can not help you much there... International standards and Windows typically don't mix well, and UTF encodings are one recurrent reminder of that.
Did you try using the FreeLing precompiled windows versions available in github?
As an alternative, maybe you can build FreeLing for Linux inside WSL ...
I thought of another possible cause: Where do you get your input from?
The init_locale function sets up the encoding for stdin and stdout, but if you read from a file, the encoding is probably not set to UTF8.
You need to "imbue" your input stream (and the output stream, if you write to a file) with the appropriate encoding.
Check how it is done for cin/cout in init_locale, which can be found in file src/libfreeling/util.cc
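As a rough illustration of imbuing a file stream, the sketch below reads a UTF-8 file into a wstring. It is not how init_locale does it: read_utf8_file is a hypothetical helper, and it uses std::codecvt_utf8 (deprecated since C++17, though still shipped by common standard libraries) so that it does not depend on a named locale being installed.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: read a whole UTF-8 encoded file into a wide string.
std::wstring read_utf8_file(const std::string &path) {
  std::wifstream in;
  // Imbue before opening, so the UTF-8 conversion facet is in place
  // for the very first read. The locale takes ownership of the facet.
  in.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
  in.open(path.c_str());
  std::wstring text, line;
  while (std::getline(in, line))
    text += line + L"\n";
  return text;
}
```

The resulting string can then be passed to anlz.analyze in place of text read from wcin.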
Thanks Luis it was the silliest thing - I had to open the 4 files that dealt with Spanish (with _es suffix) and change the encoding on them from UTF8 to UTF8+BOM and save. Just another day (or a week!) in the life of a Windows coder! Muchas Gracias!
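One plausible explanation for why the BOM mattered, stated here as an assumption rather than something confirmed in this thread: without a BOM, the compiler may have read the source files using a single-byte code page, so every accented character inside a literal became two characters. The sketch below imitates that effect with a hypothetical helper, widen_bytewise, which turns each byte into one wchar_t.

```cpp
#include <string>

// Hypothetical helper: widen each byte to one wchar_t, imitating UTF-8 bytes
// being interpreted under a single-byte, Latin-1 style encoding.
std::wstring widen_bytewise(const std::string &bytes) {
  std::wstring out;
  for (unsigned char c : bytes)
    out += static_cast<wchar_t>(c);
  return out;
}
```

The UTF-8 encoding of "ú" is the two bytes 0xC3 0xBA, so widen_bytewise("\xC3\xBA") gives the two characters U+00C3 U+00BA instead of the single character U+00FA. A pattern built from such a literal would no longer match a correctly decoded "ú".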