FreeLing 4.0
Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
#include <tokenizer.h>
Public Member Functions | |
tokenizer (const std::wstring &cfgfile) | |
Constructor. | |
void | tokenize (const std::wstring &text, std::list< word > &lw) const |
tokenize string, leave result in lw | |
std::list< word > | tokenize (const std::wstring &text) const |
tokenize string, return result as list | |
void | tokenize (const std::wstring &text, unsigned long &offset, std::list< word > &lw) const |
tokenize string, updating byte offset. Leave results in lw. | |
std::list< word > | tokenize (const std::wstring &text, unsigned long &offset) const |
tokenize string, updating offset, return result as list | |
Private Attributes | |
std::set< std::wstring > | abrevs |
abbreviations set (Dr., Mrs., etc.; the period is not separated) | |
std::list< std::pair < std::wstring, freeling::regexp > > | rules |
tokenization rules | |
std::map< std::wstring, int > | matches |
substrings to convert into tokens in each rule |
Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
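The idea of a rule-driven splitter with an abbreviation set can be illustrated with a minimal, standalone sketch. It is not FreeLing's implementation: `sketch_tokenizer`, `make_sketch_tokenizer`, the rule names and the three regular expressions are hypothetical, `std::wregex` stands in for `freeling::regexp`, and plain strings stand in for `word` objects. The only behaviour taken from this page is that rules are (name, regexp) pairs and that the period of a listed abbreviation is not split off.

```cpp
#include <cwctype>
#include <list>
#include <regex>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a rule-driven token splitter (NOT FreeLing code).
struct sketch_tokenizer {
  std::set<std::wstring> abbreviations;                    // e.g. "Dr.", "Mrs."
  std::list<std::pair<std::wstring, std::wregex>> rules;   // (rule name, pattern)

  std::list<std::wstring> tokenize(const std::wstring &text) const {
    std::list<std::wstring> out;
    auto it = text.cbegin();
    while (it != text.cend()) {
      if (std::iswspace(*it)) { ++it; continue; }          // skip whitespace
      bool matched = false;
      for (const auto &rule : rules) {                     // first matching rule wins
        std::wsmatch m;
        if (std::regex_search(it, text.cend(), m, rule.second,
                              std::regex_constants::match_continuous) &&
            m.length(0) > 0) {
          std::wstring tok = m.str(0);
          auto next = it + m.length(0);
          // Keep the trailing period attached when token + "." is a known abbreviation.
          if (next != text.cend() && *next == L'.' &&
              abbreviations.count(tok + L".") > 0) {
            tok += L'.';
            ++next;
          }
          out.push_back(tok);
          it = next;
          matched = true;
          break;
        }
      }
      if (!matched) {                                      // fallback: one-character token
        out.push_back(std::wstring(1, *it));
        ++it;
      }
    }
    return out;
  }
};

// Builds a tokenizer with illustrative rules; a real one would read them from a file.
sketch_tokenizer make_sketch_tokenizer() {
  sketch_tokenizer t;
  t.abbreviations = {L"Dr.", L"Mrs."};
  t.rules.emplace_back(L"WORD", std::wregex(L"[A-Za-z]+"));
  t.rules.emplace_back(L"NUMBER", std::wregex(L"[0-9]+"));
  t.rules.emplace_back(L"OTHER", std::wregex(L"[^ \\t\\n]"));
  return t;
}
```

With these illustrative rules, the sentence `Dr. Smith left.` splits into four tokens: the abbreviation keeps its period, while the sentence-final period becomes a token of its own.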
freeling::tokenizer::tokenizer | ( | const std::wstring & | cfgfile | ) |
Constructor.
Create a tokenizer, using the abbreviations and tokenization patterns read from the given configuration file.
References freeling::config_file::add_section(), freeling::config_file::close(), ERROR_CRASH, freeling::config_file::get_content_line(), freeling::config_file::get_section(), freeling::config_file::open(), and TRACE.
void freeling::tokenizer::tokenize | ( | const std::wstring & | text, |
std::list< word > & | lw | ||
) | const |
tokenize string, leave result in lw
std::list<word> freeling::tokenizer::tokenize | ( | const std::wstring & | text | ) | const |
tokenize string, return result as list
void freeling::tokenizer::tokenize | ( | const std::wstring & | text, |
unsigned long & | offset, | ||
std::list< word > & | lw | ||
) | const |
tokenize string, updating byte offset. Leave results in lw.
std::list<word> freeling::tokenizer::tokenize | ( | const std::wstring & | text, |
unsigned long & | offset | ||
) | const |
tokenize string, updating offset, return result as list
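The overloads taking an `offset` reference are useful when a document is fed to the tokenizer in several pieces. The standalone sketch below shows one plausible reading of that contract, and it is an assumption rather than something this page states: token positions are reported relative to the whole input, and the offset is advanced by the length of each processed chunk. `sketch_token` and `tokenize_with_offset` are hypothetical names, and a whitespace split replaces the real tokenization rules.

```cpp
#include <cstddef>
#include <cwctype>
#include <list>
#include <string>

// Hypothetical token type: surface form plus its span in the whole input.
struct sketch_token {
  std::wstring form;
  unsigned long start;   // position of the first character
  unsigned long finish;  // position one past the last character
};

// Whitespace-only splitter that reports spans relative to a running offset
// and advances that offset by the chunk length (assumed contract).
std::list<sketch_token> tokenize_with_offset(const std::wstring &text,
                                             unsigned long &offset) {
  std::list<sketch_token> out;
  std::size_t i = 0;
  while (i < text.size()) {
    if (std::iswspace(text[i])) { ++i; continue; }
    std::size_t j = i;
    while (j < text.size() && !std::iswspace(text[j])) ++j;
    const unsigned long start = offset + static_cast<unsigned long>(i);
    const unsigned long finish = offset + static_cast<unsigned long>(j);
    out.push_back(sketch_token{text.substr(i, j - i), start, finish});
    i = j;
  }
  offset += static_cast<unsigned long>(text.size());  // carry over to the next call
  return out;
}
```

Calling the function twice with the same `offset` variable, first on `L"Hello world\n"` and then on `L"again"`, places `again` at positions 12 to 17, so spans from the second chunk continue where the first chunk ended.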
std::set<std::wstring> freeling::tokenizer::abrevs [private] |
abbreviations set (Dr., Mrs., etc.; the period is not separated)
std::map<std::wstring,int> freeling::tokenizer::matches [private] |
substrings to convert into tokens in each rule
std::list<std::pair<std::wstring, freeling::regexp> > freeling::tokenizer::rules [private] |
tokenization rules