FreeLing  4.0
freeling::tokenizer Class Reference

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.

#include <tokenizer.h>

Public Member Functions

 tokenizer (const std::wstring &cfgfile)
 Constructor.
void tokenize (const std::wstring &text, std::list< word > &lw) const
 tokenize string, leave result in lw
std::list< word > tokenize (const std::wstring &text) const
 tokenize string, return result as list
void tokenize (const std::wstring &text, unsigned long &offset, std::list< word > &lw) const
 tokenize string, updating byte offset. Leave results in lw.
std::list< word > tokenize (const std::wstring &text, unsigned long &offset) const
 tokenize string, updating offset, return result as list

Private Attributes

std::set< std::wstring > abrevs
 abbreviations set (Dr., Mrs., etc.); the period is not separated
std::list< std::pair< std::wstring, freeling::regexp > > rules
 tokenization rules
std::map< std::wstring, int > matches
 substrings to convert into tokens in each rule

Detailed Description

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
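
As a rough illustration of the rule-driven splitting described above, the following self-contained sketch applies an ordered list of (name, regexp) pairs to a string and takes the first rule that matches at the current position. It is not FreeLing's implementation: std::wregex stands in for freeling::regexp, plain strings stand in for word objects, and split_tokens, sample_rules, and the rule names WORD and OTHER are hypothetical.

```cpp
#include <cwctype>
#include <list>
#include <regex>
#include <string>
#include <utility>

// Hypothetical stand-in for the (name, regexp) pairs held in `rules`.
using rule = std::pair<std::wstring, std::wregex>;

// Two illustrative rules; real rules would come from the configuration file.
std::list<rule> sample_rules() {
  return {
    {L"WORD", std::wregex(L"[A-Za-z0-9]+")},
    {L"OTHER", std::wregex(L".")},
  };
}

// Skip whitespace, then emit the match of the first rule that applies
// exactly at the current position. Rules are tried in list order.
std::list<std::wstring> split_tokens(const std::wstring &text,
                                     const std::list<rule> &rules) {
  std::list<std::wstring> out;
  auto it = text.cbegin();
  while (it != text.cend()) {
    if (std::iswspace(*it)) { ++it; continue; }
    bool matched = false;
    for (const auto &r : rules) {
      std::wsmatch m;
      // match_continuous anchors the search at `it`.
      if (std::regex_search(it, text.cend(), m, r.second,
                            std::regex_constants::match_continuous)) {
        if (m.length(0) == 0) continue;  // ignore empty matches
        out.push_back(m.str(0));
        it += m.length(0);
        matched = true;
        break;
      }
    }
    // No rule applied: emit the single character so the loop always advances.
    if (!matched) { out.push_back(std::wstring(1, *it)); ++it; }
  }
  return out;
}
```

Calling split_tokens(L"Hello, world!", sample_rules()) in this sketch yields the four strings "Hello", ",", "world", and "!". Rule order matters: a more specific rule placed earlier in the list wins over a general one placed later.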


Constructor & Destructor Documentation

freeling::tokenizer::tokenizer ( const std::wstring &  cfgfile)

Constructor.

Create a tokenizer using the abbreviations and tokenization patterns read from the configuration file cfgfile.

References freeling::config_file::add_section(), freeling::config_file::close(), ERROR_CRASH, freeling::config_file::get_content_line(), freeling::config_file::get_section(), freeling::config_file::open(), and TRACE.


Member Function Documentation

void freeling::tokenizer::tokenize ( const std::wstring &  text,
std::list< word > &  lw 
) const

tokenize string, leave result in lw

std::list<word> freeling::tokenizer::tokenize ( const std::wstring &  text) const

tokenize string, return result as list

void freeling::tokenizer::tokenize ( const std::wstring &  text,
unsigned long &  offset,
std::list< word > &  lw 
) const

tokenize string, updating byte offset. Leave results in lw.

std::list<word> freeling::tokenizer::tokenize ( const std::wstring &  text,
unsigned long &  offset 
) const

tokenize string, updating offset, return result as list
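
The offset overloads can be pictured with the following self-contained sketch, in which a running offset passed by reference makes token positions relative to the whole input rather than to each chunk. This is an assumption-laden illustration, not FreeLing code: span_token and split_with_offset are hypothetical names, splitting is done on whitespace only, and the sketch assumes the offset is advanced by the length of the text processed in each call.

```cpp
#include <cwctype>
#include <list>
#include <string>

// Hypothetical stand-in for a word object carrying its position.
struct span_token {
  std::wstring form;
  unsigned long start;   // position of the first character
  unsigned long finish;  // position one past the last character
};

// Whitespace-only splitter that records positions relative to `offset`
// and then advances `offset` by the length of `text` (an assumption).
std::list<span_token> split_with_offset(const std::wstring &text,
                                        unsigned long &offset) {
  std::list<span_token> out;
  unsigned long i = 0;
  while (i < text.size()) {
    if (std::iswspace(text[i])) { ++i; continue; }
    unsigned long j = i;
    while (j < text.size() && !std::iswspace(text[j])) ++j;
    out.push_back({text.substr(i, j - i), offset + i, offset + j});
    i = j;
  }
  offset += static_cast<unsigned long>(text.size());
  return out;
}
```

With offset initialised to 0, tokenizing L"ab cd" and then L"ef" with the same variable places "ef" at position 5 in this sketch, directly after the first chunk.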


Member Data Documentation

std::set<std::wstring> freeling::tokenizer::abrevs [private]

abbreviations set (Dr., Mrs., etc.); the period is not separated
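
The role of the abbreviations set can be sketched as follows: a trailing period is split off as its own token unless the whole chunk is a known abbreviation. The function name split_final_period is hypothetical, and the exact-match lookup (no case folding) is an assumption made for brevity, not a statement about how FreeLing compares entries.

```cpp
#include <list>
#include <set>
#include <string>

// Split a whitespace-free chunk into one or two tokens. The final period
// stays attached only when the chunk appears in `abbrevs`.
std::list<std::wstring> split_final_period(
    const std::wstring &chunk, const std::set<std::wstring> &abbrevs) {
  if (chunk.size() > 1 && chunk.back() == L'.' && abbrevs.count(chunk) == 0) {
    return {chunk.substr(0, chunk.size() - 1), L"."};
  }
  return {chunk};
}
```

With the set {"Dr.", "Mrs."}, this sketch keeps "Dr." as a single token and splits "end." into "end" and ".".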

std::map<std::wstring,int> freeling::tokenizer::matches [private]

substrings to convert into tokens in each rule
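
One way to read this attribute is as a per-rule count of parenthesised subexpressions that become separate tokens. The sketch below illustrates that reading with std::wregex; the helper tokens_from_match is hypothetical, and treating a count of 0 as "use the whole match" is an assumption rather than something stated on this page.

```cpp
#include <list>
#include <regex>
#include <string>

// Turn a successful regex match into tokens. `nsub` plays the role of the
// value stored in `matches` for the rule that produced the match.
std::list<std::wstring> tokens_from_match(const std::wsmatch &m, int nsub) {
  std::list<std::wstring> out;
  if (nsub == 0) {  // assumption: 0 means the whole match is one token
    out.push_back(m.str(0));
    return out;
  }
  // Emit subexpressions 1..nsub, skipping groups that did not participate.
  for (int i = 1; i <= nsub && i < static_cast<int>(m.size()); ++i) {
    if (m[i].matched) out.push_back(m.str(i));
  }
  return out;
}
```

For example, matching L"don't" against the pattern ([a-z]+)('t) and passing nsub = 2 gives the two tokens "don" and "'t" in this sketch, while nsub = 0 gives the single token "don't".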

std::list<std::pair<std::wstring, freeling::regexp> > freeling::tokenizer::rules [private]

tokenization rules


The documentation for this class was generated from the following files: