Class tokenizer implements a token splitter, which converts a string into a sequence of word objects according to a set of tokenization rules read from a configuration file.

#include <tokenizer.h>

Public Member Functions
	tokenizer (const std::wstring &cfgfile)
	Constructor.
void	tokenize (const std::wstring &text, std::list< word > &lw) const
	tokenize string, leave result in lw
std::list< word >	tokenize (const std::wstring &text) const
	tokenize string, return result as list
void	tokenize (const std::wstring &text, unsigned long &offset, std::list< word > &lw) const
	tokenize string, updating byte offset. Leave results in lw.
std::list< word >	tokenize (const std::wstring &text, unsigned long &offset) const
	tokenize string, updating offset, return result as list
Private Attributes
std::set< std::wstring >	abrevs
	abbreviation set (Dr., Mrs., etc.); the final period of these is not separated
std::list< std::pair < std::wstring, freeling::regexp > >	rules
	tokenization rules
std::map< std::wstring, int >	matches
	substrings to convert into tokens in each rule

Detailed Description

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects according to a set of tokenization rules read from a configuration file.
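
The rule-driven splitting described above can be illustrated with a minimal, self-contained sketch. It does not use FreeLing and is not its implementation: the `token` struct, the `rule_list` alias, the rule names and patterns, and the "first matching rule wins" policy are all assumptions made for illustration, with `std::wregex` standing in for `freeling::regexp`.

```cpp
#include <cwctype>
#include <list>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for freeling::word: only the token's surface form.
struct token {
    std::wstring form;
};

// Ordered (rule name, pattern) pairs, loosely mirroring the `rules` member.
using rule_list = std::vector<std::pair<std::wstring, std::wregex>>;

// Illustrative rules only; a real rule set would come from a configuration file.
rule_list example_rules() {
    return {
        {L"NUMBER", std::wregex(L"[0-9]+([.,][0-9]+)*")},
        {L"WORD",   std::wregex(L"[A-Za-z]+")},
        {L"OTHER",  std::wregex(L"\\S")},
    };
}

// Walk the text left to right. At each non-space position, try the rules in
// order and emit the text matched by the first rule that applies there.
std::list<token> split_tokens(const std::wstring &text, const rule_list &rules) {
    std::list<token> out;
    auto pos = text.cbegin();
    while (pos != text.cend()) {
        if (std::iswspace(*pos)) { ++pos; continue; }
        bool matched = false;
        for (const auto &rule : rules) {
            std::wsmatch m;
            // match_continuous anchors the match at `pos`.
            if (std::regex_search(pos, text.cend(), m, rule.second,
                                  std::regex_constants::match_continuous)
                && m.length(0) > 0) {
                out.push_back({m.str(0)});
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) ++pos;  // no rule applies: skip this character
    }
    return out;
}
```

With these example rules, `split_tokens(L"3.14 apples, ok", example_rules())` yields four tokens: `3.14`, `apples`, `,` and `ok`. Putting the more specific rules first keeps `3.14` together instead of letting a catch-all rule split it.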

Constructor & Destructor Documentation

freeling::tokenizer::tokenizer ( const std::wstring & cfgfile )

Constructor.

Create a tokenizer, using the abbreviations and tokenization patterns found in the configuration file cfgfile.

References freeling::config_file::add_section(), freeling::config_file::close(), ERROR_CRASH, freeling::config_file::get_content_line(), freeling::config_file::get_section(), freeling::config_file::open(), and TRACE.

Member Function Documentation

void freeling::tokenizer::tokenize ( const std::wstring & text, std::list< word > & lw ) const

tokenize string, leave result in lw

std::list<word> freeling::tokenizer::tokenize ( const std::wstring & text ) const

tokenize string, return result as list

void freeling::tokenizer::tokenize ( const std::wstring & text, unsigned long & offset, std::list< word > & lw ) const

tokenize string, updating byte offset. Leave results in lw.

std::list<word> freeling::tokenizer::tokenize ( const std::wstring & text, unsigned long & offset ) const

tokenize string, updating offset, return result as list
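
One plausible use of the offset-updating overloads is tokenizing a document chunk by chunk while keeping token positions relative to the whole input. The sketch below shows that idea only. It does not call FreeLing: `spanned_token` and `split_with_offset` are hypothetical names, plain whitespace splitting stands in for the rule-based tokenization, and offsets are counted in wide characters, which may differ from the unit the library uses.

```cpp
#include <cstddef>
#include <cwctype>
#include <list>
#include <string>

// Hypothetical stand-in for freeling::word, carrying the token's position
// relative to the start of the whole input stream.
struct spanned_token {
    std::wstring form;
    unsigned long start;   // offset of the first character
    unsigned long finish;  // offset one past the last character
};

// Whitespace splitter standing in for rule-based tokenization.
// `offset` is the position of text[0] in the overall stream. New tokens are
// appended to `out`, and `offset` is advanced by text.size() on return.
void split_with_offset(const std::wstring &text, unsigned long &offset,
                       std::list<spanned_token> &out) {
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::iswspace(text[i])) { ++i; continue; }
        std::size_t j = i;
        while (j < text.size() && !std::iswspace(text[j])) ++j;
        out.push_back({text.substr(i, j - i),
                       offset + static_cast<unsigned long>(i),
                       offset + static_cast<unsigned long>(j)});
        i = j;
    }
    offset += static_cast<unsigned long>(text.size());
}
```

Calling `split_with_offset` first on `L"Hello world\n"` and then on `L"Bye\n"` with the same `offset` variable (initially 0) leaves `offset` at 16 and gives `Bye` the span 12 to 15, so its position refers to the concatenated input and not to the second chunk alone.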

Member Data Documentation

std::set<std::wstring> freeling::tokenizer::abrevs [private]

abbreviation set (Dr., Mrs., etc.); the final period of these is not separated
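
The effect of an abbreviation set can be sketched as follows. This is an illustration under stated assumptions, not FreeLing's logic: `split_final_period` is a hypothetical helper, the lookup is an exact, case-sensitive match, and only a single trailing period is considered.

```cpp
#include <list>
#include <set>
#include <string>

// Split a trailing period off `chunk` unless the whole chunk, period
// included, is listed in `abbreviations`.
std::list<std::wstring> split_final_period(
        const std::wstring &chunk,
        const std::set<std::wstring> &abbreviations) {
    const bool ends_in_period = chunk.size() > 1 && chunk.back() == L'.';
    if (ends_in_period && abbreviations.count(chunk) == 0) {
        // Ordinary word followed by a period: emit two tokens.
        return {chunk.substr(0, chunk.size() - 1), L"."};
    }
    // Known abbreviation, or no trailing period: keep as one token.
    return {chunk};
}
```

Given the set `{L"Dr.", L"Mrs."}`, the chunk `Dr.` stays as one token, while `home.` is split into `home` and `.`.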

std::map<std::wstring,int> freeling::tokenizer::matches [private]

substrings to convert into tokens in each rule

std::list<std::pair<std::wstring, freeling::regexp> > freeling::tokenizer::rules [private]

tokenization rules
