Gathering paragraphs with FreeLing in Python

Submitted by adriancampillo on Thu, 09/26/2019 - 10:09

Good morning Forum,

I have noticed that FreeLing has several functions related to lists of paragraphs, but the examples only cover sentences and words, so it is difficult to work with paragraphs.
Could anyone help me extract paragraphs from a text?

The example below works very well for sentences:

import sys
import pyfreeling

## -----------------------------------------------
## Do whatever is needed with analyzed sentences
## -----------------------------------------------
def ProcessSentences(ls):

    # for each sentence in list
    for s in ls :
        # for each word in sentence
        for w in s :
            # print word form
            print("word '"+w.get_form()+"'")
            # print possible analysis in word, output lemma and tag
            print(" Possible analysis: {",end="")
            for a in w :
                print(" ("+a.get_lemma()+","+a.get_tag()+")",end="")
            print(" }")
            # print analysis selected by the tagger
            print(" Selected Analysis: ("+w.get_lemma()+","+w.get_tag()+")")
        # sentence separator
        print("")

## -----------------------------------------------
## Set desired options for morphological analyzer
## -----------------------------------------------
def my_maco_options(lang,lpath) :

    # create options holder
    opt = pyfreeling.maco_options(lang)

    # Provide files for morphological submodules. Note that it is not
    # necessary to set file for modules that will not be used.
    opt.UserMapFile = ""
    opt.LocutionsFile = lpath + "locucions.dat"
    opt.AffixFile = lpath + "afixos.dat"
    opt.ProbabilityFile = lpath + "probabilitats.dat"
    opt.DictionaryFile = lpath + "dicc.src"
    opt.NPdataFile = lpath + "np.dat"
    opt.PunctuationFile = lpath + "../common/punct.dat"
    return opt

## ----------------------------------------------
## ------------- MAIN PROGRAM ---------------
## ----------------------------------------------

# set locale to an UTF8 compatible locale
pyfreeling.util_init_locale("default")

# get requested language from arg1, or English if not provided
lang = "en"
if len(sys.argv)>1 : lang=sys.argv[1]

# get installation path to use from arg2, or use /usr/local if not provided
ipath = "/usr/local"
if len(sys.argv)>2 : ipath=sys.argv[2]

# path to language data
lpath = ipath + "/share/freeling/" + lang + "/"

# create analyzers
tk=pyfreeling.tokenizer(lpath+"tokenizer.dat")
sp=pyfreeling.splitter(lpath+"splitter.dat")

# create the analyzer with the required set of maco_options
morfo=pyfreeling.maco(my_maco_options(lang,lpath))
# then, (de)activate required modules
morfo.set_active_options(False,  # UserMap
                         True,   # NumbersDetection
                         True,   # PunctuationDetection
                         True,   # DatesDetection
                         True,   # DictionarySearch
                         True,   # AffixAnalysis
                         False,  # CompoundAnalysis
                         True,   # RetokContractions
                         True,   # MultiwordsDetection
                         True,   # NERecognition
                         False,  # QuantitiesDetection
                         True)   # ProbabilityAssignment

# create tagger
tagger = pyfreeling.hmm_tagger(lpath+"tagger.dat",True,2)

# process input text
text = "".join(sys.stdin.readlines())

# tokenize input line into a list of words
lw = tk.tokenize(text)
# split list of words in sentences, return list of sentences
ls = sp.split(lw)

# perform morphosyntactic analysis and disambiguation
ls = morfo.analyze(ls)
ls = tagger.analyze(ls)

# do whatever is needed with processed sentences
ProcessSentences(ls)

Thank you very much

FreeLing does not detect paragraphs, because text formatting varies too much to establish a general heuristic. For instance, some texts separate lines with one newline and paragraphs with two newlines. Other texts do not break lines at all, so a newline means "new paragraph" (e.g. texts written in a document editor typically have this format).

So, you should handle paragraphs yourself in your code. If you know that a newline marks a paragraph in your text, send each line to the splitter with flush set to true, so that no sentence is held back across the paragraph boundary, and create a paragraph from the resulting list of sentences. If your paragraphs are separated by two newlines, read lines until you find two consecutive newlines, send that whole string to the splitter, and create a paragraph from the results.
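
A minimal sketch of that approach, in plain Python so it runs without FreeLing installed. The names `split_paragraphs`, `analyze_paragraphs`, and the `blank_line_separated` flag are hypothetical helpers invented for illustration, not part of the FreeLing API. The `analyze` argument is any function that turns one paragraph string into a list of sentences.

```python
import re

def split_paragraphs(text, blank_line_separated=True):
    """Return the non-empty paragraphs of `text` as a list of strings.

    blank_line_separated=True  : paragraphs are separated by blank lines
    blank_line_separated=False : every newline starts a new paragraph
    """
    if blank_line_separated:
        # A newline, optional whitespace, then another newline ends a paragraph.
        chunks = re.split(r"\n\s*\n", text)
    else:
        chunks = text.split("\n")
    # Drop empty chunks and trim surrounding whitespace.
    return [c.strip() for c in chunks if c.strip()]

def analyze_paragraphs(text, analyze, blank_line_separated=True):
    """Apply `analyze` to each paragraph; return one result per paragraph.

    `analyze` is a caller-supplied function (hypothetical here) mapping a
    paragraph string to its list of analyzed sentences.
    """
    return [analyze(p) for p in split_paragraphs(text, blank_line_separated)]
```

With the objects created in the script from the question, `analyze` could be something like `lambda p: tagger.analyze(morfo.analyze(sp.split(tk.tokenize(p))))`, and each element of the result could then be passed to `ProcessSentences`. That reuses the `sp.split(lw)` call exactly as it appears in the question; depending on the FreeLing version, the splitter may instead expect an explicit flush argument, so check the signature in your installation.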