Output formatting in Freeling 4.0's analyzer.bat

Submitted by jgregores on Tue, 05/15/2018 - 23:33

I'm using analyzer.bat to analyze literary texts, and I would like to be able to distinguish dialogue from narration. To this end, I need to know where paragraphs start and end. The thing is, the standard output from analyzer.bat puts an empty line after each sentence but, as far as I can tell, doesn't mark paragraphs at all.

Does the analyzer.bat actually mark paragraphs? If not, is there a way to change this without coding my own main program using the libraries?

Thank you in advance.

Yes, there is.

Use the options "--mode doc --flush --output xml" and you'll get it ("--output json" will work too, if you prefer that format).

However, note that there is no standard way of marking a "paragraph". For example, if you use a word processor such as MS Word or OpenOffice, each line of the underlying text is a paragraph, since the processor wraps long lines automatically. But if the text does not come from a word processor, there may be newlines inside a paragraph. In that case the paragraph boundary is probably marked by a blank line separating two text lines, or it may be marked only by some whitespace indenting the first line of the next paragraph.
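If your texts mark paragraphs only by indentation, one option is to preprocess them so that each indented line is preceded by a blank line. The following is a minimal Python sketch of that idea, not part of FreeLing; the function name `indent_to_blank_lines` is made up for illustration, and it assumes that any text line starting with a space or tab opens a new paragraph.

```python
def indent_to_blank_lines(text: str) -> str:
    """Turn indentation-marked paragraphs into blank-line-separated ones.

    Assumption (not from FreeLing): a non-empty line that starts with a
    space or tab begins a new paragraph.
    """
    out = []
    prev_was_text = False
    for line in text.splitlines():
        is_text = bool(line.strip())
        starts_indented = is_text and line[0] in " \t"
        # Insert a separator only if the previous line was text;
        # an existing blank line already separates the paragraphs.
        if starts_indented and prev_was_text:
            out.append("")
        # Drop the indentation itself; keep blank lines as empty lines.
        out.append(line.strip() if is_text else "")
        prev_was_text = is_text
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "  First paragraph, line one\nline two\n  Second paragraph\nmore text\n"
    print(indent_to_blank_lines(sample), end="")
```

The result can be saved to a new file and used as the input to the analyzer instead of the original text.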

In summary, "analyzer" considers that a new paragraph starts when there is a blank line between two text lines. Consecutive text lines are considered to be in the same paragraph. It does not go further than that, so depending on how your input is formatted, it may not be enough.
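To check in advance how your input would be grouped under this rule, you can mimic it with a few lines of Python. This is only a sketch of the rule as described above, not FreeLing's actual code; the name `split_paragraphs` is hypothetical, and treating whitespace-only lines as blank is an assumption.

```python
from typing import List


def split_paragraphs(text: str) -> List[List[str]]:
    """Group lines into paragraphs separated by one or more blank lines.

    Assumption: a line containing only whitespace counts as blank.
    """
    paragraphs: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip():
            # Consecutive text lines belong to the same paragraph.
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    sample = "line one\nline two\n\nline three\n"
    for i, para in enumerate(split_paragraphs(sample), start=1):
        print(i, para)
```

If the paragraphs printed here do not match the ones you expect, the input needs to be reformatted before it is passed to the analyzer.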