How to analyze an already tokenized file

Submitted by Anonymous (not verified) on Tue, 11/07/2017 - 20:40

I have already tried:
analyze --inplv token -f ca.cfg < orig.txt > target.txt
analyze --inplv splitted -f ca.cfg < orig.txt > target.txt

I also tried changing the original "InputLevel=text" to "InputLevel=token" in the ca.cfg file.
But it always says:
Error - 'text' input format only accepts input analysis level 'text'.

Sorry, the thing is that I…

Sorry, the thing is that I don't want FreeLing to tokenize some tags that are in the already tokenized text, for instance <s id="1">, </s>, and so on.
Thanks

Command line options…

Command line options override configuration file options. So, if you put them on the command line, you don't need to change the ca.cfg file.

There are two options controlling the kind of input. The option --inplv controls to what extent the input has already been analyzed (tokenized, splitted, tagged, etc.). The option --input controls the input format (plain text, FreeLing legacy format, conll, xml, etc.).

The default --input is "text". Since your input is no longer plain text, but has gone through some analysis stage (even if it is only a tokenizer), you should add the option "--input freeling" (the legacy FreeLing format is the only one supporting the tokenized level so far).

I get this:…

I get this:
analyze --input freeling -f ca.cfg < orig.txt > target.txt
Error - 'freeling' input format only accepts input analysis levels 'splitted', 'morfo', 'tagged', and 'senses'.

Yes. As I said, you need to…

Yes.

As I said, you need to add the option "--input freeling" to the "--inplv token" option already present, not replace it.

No improvement though…

No improvement though...

inplv before input:
analyze --inplv token --input freeling -f ca.cfg < orig.txt > target.txt
Error - 'freeling' input format only accepts input analysis levels 'splitted', 'morfo', 'tagged', and 'senses'.

input before inplv:
analyze --input freeling --inplv token -f ca.cfg < orig.txt > target.txt
Error - 'freeling' input format only accepts input analysis levels 'splitted', 'morfo', 'tagged', and 'senses'.

Uhm I see... That means you…

Uhm, I see... That means you'll need to feed it splitted text. The format is the same as for "token" (i.e. one token per line), but with an additional blank line after the end of each sentence.

So, if you just add a blank line after the end of each sentence (including the last one), it will work. You'll have to use "--inplv splitted" in the command.
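For anyone landing here later, the conversion described above could be sketched in Python roughly like this. It assumes the input already has one token per line and that a sentence ends at a token such as ".", "?" or "!". That punctuation list and the function name `tokens_to_splitted` are assumptions made for illustration, not anything FreeLing defines.

```python
def tokens_to_splitted(tokens, sentence_enders=(".", "?", "!")):
    """Return one token per line, with an empty line after each sentence.

    `sentence_enders` is an assumed list of sentence-final tokens;
    adjust it to whatever marks sentence ends in your data.
    """
    out = []
    open_sentence = False
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue  # drop blank lines already present in the input
        out.append(tok)
        open_sentence = True
        if tok in sentence_enders:
            out.append("")  # blank line closes the sentence
            open_sentence = False
    if open_sentence:
        out.append("")  # the last sentence needs its blank line too
    return out
```

Reading the token file into a list of lines, passing it through this function, and writing the result back with one entry per line should give a file suitable for "--inplv splitted".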

Situation remains the same,…

Situation remains the same, now my file is like this:
el

programa

compressor

Pkzip

But I still get this:
analyze --input freeling --inplv token -f ca.cfg < orig_bl.txt > target.txt
Error - 'freeling' input format only accepts input analysis levels 'splitted', 'morfo', 'tagged', and 'senses'.

As I already said in my…

As I already said in my previous answer, --inplv must be "splitted".
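To make the combination of options explicit, here is a small Python sketch that assembles the command and rejects the level/format pairs that the error messages in this thread complain about. The table of accepted levels is copied from those error messages only; the helper names `build_analyze_cmd` and `run_analyze` are hypothetical, and other input formats are simply not checked.

```python
import subprocess

# Input levels accepted per input format, as stated by the error
# messages quoted in this thread (nothing beyond that is assumed).
ACCEPTED_LEVELS = {
    "text": {"text"},
    "freeling": {"splitted", "morfo", "tagged", "senses"},
}


def build_analyze_cmd(config, input_format="freeling", inplv="splitted"):
    """Build the argument list for the `analyze` command."""
    allowed = ACCEPTED_LEVELS.get(input_format)
    if allowed is not None and inplv not in allowed:
        raise ValueError(
            f"'{input_format}' input format does not accept level '{inplv}'"
        )
    return ["analyze", "--input", input_format, "--inplv", inplv, "-f", config]


def run_analyze(config, src_path, dst_path, **options):
    """Run `analyze`, reading from src_path and writing to dst_path."""
    cmd = build_analyze_cmd(config, **options)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        subprocess.run(cmd, stdin=src, stdout=dst, check=True)
```

With the defaults, `build_analyze_cmd("ca.cfg")` corresponds to "analyze --input freeling --inplv splitted -f ca.cfg", which is the form the replies above point to.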

Sorry, I didn't do a…

Sorry, I didn't read your comment thoroughly. It seems to be working now, and I think this also solved the problem I described in the other post, "Error on analysis". I had to change the tags, though, for example from <contrac forma="al"> to Acontracformaigcialcdz, so FreeLing analyzes them like this: Acontracformaigcialcdz acontracformaigcialcdz NP00000 1

The input must be PLAIN TEXT…

The input must be PLAIN TEXT. That means NO XML marks.
For instance, a correct input (for the options --inplv splitted --input freeling) would be:
El
gat
menja
peix
.

El
gos
borda
.

So, you need to remove all meta-information (such as "<contrac" or "forma=") and leave ONLY THE TEXT: one word per line, one blank line after each sentence. (I already mentioned this several times. Please read my posts carefully and follow them before posting again.)
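A possible way to do that clean-up in Python is sketched below. It assumes every tag sits alone on its own line and that the closing tag "</s>" marks the end of a sentence, as in the tags mentioned at the start of this thread. Both points are assumptions about the file, and `strip_tags_to_splitted` is a made-up name.

```python
import re

# A line consisting of a single XML-like tag, e.g. <s id="1"> or </s>.
TAG_LINE = re.compile(r"^<[^<>]+>$")


def strip_tags_to_splitted(lines, sentence_close="</s>"):
    """Drop tag-only lines and put a blank line where a sentence closes.

    Assumes one token or one tag per input line; `sentence_close` is the
    tag taken to mark the end of a sentence.
    """
    out = []
    open_sentence = False
    for line in lines:
        tok = line.strip()
        if not tok:
            continue  # ignore blank lines in the input
        if TAG_LINE.match(tok):
            if tok == sentence_close and open_sentence:
                out.append("")  # blank line after the sentence
                open_sentence = False
            continue  # tags never reach the output
        out.append(tok)
        open_sentence = True
    if open_sentence:
        out.append("")  # close a trailing sentence with no </s>
    return out
```

If the tags wrap tokens on the same line instead, the pattern would need to change; this version only handles tags on separate lines.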