Definition of CoNLL data format

by sandersn 2. June 2009 09:02

Here is a complete definition of the CoNLL 2007 data format. It's on a wiki, so there's a chance it could be incorrect.

However, the last editor is Joakim Nivre, author of MaltParser, so it's probably right.

On a side note, UTF-8 is the specified encoding, so I need to change my build script, which currently standardises on ISO-8859-1 (latin1) early on.
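A minimal sketch of that change, assuming the build script can run a small Python step over each file. The function names and the idea of converting file by file are my assumptions for illustration, not what the build script actually does. Working on raw bytes avoids any newline translation along the way.

```python
from pathlib import Path


def latin1_to_utf8_bytes(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    # Every byte value is a valid ISO-8859-1 character, so decoding cannot fail.
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Hypothetical helper: read a latin1 file and write a UTF-8 copy."""
    dst.write_bytes(latin1_to_utf8_bytes(src.read_bytes()))
```

Note that this should only be run on files that really are latin1. Running it on a file that is already UTF-8 will not raise an error, but it will double-encode any non-ASCII characters.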

An inquiring mind
6/2/2009 11:48:10 AM #

What's the purpose of the new format?
jpg and bmp are for pictures
.txt is for text
I didn't see what this thing's purpose was. I'm going to guess it's either linguistics or programming related, but I don't know.

Nathan Sanders
6/2/2009 2:53:18 PM #

Yes, CoNLL format is a text format for storing sentence structure, like Subject, Predicate, Direct Object. It's not XML, but the linguistics world hasn't switched to XML for the most part.

This is a note to myself, and to Google. So it may not be very interesting to regular readers.
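
For the curious, here is a rough sketch of what reading the format could look like. It assumes the layout as I understand it from the shared-task description: one token per line, ten tab-separated fields (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), an underscore for a missing value, and a blank line between sentences. The function name `read_conll` and the sample sentences are made up for illustration, so check the definition itself before relying on this.

```python
from typing import Dict, Iterable, Iterator, List

# Column names assumed from the CoNLL-X / CoNLL 2007 shared-task description.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample data, not taken from any real treebank.
SAMPLE = (
    "1\tJohn\tjohn\tN\tNNP\t_\t2\tSBJ\t_\t_\n"
    "2\tsleeps\tsleep\tV\tVBZ\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tHi\thi\tI\tUH\t_\t0\tROOT\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        # Print each word with the ID of its head and the relation label.
        print([(tok["FORM"], tok["HEAD"], tok["DEPREL"]) for tok in sent])
```

Each token points at its head by ID, with 0 standing for the root, which is how the Subject and Predicate relations above get stored in flat text.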

