CoNLL-U (Universal Dependencies)
The rustling.conllu module provides tools for parsing
CoNLL-U files,
the standard format for Universal Dependencies datasets.
CoNLL-U is a plain-text, tab-separated format in which sentences are
separated by blank lines. Each token line has 10 fields:
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC.
Comment lines start with #.
# sent_id = 1
# text = The cat sat on the mat.
1 The the DET DT Definite=Def|PronType=Art 2 det _ _
2 cat cat NOUN NN Number=Sing 3 nsubj _ _
3 sat sit VERB VBD Mood=Ind|Tense=Past 0 root _ _
4 on on ADP IN _ 6 case _ _
5 the the DET DT Definite=Def|PronType=Art 6 det _ _
6 mat mat NOUN NN Number=Sing 3 nmod _ _
7 . . PUNCT . _ 3 punct _ SpaceAfter=No
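The format described above is simple enough to sketch a reader for in standard-library Python. This is only an illustration and not how rustling parses files: `parse_conllu`, `FIELDS`, and the second sample sentence are hypothetical names and data introduced here, and the sketch assumes that fields are separated by single tab characters.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences.

    Each sentence is a dict with "comments" (list of str, without the
    leading "#") and "tokens" (list of dicts keyed by the 10 field names).
    """
    sentences = []
    comments, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if comments or tokens:
                sentences.append({"comments": comments, "tokens": tokens})
                comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            values = line.split("\t")
            if len(values) != len(FIELDS):
                raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
            tokens.append(dict(zip(FIELDS, values)))
    # Handle a final sentence that is not followed by a blank line.
    if comments or tokens:
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


def _row(*fields):
    """Join fields with tabs, as the format requires."""
    return "\t".join(fields)


# First sentence: the first three tokens of the example above.
# Second sentence: made-up data, only there to show the blank-line separator.
sample = "\n".join([
    "# sent_id = 1",
    _row("1", "The", "the", "DET", "DT", "Definite=Def|PronType=Art", "2", "det", "_", "_"),
    _row("2", "cat", "cat", "NOUN", "NN", "Number=Sing", "3", "nsubj", "_", "_"),
    _row("3", "sat", "sit", "VERB", "VBD", "Mood=Ind|Tense=Past", "0", "root", "_", "_"),
    "",
    "# sent_id = 2",
    _row("1", "Hi", "hi", "INTJ", "UH", "_", "0", "root", "_", "_"),
    "",
])

for sentence in parse_conllu(sample):
    print(sentence["comments"], [t["form"] for t in sentence["tokens"]])
```

For real treebanks, the loaders described in the next section are the better choice; the sketch only makes the line-level structure concrete.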
Loading Data
read_conllu()
The quickest way to load CoNLL-U data is with read_conllu().
It accepts a file path, directory, ZIP archive, git URL, or HTTP URL
and figures out the right loading strategy automatically:
import rustling
# From a local .conllu file
conllu = rustling.read_conllu("path/to/data.conllu")
# From a directory (recursively finds all .conllu files)
conllu = rustling.read_conllu("path/to/ud-treebank/")
# From a ZIP archive
conllu = rustling.read_conllu("path/to/treebank.zip")
# From a git repository (e.g., a Universal Dependencies treebank)
conllu = rustling.read_conllu("https://github.com/UniversalDependencies/UD_English-EWT.git")
# From a URL (ZIP files are automatically detected and extracted)
conllu = rustling.read_conllu("https://example.com/treebank.zip")
Using the class methods directly
If you need finer control – for example, to pass specific files,
filter by regex, change the file extension, control caching, or parse
in-memory strings – use the CoNLLU class methods directly:
from rustling.conllu import CoNLLU
From specific files:
conllu = CoNLLU.from_files(["path/to/train.conllu", "path/to/test.conllu"])
From a directory with a regex filter:
conllu = CoNLLU.from_dir("path/to/treebank/", match=r"test")
The extension parameter controls which file extension to look for (default: ".conllu").
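To make the combination of a regex filter and a file extension concrete, here is a rough standard-library sketch of recursive file discovery. It is not rustling's implementation: `find_files` is a hypothetical helper, and matching the regex against the path relative to the root directory is an assumption, since the text above does not say what the pattern is applied to.

```python
import re
from pathlib import Path


def find_files(root, match=None, extension=".conllu"):
    """Recursively collect files under `root` that end with `extension`.

    If `match` is given, keep only files whose path relative to `root`
    contains a match for that regex (an assumption of this sketch).
    """
    root = Path(root)
    pattern = re.compile(match) if match else None
    paths = sorted(p for p in root.rglob(f"*{extension}") if p.is_file())
    if pattern is not None:
        paths = [p for p in paths
                 if pattern.search(p.relative_to(root).as_posix())]
    return paths
```

With this sketch, `find_files("path/to/treebank/", match=r"test")` would return only the `.conllu` files whose relative path contains `test`.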
From a ZIP archive:
conllu = CoNLLU.from_zip("path/to/treebank.zip")
From a git repository:
conllu = CoNLLU.from_git("https://github.com/UniversalDependencies/UD_English-EWT.git")
From a URL (ZIP files are automatically detected and extracted):
conllu = CoNLLU.from_url("https://example.com/treebank.zip")
From in-memory strings:
conllu = CoNLLU.from_strs([conllu_string_1, conllu_string_2])
Parallel processing
All loading methods accept a parallel parameter (default: True)
to enable parallel parsing of multiple files.
Accessing Data
Sentences
Call sentences() to get a flat list of all
sentences across all files:
import rustling
conllu = rustling.read_conllu("treebank.conllu")
for sentence in conllu.sentences():
    print(sentence.comments)  # list[str] or None
    for token in sentence.tokens():
        print(token.id, token.form, token.lemma, token.upos, token.deprel)
Tokens
A Token has the following properties, corresponding
to the 10 CoNLL-U fields:
id – Word index (an integer, a range like "1-2" for multiword tokens, or a decimal like "1.1" for empty nodes).
form – Word form or punctuation symbol.
lemma – Lemma or stem of the word.
upos – Universal POS tag.
xpos – Language-specific POS tag, or "_".
feats – Morphological features, or "_".
head – Head of the current word ("0" for root), or "_".
deprel – Universal dependency relation to HEAD, or "_".
deps – Enhanced dependency graph, or "_".
misc – Any other annotation, or "_".
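Because the id field can take three shapes, code that walks tokens often needs to tell them apart. The following sketch shows one way to do that, assuming the ID is available as a string; `classify_id` and its return labels are hypothetical and not part of rustling.

```python
import re


def classify_id(token_id):
    """Classify a CoNLL-U ID string.

    Returns "word" for a plain integer, "multiword" for a range such
    as "1-2", and "empty" for a decimal such as "1.1".
    """
    if re.fullmatch(r"[0-9]+", token_id):
        return "word"
    if re.fullmatch(r"[0-9]+-[0-9]+", token_id):
        return "multiword"
    if re.fullmatch(r"[0-9]+\.[0-9]+", token_id):
        return "empty"
    raise ValueError(f"not a valid CoNLL-U ID: {token_id!r}")


for example in ("3", "1-2", "1.1"):
    print(example, classify_id(example))
```

A typical use is to skip multiword ranges and empty nodes when counting syntactic words, by keeping only tokens classified as "word".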
Converting to CHAT
A CoNLLU reader can convert its data to CHAT format
for use with CHILDES / TalkBank tools.
import rustling
conllu = rustling.read_conllu("treebank.conllu")
# Convert to a CHAT object
chat = conllu.to_chat()
# Or get CHAT-formatted strings
chat_strs = conllu.to_chat_strs()
# Or write .cha files directly
conllu.to_chat_files("output_dir/")
The conversion maps CoNLL-U token fields to CHAT morphology and grammar tiers:
%mor tier: UPOS|LEMMA (with &FEATS appended if features are present)
%gra tier: ID|HEAD|DEPREL
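The mapping above can be illustrated with a small sketch that builds one %mor item and one %gra item per token. This only mirrors the pattern as stated here; the helper names `to_mor_item` and `to_gra_item` are hypothetical, and the exact strings rustling emits (for example, how the feature string is formatted after the `&`) may differ.

```python
def to_mor_item(upos, lemma, feats="_"):
    """Build a %mor item as UPOS|LEMMA, with &FEATS appended when present."""
    item = f"{upos}|{lemma}"
    if feats and feats != "_":  # "_" marks an empty field in CoNLL-U
        item += f"&{feats}"
    return item


def to_gra_item(token_id, head, deprel):
    """Build a %gra item as ID|HEAD|DEPREL."""
    return f"{token_id}|{head}|{deprel}"


# (id, lemma, upos, feats, head, deprel) for three tokens of the
# example sentence at the top of this page.
tokens = [
    ("2", "cat", "NOUN", "Number=Sing", "3", "nsubj"),
    ("3", "sit", "VERB", "Mood=Ind|Tense=Past", "0", "root"),
    ("4", "on", "ADP", "_", "6", "case"),
]

mor_items = [to_mor_item(upos, lemma, feats)
             for _, lemma, upos, feats, _, _ in tokens]
gra_items = [to_gra_item(tid, head, deprel)
             for tid, _, _, _, head, deprel in tokens]

print(" ".join(mor_items))
print(" ".join(gra_items))
```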
Since CoNLL-U files have no participant information, a default participant code
"SPK" (Speaker) is used.
Collection Operations
A CoNLLU reader behaves like a collection of files.
You can iterate, slice, combine, and modify it:
import rustling
conllu = rustling.read_conllu("path/to/treebank/")
# File count and paths
print(conllu.n_files)
print(conllu.file_paths)
# Iteration and slicing
for single_file in conllu:
    print(single_file.n_files)  # 1
subset = conllu[0:3]
# Combining
combined = conllu1 + conllu2
conllu1 += conllu2
# Appending and extending
conllu1.append(conllu2)
conllu1.extend([conllu2, conllu3])
# Removing
last = conllu.pop()
first = conllu.pop_left()
conllu.clear()
Comments
A Sentence has a comments property that returns the comment lines (without the leading #), or None if there are no comments: