ELAN Parsing#
The rustling.elan module provides tools for parsing
ELAN annotation files (.eaf).
Loading Data#
read_elan()#
The quickest way to load ELAN data is with read_elan().
It accepts a file path, directory, ZIP archive, git URL, or HTTP URL
and figures out the right loading strategy automatically:
import rustling
# From a local .eaf file
elan = rustling.read_elan("path/to/recording.eaf")
# From a directory (recursively finds all .eaf files)
elan = rustling.read_elan("path/to/corpus/")
# From a ZIP archive
elan = rustling.read_elan("path/to/corpus.zip")
# From a git repository
elan = rustling.read_elan("https://github.com/user/corpus.git")
# From a URL (ZIP files are automatically detected and extracted)
elan = rustling.read_elan("https://example.com/corpus.zip")
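To make the automatic dispatch concrete, here is a rough sketch of how a source string could be classified. It is not rustling's actual logic: the helper name `guess_source_kind` and the rules it applies (URL scheme and suffix checks) are assumptions made purely for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def guess_source_kind(source: str) -> str:
    """Guess a loading strategy for a source string (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumption: a ".git" suffix marks a git repository URL.
        if parsed.path.endswith(".git"):
            return "git"
        return "url"
    suffix = Path(source).suffix.lower()
    if suffix == ".zip":
        return "zip"
    if suffix == ".eaf":
        return "file"
    # Assumption: anything else is treated as a directory.
    return "dir"


for example in ["path/to/recording.eaf", "path/to/corpus/", "https://example.com/corpus.zip"]:
    print(example, "->", guess_source_kind(example))
```

If the guess would be ambiguous for your input, the class methods described next let you pick the strategy explicitly.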
Using the class methods directly#
If you need finer control — for example, to pass specific files,
filter by regex, change the file extension, control caching, or parse
in-memory strings — use the ELAN class methods directly:
from rustling.elan import ELAN
From specific files:
elan = ELAN.from_files(["path/to/file1.eaf", "path/to/file2.eaf"])
From a directory with a regex filter:
elan = ELAN.from_dir("path/to/corpus/", match=r"speaker_01")
The extension parameter controls which file extension to look for (default: ".eaf").
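As a way to picture what the match and extension parameters select, the following standard-library sketch walks a directory recursively and filters by regex. The helper `find_files` is hypothetical, and whether rustling applies the regex to the file name, the relative path, or the full path is not stated here; the sketch assumes the relative path.

```python
import re
from pathlib import Path
from typing import List, Optional


def find_files(root: str, match: Optional[str] = None, extension: str = ".eaf") -> List[str]:
    """Recursively list files under root that have the given extension.

    Assumption: the regex is searched against the path relative to root.
    """
    base = Path(root)
    pattern = re.compile(match) if match else None
    found = []
    for path in base.rglob(f"*{extension}"):
        relative = path.relative_to(base).as_posix()
        if path.is_file() and (pattern is None or pattern.search(relative)):
            found.append(relative)
    return sorted(found)
```

With a corpus containing speaker_01_a.eaf and speaker_02_a.eaf, `find_files(corpus, match=r"speaker_01")` would keep only the first file.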
From a ZIP archive:
elan = ELAN.from_zip("path/to/corpus.zip")
From a git repository:
elan = ELAN.from_git("https://github.com/user/corpus.git")
From a URL (ZIP files are automatically detected and extracted):
elan = ELAN.from_url("https://example.com/corpus.zip")
From in-memory strings:
elan = ELAN.from_strs([eaf_string_1, eaf_string_2])
Parallel processing#
All loading methods accept a parallel parameter (default: True)
to enable parallel parsing of multiple files.
Accessing Tiers and Annotations#
Each ELAN file contains annotation tiers.
Call tiers() to get a list with one OrderedDict[str, Tier] per file,
keyed by tier ID:
import rustling
elan = rustling.read_elan("path/to/corpus/")
for file_tiers in elan.tiers():
    for tier_id, tier in file_tiers.items():
        print(tier_id, tier.participant, tier.linguistic_type_ref)
        for annotation in tier.annotations:
            print(f" [{annotation.start_time}-{annotation.end_time}] {annotation.value}")
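The same traversal can feed simple summaries. The sketch below counts annotations per participant and relies only on the participant and annotations attributes. The helper name is hypothetical, and the SimpleNamespace objects are stand-ins so that the example runs without a corpus; they are not real rustling types.

```python
from collections import Counter
from types import SimpleNamespace


def count_annotations_by_participant(files_tiers):
    """Count annotations per participant across the output of tiers()."""
    counts = Counter()
    for file_tiers in files_tiers:
        for tier in file_tiers.values():
            counts[tier.participant] += len(tier.annotations)
    return counts


# Stand-ins exposing only the attributes used above (not real rustling objects).
fake_tiers = [{
    "CHI": SimpleNamespace(participant="Child", annotations=["a1", "a2"]),
    "MOT": SimpleNamespace(participant="Mother", annotations=["a3"]),
}]
print(count_annotations_by_participant(fake_tiers))
```

With a real reader, the call would be `count_annotations_by_participant(elan.tiers())`.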
A Tier has the following properties:
id – Tier ID (e.g., "G-jyutping").
participant – Participant name.
annotator – Annotator name.
linguistic_type_ref – Linguistic type reference.
parent_id – Parent tier ID, or None for root tiers.
child_ids – Child tier IDs, or None if no children.
annotations – List of Annotation objects.
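The parent_id property is enough to reconstruct the tier hierarchy. As a minimal sketch, the hypothetical helper below computes how deep each tier sits, assuming every parent_id refers to a tier in the same file and the hierarchy has no cycles. The SimpleNamespace objects are stand-ins, not real Tier instances.

```python
from types import SimpleNamespace


def tier_depths(file_tiers):
    """Map each tier ID to its depth; root tiers (parent_id is None) have depth 0."""
    depths = {}

    def depth_of(tier_id):
        if tier_id not in depths:
            parent_id = file_tiers[tier_id].parent_id
            depths[tier_id] = 0 if parent_id is None else depth_of(parent_id) + 1
        return depths[tier_id]

    for tier_id in file_tiers:
        depth_of(tier_id)
    return depths


# Stand-in tiers: one root tier, one child, one grandchild.
example_tiers = {
    "CHI": SimpleNamespace(parent_id=None),
    "mor@CHI": SimpleNamespace(parent_id="CHI"),
    "gra@CHI": SimpleNamespace(parent_id="mor@CHI"),
}
print(tier_depths(example_tiers))
```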
An Annotation has:
id – Annotation ID (e.g., "a1").
start_time – Start time in milliseconds, or None if unresolvable.
end_time – End time in milliseconds, or None if unresolvable.
value – The annotation text content.
parent_id – Parent annotation ID for REF_ANNOTATION types, or None.
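To show where the millisecond times come from, the sketch below reads a tiny hand-written EAF document with the standard library. In EAF, alignable annotations point at TIME_SLOT elements, and a slot may lack a TIME_VALUE. Treating such a slot as None is this sketch's assumption about what "unresolvable" means; rustling's own resolution rules may differ. The helper `read_alignable` is hypothetical and ignores REF_ANNOTATION elements.

```python
import xml.etree.ElementTree as ET

EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3"/>
  </TIME_ORDER>
  <TIER TIER_ID="CHI" PARTICIPANT="Child" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>more cookie .</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>all gone .</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_alignable(eaf_text):
    """Return (id, start_ms, end_ms, value) for each alignable annotation."""
    root = ET.fromstring(eaf_text)
    # Map slot IDs to their TIME_VALUE string, or None when the attribute is absent.
    slots = {s.get("TIME_SLOT_ID"): s.get("TIME_VALUE") for s in root.iter("TIME_SLOT")}
    rows = []
    for ann in root.iter("ALIGNABLE_ANNOTATION"):
        start = slots.get(ann.get("TIME_SLOT_REF1"))
        end = slots.get(ann.get("TIME_SLOT_REF2"))
        rows.append((
            ann.get("ANNOTATION_ID"),
            int(start) if start is not None else None,
            int(end) if end is not None else None,
            ann.findtext("ANNOTATION_VALUE", default=""),
        ))
    return rows


for row in read_alignable(EAF):
    print(row)
```

Here annotation a2 ends at a slot with no TIME_VALUE, so its end time comes out as None.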
Converting to CHAT#
An ELAN reader can convert its data to CHAT format
for use with CHILDES / TalkBank tools.
import rustling
elan = rustling.read_elan("recording.eaf")
# Convert to a CHAT object
chat = elan.to_chat()
# Or get CHAT-formatted strings
chat_strs = elan.to_chat_strs()
# Or write .cha files directly
elan.to_chat_files("output_dir/")
Tier mapping:
Parent (alignable) tiers become CHAT main tiers (e.g., *CHI:).
Child tiers whose ID matches {name}@{code} (e.g., mor@CHI) become CHAT dependent tiers (e.g., %mor:).
ELAN Tier.participant populates the CHAT @Participants line.
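The {name}@{code} convention can be illustrated with a few lines of string handling. The helper below is hypothetical and only mirrors the naming rule described above. How rustling treats edge cases, such as IDs containing more than one "@", is not covered here.

```python
def dependent_tier_header(tier_id):
    """Split an ID like "mor@CHI" into a CHAT dependent-tier header and a participant code.

    Returns None when the ID does not follow the {name}@{code} pattern.
    """
    name, sep, code = tier_id.partition("@")
    if not sep or not name or not code:
        return None
    return f"%{name}:", code


print(dependent_tier_header("mor@CHI"))
print(dependent_tier_header("CHI"))
```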
Participant selection:
By default, only parent tiers with a 3-character ID are treated as
CHAT main tiers (matching the standard CHAT convention of 3-letter
participant codes like CHI, MOT, FAT).
To override this, pass the participants keyword argument:
# Use specific tier IDs as CHAT main tiers
chat = elan.to_chat(participants=["Speaker1", "Speaker2"])
# Also works with to_chat_strs and to_chat_files
elan.to_chat_files("output_dir/", participants=["Speaker1", "Speaker2"])
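The default selection rule can be pictured as a filter over the tier mapping. This sketch assumes "3-character ID" means exactly three characters of any kind and that a parent tier is one whose parent_id is None. The helper name and the SimpleNamespace stand-ins are for illustration only.

```python
from types import SimpleNamespace


def default_main_tier_ids(file_tiers):
    """Pick parent tiers whose ID is exactly three characters long."""
    return [
        tier_id
        for tier_id, tier in file_tiers.items()
        if tier.parent_id is None and len(tier_id) == 3
    ]


# Stand-in tiers (not real rustling objects).
sample_tiers = {
    "CHI": SimpleNamespace(parent_id=None),
    "MOT": SimpleNamespace(parent_id=None),
    "Speaker1": SimpleNamespace(parent_id=None),
    "mor@CHI": SimpleNamespace(parent_id="CHI"),
}
print(default_main_tier_ids(sample_tiers))
```

Under this rule a tier named Speaker1 is left out, which is the situation the participants argument is meant for.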
Converting to SRT#
An ELAN reader can convert its data to SRT
(SubRip Subtitle) format.
import rustling
elan = rustling.read_elan("recording.eaf")
# Convert to an SRT object
srt = elan.to_srt()
# Or get SRT-formatted strings
srt_strs = elan.to_srt_strs()
# Or write .srt files directly
elan.to_srt_files("output_dir/")
Mapping:
Each selected annotation with time marks becomes one subtitle block.
Annotations without time marks are skipped (SRT requires time ranges).
When multiple tiers are selected, the subtitle text is prefixed with the tier ID (e.g.,
"CHI: more cookie ."). For a single tier, no prefix is added.
Participant selection:
By default, only parent tiers with a 3-character ID are included
(matching the standard CHAT convention). To override this, pass
the participants keyword argument:
# Use specific tier IDs
srt = elan.to_srt(participants=["Speaker1", "Speaker2"])
# Also works with to_srt_strs and to_srt_files
elan.to_srt_files("output_dir/", participants=["Speaker1", "Speaker2"])
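The SRT mapping rules above can be sketched in plain Python. SRT timestamps use the form HH:MM:SS,mmm, so millisecond times convert directly. The function names and the tuple layout of the input are assumptions for illustration, and the sketch keeps rows in the order given rather than sorting them.

```python
def format_srt_time(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt_blocks(rows, multiple_tiers):
    """Build SRT text from (tier_id, start_ms, end_ms, text) tuples."""
    blocks = []
    for tier_id, start, end, text in rows:
        if start is None or end is None:
            continue  # SRT requires a time range, so untimed rows are skipped
        line = f"{tier_id}: {text}" if multiple_tiers else text
        index = len(blocks) + 1
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{line}\n"
        )
    return "\n".join(blocks)


rows = [("CHI", 0, 1500, "more cookie ."), ("MOT", None, 2000, "untimed")]
print(to_srt_blocks(rows, multiple_tiers=True))
```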
Converting to TextGrid#
An ELAN reader can convert its data to TextGrid format for use with Praat.
import rustling
elan = rustling.read_elan("recording.eaf")
# Convert to a TextGrid object
textgrid = elan.to_textgrid()
# Or get TextGrid-formatted strings
textgrid_strs = elan.to_textgrid_strs()
# Or write .TextGrid files directly
elan.to_textgrid_files("output_dir/")
Mapping:
Each ELAN tier becomes an IntervalTier.
Annotations without time marks are skipped.
Times are converted from milliseconds to seconds.
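The time handling amounts to a small transformation. The sketch below turns (start_ms, end_ms, text) tuples into (xmin, xmax, text) triples in seconds and drops untimed entries. The helper name and tuple layout are assumptions, and any filling of gaps between intervals is left out.

```python
def to_intervals(annotations):
    """Convert (start_ms, end_ms, text) tuples to (xmin_s, xmax_s, text) in seconds."""
    return [
        (start / 1000, end / 1000, text)
        for start, end, text in annotations
        if start is not None and end is not None
    ]


print(to_intervals([(0, 1500, "more cookie ."), (None, 2000, "untimed")]))
```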
Collection Operations#
An ELAN reader behaves like a collection of files.
You can iterate, slice, combine, and modify it:
import rustling
elan = rustling.read_elan("path/to/corpus/")
# File count and paths
print(elan.n_files)
print(elan.file_paths)
# Iteration and slicing
for single_file in elan:
    print(single_file.n_files)  # 1
subset = elan[0:3]
# Combining (elan1, elan2, and elan3 are ELAN readers loaded as above)
combined = elan1 + elan2
elan1 += elan2
# Appending and extending
elan1.append(elan2)
elan1.extend([elan2, elan3])
# Removing
last = elan.pop()
first = elan.pop_left()
elan.clear()
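Slicing and n_files combine naturally for batch processing. The generator below is a sketch that relies only on those two features. `FakeReader` is a stand-in that mimics them so the example runs without a corpus; it is not part of rustling.

```python
def batches(reader, size):
    """Yield consecutive slices of a reader, each holding at most `size` files."""
    for start in range(0, reader.n_files, size):
        yield reader[start:start + size]


class FakeReader:
    """Stand-in exposing only n_files and slicing."""

    def __init__(self, items):
        self.items = list(items)

    @property
    def n_files(self):
        return len(self.items)

    def __getitem__(self, key):
        return FakeReader(self.items[key])


print([batch.n_files for batch in batches(FakeReader(range(7)), 3)])
```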