ELAN Parsing

The rustling.elan module provides tools for parsing ELAN annotation files (.eaf).

Loading Data

`read_elan()`

The quickest way to load ELAN data is with read_elan(). It accepts a file path, directory, ZIP archive, git URL, or HTTP URL and figures out the right loading strategy automatically:

import rustling

# From a local .eaf file
elan = rustling.read_elan("path/to/recording.eaf")

# From a directory (recursively finds all .eaf files)
elan = rustling.read_elan("path/to/corpus/")

# From a ZIP archive
elan = rustling.read_elan("path/to/corpus.zip")

# From a git repository
elan = rustling.read_elan("https://github.com/user/corpus.git")

# From a URL (ZIP files are automatically detected and extracted)
elan = rustling.read_elan("https://example.com/corpus.zip")
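To make the idea of automatic dispatch concrete, here is a minimal sketch of how a source string could be classified. It is not the logic read_elan() actually uses: the helper name guess_source_kind and its rules (a ".git" suffix for repositories, a ".zip" suffix for local archives) are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def guess_source_kind(source: str) -> str:
    """Classify a source string. Hypothetical helper, not part of rustling."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumption: a ".git" suffix marks a repository URL.
        if parsed.path.endswith(".git"):
            return "git"
        return "url"
    if os.path.isdir(source):
        return "dir"
    # Assumption: local ZIP archives are recognized by their suffix.
    if source.lower().endswith(".zip"):
        return "zip"
    return "file"


for example in (
    "https://github.com/user/corpus.git",
    "https://example.com/corpus.zip",
    "path/to/corpus.zip",
    "path/to/recording.eaf",
):
    print(example, "->", guess_source_kind(example))
```

A real implementation would likely inspect file contents or response headers rather than rely on suffixes alone.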

Using the class methods directly

If you need finer control (for example, to pass specific files, filter by regex, change the file extension, control caching, or parse in-memory strings), use the ELAN class methods directly:

from rustling.elan import ELAN

From specific files:

elan = ELAN.from_files(["path/to/file1.eaf", "path/to/file2.eaf"])

From a directory with a regex filter:

elan = ELAN.from_dir("path/to/corpus/", match=r"speaker_01")

The extension parameter controls which file extension to look for (default: ".eaf").
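The combined effect of a regex filter and a file extension can be sketched with the standard library alone. The helper find_files below is hypothetical, and matching the regex against the path relative to the corpus root is an assumption; ELAN.from_dir may apply its match argument differently.

```python
import re
import tempfile
from pathlib import Path


def find_files(root, match=None, extension=".eaf"):
    """Recursively collect files under root. Hypothetical helper, not part of rustling."""
    pattern = re.compile(match) if match else None
    paths = []
    for path in sorted(Path(root).rglob(f"*{extension}")):
        # Assumption: the regex is searched against the path relative to root.
        relative = path.relative_to(root).as_posix()
        if path.is_file() and (pattern is None or pattern.search(relative)):
            paths.append(path)
    return paths


# Build a throwaway corpus layout and filter it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("speaker_01/a.eaf", "speaker_02/b.eaf", "speaker_01/notes.txt"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")
    selected = [
        p.relative_to(tmp).as_posix()
        for p in find_files(tmp, match=r"speaker_01")
    ]

print(selected)
```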

From a ZIP archive:

elan = ELAN.from_zip("path/to/corpus.zip")

From a git repository:

elan = ELAN.from_git("https://github.com/user/corpus.git")

From a URL (ZIP files are automatically detected and extracted):

elan = ELAN.from_url("https://example.com/corpus.zip")

From in-memory strings:

elan = ELAN.from_strs([eaf_string_1, eaf_string_2])

Parallel processing

All loading methods accept a parallel parameter (default: True) to enable parallel parsing of multiple files.
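As a rough illustration of what a parallel switch means for a caller, the sketch below parses several EAF strings either sequentially or through a thread pool and returns the results in input order both ways. The function names are hypothetical, and the thread pool is only a stand-in; nothing here describes how rustling implements parallel parsing internally.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def count_tiers(eaf_str):
    """Parse one EAF document and count its TIER elements."""
    return len(ET.fromstring(eaf_str).findall("TIER"))


def count_all(eaf_strs, parallel=True):
    """Hypothetical loader: same results, in input order, either way."""
    if not parallel:
        return [count_tiers(s) for s in eaf_strs]
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in the order of its inputs.
        return list(pool.map(count_tiers, eaf_strs))


docs = [
    '<ANNOTATION_DOCUMENT><TIER TIER_ID="CHI"/></ANNOTATION_DOCUMENT>',
    '<ANNOTATION_DOCUMENT><TIER TIER_ID="CHI"/><TIER TIER_ID="MOT"/></ANNOTATION_DOCUMENT>',
]
sequential = count_all(docs, parallel=False)
concurrent = count_all(docs, parallel=True)
```

Setting parallel=False can be useful when debugging, since errors then surface in a single thread.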

Accessing Tiers and Annotations

Each ELAN file contains annotation tiers. Call tiers() to get a list with one OrderedDict[str, Tier] per file, keyed by tier ID:

import rustling

elan = rustling.read_elan("path/to/corpus/")

for file_tiers in elan.tiers():
    for tier_id, tier in file_tiers.items():
        print(tier_id, tier.participant, tier.linguistic_type_ref)
        for annotation in tier.annotations:
            print(f"  [{annotation.start_time}-{annotation.end_time}] {annotation.value}")

A Tier has the following properties:

id – Tier ID (e.g., "G-jyutping").
participant – Participant name.
annotator – Annotator name.
linguistic_type_ref – Linguistic type reference.
parent_id – Parent tier ID, or None for root tiers.
child_ids – Child tier IDs, or None if no children.
annotations – List of Annotation objects.

An Annotation has:

id – Annotation ID (e.g., "a1").
start_time – Start time in milliseconds, or None if unresolvable.
end_time – End time in milliseconds, or None if unresolvable.
value – The annotation text content.
parent_id – Parent annotation ID for REF_ANNOTATION types, or None.
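To show where these fields come from in the underlying XML, the following sketch reads a tiny hand-written EAF document with the standard library. It is independent of rustling and simplified: the sample document is invented, and the sketch leaves start_time and end_time as None for REF_ANNOTATION entries instead of resolving them through the parent annotation.

```python
import xml.etree.ElementTree as ET

# Invented sample document using standard EAF element and attribute names.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="CHI" PARTICIPANT="Child" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>more cookie .</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="mor@CHI" PARENT_REF="CHI" LINGUISTIC_TYPE_REF="dependent-lt">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>qn|more n|cookie .</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(EAF)

# Map time slot IDs to milliseconds; slots without TIME_VALUE are left out.
slots = {
    ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
    for ts in root.iter("TIME_SLOT")
    if ts.get("TIME_VALUE") is not None
}

rows = []
for tier in root.iter("TIER"):
    for wrapper in tier.iter("ANNOTATION"):
        node = wrapper[0]  # ALIGNABLE_ANNOTATION or REF_ANNOTATION
        rows.append({
            "tier": tier.get("TIER_ID"),
            "id": node.get("ANNOTATION_ID"),
            "start_time": slots.get(node.get("TIME_SLOT_REF1")),
            "end_time": slots.get(node.get("TIME_SLOT_REF2")),
            "value": node.findtext("ANNOTATION_VALUE", default=""),
            "parent_id": node.get("ANNOTATION_REF"),
        })

for row in rows:
    print(row)
```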

Converting to CHAT

An ELAN reader can convert its data to CHAT format for use with CHILDES / TalkBank tools.

import rustling

elan = rustling.read_elan("recording.eaf")

# Convert to a CHAT object
chat = elan.to_chat()

# Or get CHAT-formatted strings
chat_strs = elan.to_chat_strs()

# Or write .cha files directly
elan.to_chat_files("output_dir/")

Tier mapping:

Parent (alignable) tiers become CHAT main tiers (e.g., *CHI:).
Child tiers whose ID matches {name}@{code} (e.g., mor@CHI) become CHAT dependent tiers (e.g., %mor:).
ELAN Tier.participant populates the CHAT @Participants line.
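The tier mapping rules above can be sketched as a small classifier. The helper chat_label is hypothetical, and reading {name}@{code} as "exactly one @ with non-empty text on both sides" is an assumption about the pattern, not a statement of how to_chat() parses tier IDs.

```python
import re

# Assumption: exactly one "@" separates a non-empty name from a non-empty code.
DEPENDENT_TIER = re.compile(r"^(?P<name>[^@]+)@(?P<code>[^@]+)$")


def chat_label(tier_id, parent_id):
    """Return the CHAT tier label for an ELAN tier, or None. Hypothetical helper."""
    if parent_id is None:
        # Parent tiers become main tiers such as "*CHI:".
        return f"*{tier_id}:"
    found = DEPENDENT_TIER.match(tier_id)
    if found:
        # Child tiers named like "mor@CHI" become dependent tiers such as "%mor:".
        return f"%{found.group('name')}:"
    return None  # child tier that does not follow the naming pattern


labels = {
    "CHI": chat_label("CHI", None),
    "mor@CHI": chat_label("mor@CHI", "CHI"),
    "notes": chat_label("notes", "CHI"),
}
print(labels)
```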

Participant selection:

By default, only parent tiers with a 3-character ID are treated as CHAT main tiers (matching the standard CHAT convention of 3-letter participant codes like CHI, MOT, FAT). To override this, pass the participants keyword argument:

# Use specific tier IDs as CHAT main tiers
chat = elan.to_chat(participants=["Speaker1", "Speaker2"])

# Also works with to_chat_strs and to_chat_files
elan.to_chat_files("output_dir/", participants=["Speaker1", "Speaker2"])
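The default selection rule and its override can be expressed in a few lines. This is a sketch under stated assumptions: select_main_tiers is a hypothetical helper, the input is a plain mapping from tier ID to parent tier ID, and an explicit participants list is taken to select exactly the listed tier IDs.

```python
def select_main_tiers(parent_by_tier, participants=None):
    """Pick tier IDs to use as main tiers. Hypothetical helper, not part of rustling."""
    if participants is not None:
        # Assumption: an explicit list selects exactly those tier IDs.
        wanted = set(participants)
        return [tier_id for tier_id in parent_by_tier if tier_id in wanted]
    # Default: parent tiers (no parent of their own) with a 3-character ID.
    return [
        tier_id
        for tier_id, parent_id in parent_by_tier.items()
        if parent_id is None and len(tier_id) == 3
    ]


tiers = {"CHI": None, "mor@CHI": "CHI", "MOT": None, "Speaker1": None}
default_selection = select_main_tiers(tiers)
explicit_selection = select_main_tiers(tiers, participants=["Speaker1"])
```

With the default rule, a root tier such as "Speaker1" is dropped, which is why the participants argument matters for corpora that do not use 3-letter codes.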

Converting to SRT

An ELAN reader can convert its data to SRT (SubRip Subtitle) format.

import rustling

elan = rustling.read_elan("recording.eaf")

# Convert to an SRT object
srt = elan.to_srt()

# Or get SRT-formatted strings
srt_strs = elan.to_srt_strs()

# Or write .srt files directly
elan.to_srt_files("output_dir/")

Mapping:

Each selected annotation with time marks becomes one subtitle block.
Annotations without time marks are skipped (SRT requires time ranges).
When multiple tiers are selected, the subtitle text is prefixed with the tier ID (e.g., "CHI: more cookie ."). For a single tier, no prefix is added.
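The mapping rules above can be sketched without rustling. The SRT timestamp layout (HH:MM:SS,mmm) and block structure follow the SubRip format; the function names and the tuple shape used for annotations are hypothetical, and the ordering of blocks across tiers is not addressed here.

```python
def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt_text(annotations, multiple_tiers):
    """Build SRT text from (tier_id, start_ms, end_ms, value) tuples (hypothetical shape)."""
    blocks = []
    for tier_id, start, end, value in annotations:
        if start is None or end is None:
            continue  # SRT requires a time range, so untimed annotations are skipped
        text = f"{tier_id}: {value}" if multiple_tiers else value
        index = len(blocks) + 1  # subtitle blocks are numbered from 1
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


sample = [
    ("CHI", 0, 1500, "more cookie ."),
    ("MOT", None, None, "untimed note"),
    ("MOT", 1500, 2000, "okay ."),
]
srt_text = to_srt_text(sample, multiple_tiers=True)
print(srt_text)
```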

Participant selection:

By default, only parent tiers with a 3-character ID are included (matching the standard CHAT convention). To override this, pass the participants keyword argument:

# Use specific tier IDs
srt = elan.to_srt(participants=["Speaker1", "Speaker2"])

# Also works with to_srt_strs and to_srt_files
elan.to_srt_files("output_dir/", participants=["Speaker1", "Speaker2"])

Converting to TextGrid

An ELAN reader can convert its data to TextGrid format for use with Praat.

import rustling

elan = rustling.read_elan("recording.eaf")

# Convert to a TextGrid object
textgrid = elan.to_textgrid()

# Or get TextGrid-formatted strings
textgrid_strs = elan.to_textgrid_strs()

# Or write .TextGrid files directly
elan.to_textgrid_files("output_dir/")