ELAN Parsing

The rustling.elan module provides tools for parsing ELAN annotation files (.eaf).

Loading Data

read_elan()

The quickest way to load ELAN data is read_elan(). It accepts a file path, a directory, a ZIP archive, a git URL, or an HTTP URL, and picks the matching loading strategy automatically:

import rustling

# From a local .eaf file
elan = rustling.read_elan("path/to/recording.eaf")

# From a directory (recursively finds all .eaf files)
elan = rustling.read_elan("path/to/corpus/")

# From a ZIP archive
elan = rustling.read_elan("path/to/corpus.zip")

# From a git repository
elan = rustling.read_elan("https://github.com/user/corpus.git")

# From a URL (ZIP files are automatically detected and extracted)
elan = rustling.read_elan("https://example.com/corpus.zip")
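
To make the idea of automatic dispatch concrete, here is a minimal sketch that classifies a source string by its shape alone. It is not rustling's actual logic: guess_source_kind is a hypothetical helper, and the rules it applies (an http(s) scheme, a .git, .zip, or .eaf suffix) are assumptions chosen to match the five examples above.

```python
from urllib.parse import urlparse


def guess_source_kind(source: str) -> str:
    """Classify a source string by its shape alone (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Remote sources: a ".git" suffix suggests a repository to clone.
        return "git" if parsed.path.endswith(".git") else "url"
    if source.lower().endswith(".zip"):
        return "zip"
    if source.lower().endswith(".eaf"):
        return "file"
    # Assumption: anything else is a directory to search recursively.
    return "dir"


for example in [
    "path/to/recording.eaf",
    "path/to/corpus/",
    "path/to/corpus.zip",
    "https://github.com/user/corpus.git",
    "https://example.com/corpus.zip",
]:
    print(f"{example} -> {guess_source_kind(example)}")
```

A real loader would presumably also inspect the file system or the HTTP response rather than rely on suffixes, so treat this only as a picture of the decision being made.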

Using the class methods directly

If you need finer control, for example to pass specific files, filter by regex, change the file extension, control caching, or parse in-memory strings, use the ELAN class methods directly:

from rustling.elan import ELAN

From specific files:

elan = ELAN.from_files(["path/to/file1.eaf", "path/to/file2.eaf"])

From a directory with a regex filter:

elan = ELAN.from_dir("path/to/corpus/", match=r"speaker_01")

The extension parameter controls which file extension to look for (default: ".eaf").
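
The combined effect of a regex filter and an extension filter can be pictured with plain path strings. The sketch below uses only the standard library. select_paths is a hypothetical helper, and it assumes the pattern may match anywhere in the path (re.search semantics), which this page does not confirm for from_dir.

```python
import re
from pathlib import PurePosixPath


def select_paths(paths, match=None, extension=".eaf"):
    """Keep paths with the given extension whose text matches the regex."""
    pattern = re.compile(match) if match else None
    selected = []
    for path in paths:
        if PurePosixPath(path).suffix != extension:
            continue
        # Assumption: the pattern may match anywhere in the path (re.search).
        if pattern is not None and not pattern.search(path):
            continue
        selected.append(path)
    return selected


paths = [
    "corpus/speaker_01/session_a.eaf",
    "corpus/speaker_01/notes.txt",
    "corpus/speaker_02/session_b.eaf",
]
print(select_paths(paths, match=r"speaker_01"))
```

Only the first path passes both filters: the second has the wrong extension and the third does not match the pattern.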

From a ZIP archive:

elan = ELAN.from_zip("path/to/corpus.zip")

From a git repository:

elan = ELAN.from_git("https://github.com/user/corpus.git")

From a URL (ZIP files are automatically detected and extracted):

elan = ELAN.from_url("https://example.com/corpus.zip")

From in-memory strings:

# eaf_string_1 and eaf_string_2 are str objects holding the text of .eaf files
elan = ELAN.from_strs([eaf_string_1, eaf_string_2])

Parallel processing

All loading methods accept a parallel parameter (default: True) to enable parallel parsing of multiple files.

Accessing Tiers and Annotations

Each ELAN file contains annotation tiers. Call tiers() to get a list with one OrderedDict[str, Tier] per file, keyed by tier ID:

import rustling

elan = rustling.read_elan("path/to/corpus/")

for file_tiers in elan.tiers():
    for tier_id, tier in file_tiers.items():
        print(tier_id, tier.participant, tier.linguistic_type_ref)
        for annotation in tier.annotations:
            print(f"  [{annotation.start_time}-{annotation.end_time}] {annotation.value}")

A Tier has the following properties:

  • id – Tier ID (e.g., "G-jyutping").

  • participant – Participant name.

  • annotator – Annotator name.

  • linguistic_type_ref – Linguistic type reference.

  • parent_id – Parent tier ID, or None for root tiers.

  • child_ids – Child tier IDs, or None if no children.

  • annotations – List of Annotation objects.
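
Because parent_id links each tier to its parent, the tier hierarchy of a file can be rebuilt from that property alone. The sketch below prints tier IDs as an indented outline. TierStub is a stand-in defined here so the example runs without any .eaf file, and outline is a hypothetical helper, not part of rustling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TierStub:
    """Stand-in with the two Tier properties this sketch reads."""
    id: str
    parent_id: Optional[str] = None


def outline(file_tiers):
    """Return tier IDs as indented lines, children under their parents."""
    children = {}
    for tier in file_tiers.values():
        children.setdefault(tier.parent_id, []).append(tier.id)

    lines = []

    def walk(parent_id, depth):
        for tier_id in children.get(parent_id, []):
            lines.append("  " * depth + tier_id)
            walk(tier_id, depth + 1)

    walk(None, 0)  # root tiers have parent_id None
    return lines


file_tiers = {
    "CHI": TierStub("CHI"),
    "mor@CHI": TierStub("mor@CHI", parent_id="CHI"),
    "MOT": TierStub("MOT"),
}
print("\n".join(outline(file_tiers)))
```

Since the function only reads id and parent_id, one of the dictionaries returned by tiers() should be usable in place of file_tiers.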

An Annotation has:

  • id – Annotation ID (e.g., "a1").

  • start_time – Start time in milliseconds, or None if unresolvable.

  • end_time – End time in milliseconds, or None if unresolvable.

  • value – The annotation text content.

  • parent_id – Parent annotation ID for REF_ANNOTATION types, or None.
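
The times are described as possibly None because, in the EAF format, annotations point at time slots and a time slot may carry no TIME_VALUE. The sketch below resolves times from a tiny hand-written EAF fragment using only the standard library. It illustrates the file format, not rustling's parser, and whether rustling reports None in exactly this situation is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; real .eaf files contain more elements and attributes.
EAF_SNIPPET = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3"/>
  </TIME_ORDER>
  <TIER TIER_ID="CHI" LINGUISTIC_TYPE_REF="default-lt" PARTICIPANT="Child">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>more cookie .</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>yes</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

root = ET.fromstring(EAF_SNIPPET.strip())

# Map each time slot ID to milliseconds, or None when the slot is unaligned.
slots = {}
for slot in root.iter("TIME_SLOT"):
    value = slot.get("TIME_VALUE")
    slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

# Resolve (id, start_ms, end_ms, value) for every alignable annotation.
resolved = []
for ann in root.iter("ALIGNABLE_ANNOTATION"):
    resolved.append((
        ann.get("ANNOTATION_ID"),
        slots.get(ann.get("TIME_SLOT_REF1")),
        slots.get(ann.get("TIME_SLOT_REF2")),
        ann.findtext("ANNOTATION_VALUE"),
    ))

for row in resolved:
    print(row)
```

Annotation a2 ends at ts3, which has no TIME_VALUE, so its end time resolves to None here.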

Converting to CHAT

An ELAN reader can convert its data to CHAT format for use with CHILDES / TalkBank tools.

import rustling

elan = rustling.read_elan("recording.eaf")

# Convert to a CHAT object
chat = elan.to_chat()

# Or get CHAT-formatted strings
chat_strs = elan.to_chat_strs()

# Or write .cha files directly
elan.to_chat_files("output_dir/")

Tier mapping:

  • Parent (alignable) tiers become CHAT main tiers (e.g., *CHI:).

  • Child tiers whose ID matches {name}@{code} (e.g., mor@CHI) become CHAT dependent tiers (e.g., %mor:).

  • ELAN Tier.participant populates the CHAT @Participants line.
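
The naming rules above can be sketched as a small function that turns a tier ID into the CHAT line prefix it would map to. chat_tier_label is a hypothetical helper, and the regular expression encodes an assumption about {name}@{code}: exactly one "@" with a non-empty part on each side.

```python
import re

# Assumption: {name}@{code} means "<name>@<code>" with a single "@".
DEPENDENT_TIER_ID = re.compile(r"^(?P<name>[^@]+)@(?P<code>[^@]+)$")


def chat_tier_label(tier_id, parent_id=None):
    """Return the CHAT line prefix a tier ID would map to, or None."""
    if parent_id is None:
        # Parent tier: becomes a main tier such as "*CHI:".
        return f"*{tier_id}:"
    found = DEPENDENT_TIER_ID.match(tier_id)
    if found:
        # Child tier named like "mor@CHI": becomes "%mor:".
        return f"%{found.group('name')}:"
    # Child tiers with other names have no mapping in this sketch.
    return None


print(chat_tier_label("CHI"))
print(chat_tier_label("mor@CHI", parent_id="CHI"))
print(chat_tier_label("notes", parent_id="CHI"))
```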

Participant selection:

By default, only parent tiers with a 3-character ID are treated as CHAT main tiers (matching the standard CHAT convention of 3-letter participant codes like CHI, MOT, FAT). To override this, pass the participants keyword argument:

# Use specific tier IDs as CHAT main tiers
chat = elan.to_chat(participants=["Speaker1", "Speaker2"])

# Also works with to_chat_strs and to_chat_files
elan.to_chat_files("output_dir/", participants=["Speaker1", "Speaker2"])
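
The default selection rule is simple enough to state in a few lines. In the sketch below, tiers is a plain mapping from tier ID to parent tier ID, and select_main_tiers is a hypothetical helper. Restricting an explicit participants list to root tiers is an assumption; this page does not say how non-root IDs are handled.

```python
def select_main_tiers(tiers, participants=None):
    """tiers: mapping of tier ID -> parent tier ID (None for root tiers)."""
    roots = [tier_id for tier_id, parent_id in tiers.items() if parent_id is None]
    if participants is not None:
        # Assumption: an explicit list still only selects root tiers.
        wanted = set(participants)
        return [tier_id for tier_id in roots if tier_id in wanted]
    # Default: root tiers whose ID is exactly three characters long.
    return [tier_id for tier_id in roots if len(tier_id) == 3]


tiers = {"CHI": None, "MOT": None, "Speaker1": None, "mor@CHI": "CHI"}
print(select_main_tiers(tiers))
print(select_main_tiers(tiers, participants=["Speaker1"]))
```

With the default rule, a file whose speaker tiers are named Speaker1 and Speaker2 would yield no main tiers at all, which is the case the participants argument addresses.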

Converting to SRT

An ELAN reader can convert its data to SRT (SubRip Subtitle) format.

import rustling

elan = rustling.read_elan("recording.eaf")

# Convert to an SRT object
srt = elan.to_srt()

# Or get SRT-formatted strings
srt_strs = elan.to_srt_strs()

# Or write .srt files directly
elan.to_srt_files("output_dir/")

Mapping:

  • Each selected annotation with time marks becomes one subtitle block.

  • Annotations without time marks are skipped (SRT requires time ranges).

  • When multiple tiers are selected, the subtitle text is prefixed with the tier ID (e.g., "CHI: more cookie ."). For a single tier, no prefix is added.
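
The mapping rules above can be sketched with the standard library alone. The functions below format millisecond times as SRT timestamps (HH:MM:SS,mmm) and build numbered subtitle blocks. srt_timestamp and to_srt_text are hypothetical helpers, not rustling functions, and the caller decides through prefix_tier_id whether more than one tier is in play.

```python
def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp, e.g. 3725 -> 00:00:03,725."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt_text(rows, prefix_tier_id):
    """rows: (tier_id, start_ms, end_ms, text) tuples in output order."""
    blocks = []
    for tier_id, start, end, text in rows:
        if start is None or end is None:
            continue  # no time range, so no subtitle block
        line = f"{tier_id}: {text}" if prefix_tier_id else text
        blocks.append(
            f"{len(blocks) + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{line}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


rows = [
    ("CHI", 0, 1500, "more cookie ."),
    ("MOT", None, None, "untimed comment"),
    ("MOT", 1500, 3725, "you want more ?"),
]
print(to_srt_text(rows, prefix_tier_id=True))
```

The untimed row is dropped, and the remaining blocks are numbered consecutively from 1, as SRT players expect.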

Participant selection:

By default, only parent tiers with a 3-character ID are included (matching the standard CHAT convention). To override this, pass the participants keyword argument:

# Use specific tier IDs
srt = elan.to_srt(participants=["Speaker1", "Speaker2"])

# Also works with to_srt_strs and to_srt_files
elan.to_srt_files("output_dir/", participants=["Speaker1", "Speaker2"])

Converting to TextGrid

An ELAN reader can convert its data to TextGrid format for use with Praat.

import rustling

elan = rustling.read_elan("recording.eaf")

# Convert to a TextGrid object
textgrid = elan.to_textgrid()

# Or get TextGrid-formatted strings
textgrid_strs = elan.to_textgrid_strs()

# Or write .TextGrid files directly
elan.to_textgrid_files("output_dir/")

Mapping:

  • Each ELAN tier becomes an IntervalTier.

  • Annotations without time marks are skipped.

  • Times are converted from milliseconds to seconds.
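
The unit conversion and the skipping rule can be shown on plain tuples. In this sketch, to_intervals is a hypothetical helper, and each result is an (xmin, xmax, text) triple in seconds, mirroring the fields of a Praat interval. It does not write a TextGrid file.

```python
def to_intervals(annotations):
    """annotations: (start_ms, end_ms, text) -> (xmin_s, xmax_s, text)."""
    intervals = []
    for start, end, text in annotations:
        if start is None or end is None:
            continue  # annotations without time marks are skipped
        intervals.append((start / 1000, end / 1000, text))
    return intervals


annotations = [
    (0, 1500, "more cookie ."),
    (None, None, "untimed comment"),
    (1500, 3725, "you want more ?"),
]
for interval in to_intervals(annotations):
    print(interval)
```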

Collection Operations

An ELAN reader behaves like a collection of files. You can iterate, slice, combine, and modify it:

import rustling

elan = rustling.read_elan("path/to/corpus/")

# File count and paths
print(elan.n_files)
print(elan.file_paths)

# Iteration and slicing
for single_file in elan:
    print(single_file.n_files)  # 1

subset = elan[0:3]

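# Combining (elan1, elan2, and elan3 are separately loaded readers;
# the paths below are placeholders)
elan1 = rustling.read_elan("path/to/corpus_1/")
elan2 = rustling.read_elan("path/to/corpus_2/")
elan3 = rustling.read_elan("path/to/corpus_3/")
combined = elan1 + elan2
elan1 += elan2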

# Appending and extending
elan1.append(elan2)
elan1.extend([elan2, elan3])

# Removing
last = elan.pop()
first = elan.pop_left()
elan.clear()