rustling.wordseg

Word segmentation.

Package Contents

class rustling.wordseg.DAGHMMSegmenter(*, n_iter: int | None = None, tolerance: float | None = None, gamma: float | None = None, random_seed: int | None = None, features: Sequence[rustling.seq_feature.SeqFeatureTemplate] | None = None)

A DAG + HMM hybrid word segmenter (jieba-style).

Layer 1: Dictionary-based DAG with backward dynamic programming. Layer 2: HMM fallback (BMES tagger) for out-of-vocabulary spans.
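The first layer can be pictured with a small pure-Python sketch. It is an illustration of the general jieba-style idea (a DAG of dictionary matches, scored from right to left), not the library's implementation; the names build_dag and best_path, the toy frequency table, and the choice to leave unknown characters as single-character words (instead of passing them to an HMM) are all assumptions made for this example.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the inclusive end indices of dictionary words."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        # Assumption: fall back to the single character so every position has an edge.
        dag[i] = ends or [i]
    return dag


def best_path(sentence, freq):
    """Backward DP over the DAG, then read the best segmentation left to right."""
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    dag = build_dag(sentence, freq)
    # route[i] = (best log-probability of sentence[i:], end index of the first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


# Toy word frequencies (hypothetical data).
toy_freq = {"ab": 5, "a": 1, "b": 1, "abc": 1, "c": 3, "cd": 4, "d": 1}
print(best_path("abcd", toy_freq))
```

In a hybrid segmenter of this kind, runs of characters that the dictionary cannot explain would be handed to the second layer instead of being emitted one character at a time.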

fit_segmented(sents: Sequence[Sequence[str]]) -> None

Train the model with supervised segmented sentences.

Builds the dictionary from word frequencies and trains the HMM component on the same data.

Parameters:

sents – An iterable of segmented sentences (each sentence is a sequence of words).

fit_unsegmented(sent_strs: Sequence[str]) -> None

Refine the HMM component with unsupervised EM.

Parameters:

sent_strs – An iterable of unsegmented sentences.

score(sents: Sequence[Sequence[str]]) -> list[float]

Compute log-likelihood of segmented sentences under the model.

Uses the Forward algorithm on the HMM component.

Parameters:

sents – Segmented sentences (each sentence is a sequence of words).

Returns:

Log-likelihood for each sentence.

Raises:

ValueError – If the model has not been fitted.

predict(sent_strs: Sequence[str], *, offsets: Literal[False] = False) -> list[list[str]]
predict(sent_strs: Sequence[str], *, offsets: Literal[True]) -> list[list[tuple[str, tuple[int, int]]]]
predict(sent_strs: Sequence[str], *, offsets: bool = False) -> list[list[str]] | list[list[tuple[str, tuple[int, int]]]]

Segment the given unsegmented sentences.

Parameters:
  • sent_strs – An iterable of unsegmented sentences.

  • offsets – If True, return each word as a tuple of (word, (start, end)) where start and end are character indices (exclusive end, like Python slices).

Returns:

A list of segmented sentences. When offsets is True, each word is a (word, (start, end)) tuple.
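The offsets convention described above can be illustrated without the library. The helper below is a hypothetical sketch (with_offsets is not part of rustling) that attaches slice-style offsets to an already segmented sentence, assuming that "character indices" means positions in a Python str.

```python
def with_offsets(words):
    """Pair each word with its (start, end) span; end is exclusive."""
    out, start = [], 0
    for word in words:
        end = start + len(word)
        out.append((word, (start, end)))
        start = end
    return out


sentence = "abcde"
spans = with_offsets(["ab", "cd", "e"])
# Each span slices the original sentence back to its word.
for word, (start, end) in spans:
    assert sentence[start:end] == word
```

Because the end index is exclusive, consecutive words share a boundary value: the end of one word equals the start of the next.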

save(path: str | os.PathLike[str], metadata: dict[str, str]) -> None

Save the model and metadata to a zstd-compressed FlatBuffers binary.

Parameters:
  • path – The file path to save the model to. The .fb.zst file extension is recommended.

  • metadata – Arbitrary key-value metadata to store alongside the model (e.g., PUA character mappings).

load(path: str | os.PathLike[str]) -> dict[str, str]

Load a model and its metadata from a binary file, such as one written by save.

Parameters:

path – The file path to load the model from.

Returns:

The metadata dictionary stored in the file.

class rustling.wordseg.HiddenMarkovModelSegmenter(*, n_iter: int = 1, tolerance: float = 0.0, gamma: float = 1.0, random_seed: int | None = None, features: Sequence[rustling.seq_feature.SeqFeatureTemplate] | None = None)

An HMM-based word segmenter using supervised BMES tagging.

This model uses a Hidden Markov Model where the hidden states are BMES (Begin/Middle/End/Single) labels and the observations are characters. Training directly computes HMM parameters from supervised data. Decoding uses the Viterbi algorithm.
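To make the BMES scheme concrete, the following sketch converts a segmented sentence into one label per character and back. The function names are hypothetical and this is not the library's code; it only shows how word boundaries and BMES labels correspond.

```python
def words_to_bmes(words):
    """Return one BMES label per character of the segmented sentence."""
    tags = []
    for word in words:
        if not word:
            continue  # Skip empty strings; they carry no characters.
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES labels."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # A word ends on E or S.
            words.append(current)
            current = ""
    if current:
        # Tolerate a label sequence that ends mid-word.
        words.append(current)
    return words


tags = words_to_bmes(["abc", "d", "ef"])
print(tags)
print(bmes_to_words("abcdef", tags))
```

A supervised HMM over these labels would count label-to-label transitions and label-to-character emissions in the training data; decoding then picks the most likely label sequence and applies the second function.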

fit_segmented(sents: Sequence[Sequence[str]]) -> None

Train the model with supervised segmented sentences.

No cleaning or preprocessing (e.g., case normalization or tokenization) is performed on the training data.

Parameters:

sents – An iterable of segmented sentences (each sentence is a sequence of words).

fit_unsegmented(sent_strs: Sequence[str]) -> None

Train the model on unsegmented sentences without supervision.

Uses the Baum-Welch (EM) algorithm. If the model was previously fitted (e.g., via fit_segmented), the existing parameters serve as EM initialization (warm start).

Parameters:

sent_strs – An iterable of unsegmented sentences.

score(sents: Sequence[Sequence[str]]) -> list[float]

Compute log-likelihood of segmented sentences under the model.

Uses the Forward algorithm on the underlying HMM.

Parameters:

sents – Segmented sentences (each sentence is a sequence of words).

Returns:

Log-likelihood for each sentence.

Raises:

ValueError – If the model has not been fitted.
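For readers unfamiliar with the Forward algorithm, here is a generic log-space sketch on a toy two-state HMM. It is not the library's implementation, and how rustling maps a segmented sentence onto HMM observations is not shown here; the state names, probabilities, and function names below are all made up for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    return math.log(p) if p > 0 else -math.inf


def forward_log_likelihood(obs, states, start, trans, emit):
    """Log-probability of the observations, summed over all state paths."""
    if not obs:
        return 0.0
    alpha = {s: safe_log(start[s]) + safe_log(emit[s].get(obs[0], 0.0)) for s in states}
    for o in obs[1:]:
        alpha = {
            s: logsumexp([alpha[r] + safe_log(trans[r][s]) for r in states])
            + safe_log(emit[s].get(o, 0.0))
            for s in states
        }
    return logsumexp(list(alpha.values()))


# Hypothetical toy model with two hidden states.
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
log_likelihood = forward_log_likelihood("xy", states, start, trans, emit)
```

Working in log space avoids underflow on long sentences, which is why the returned values are log-likelihoods and not raw probabilities.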

predict(sent_strs: Sequence[str], *, offsets: Literal[False] = False) -> list[list[str]]
predict(sent_strs: Sequence[str], *, offsets: Literal[True]) -> list[list[tuple[str, tuple[int, int]]]]
predict(sent_strs: Sequence[str], *, offsets: bool = False) -> list[list[str]] | list[list[tuple[str, tuple[int, int]]]]

Segment the given unsegmented sentences.

Parameters:
  • sent_strs – An iterable of unsegmented sentences.

  • offsets – If True, return each word as a tuple of (word, (start, end)) where start and end are character indices (exclusive end, like Python slices).

Returns:

A list of segmented sentences. When offsets is True, each word is a (word, (start, end)) tuple.

save(path: str | os.PathLike[str]) -> None

Save the model to a zstd-compressed FlatBuffers binary.

Parameters:

path – The path where the model will be saved. The .fb.zst file extension is recommended.

load(path: str | os.PathLike[str]) -> None

Load a model.

Parameters:

path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.

class rustling.wordseg.LongestStringMatching(*, max_word_length: int)

Longest string matching segmenter.

This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
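The left-to-right greedy procedure can be sketched in a few lines of plain Python. This is an illustration only: the function name, the explicit vocab set, and the fallback to a single character when nothing in the vocabulary matches are assumptions for the example, not documented behavior of the class.

```python
def longest_string_matching(sentence, vocab, max_word_length):
    """Greedy left-to-right segmentation by longest vocabulary match."""
    if max_word_length < 1:
        raise ValueError("max_word_length must be at least 1")
    words, i, n = [], 0, len(sentence)
    while i < n:
        # Try the longest allowed candidate first, then shrink.
        for length in range(min(max_word_length, n - i), 0, -1):
            candidate = sentence[i:i + length]
            # Assumption: an unmatched single character becomes its own word.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


vocab = {"ab", "abc", "cd", "d"}
print(longest_string_matching("abcd", vocab, max_word_length=3))
print(longest_string_matching("abcd", vocab, max_word_length=2))
```

The two calls show how the maximum word length changes the result: with a limit of 3 the match "abc" wins at the first position, while a limit of 2 forces "ab" followed by "cd".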

fit(sents: Sequence[Sequence[str]]) -> None

Train the model with the input segmented sentences.

No cleaning or preprocessing (e.g., case normalization or tokenization) is performed on the training data.

Parameters:

sents – An iterable of segmented sentences (each sentence is a sequence of words).

predict(sent_strs: Sequence[str], *, offsets: Literal[False] = False) -> list[list[str]]
predict(sent_strs: Sequence[str], *, offsets: Literal[True]) -> list[list[tuple[str, tuple[int, int]]]]
predict(sent_strs: Sequence[str], *, offsets: bool = False) -> list[list[str]] | list[list[tuple[str, tuple[int, int]]]]

Segment the given unsegmented sentences.

Parameters:
  • sent_strs – An iterable of unsegmented sentences.

  • offsets – If True, return each word as a tuple of (word, (start, end)) where start and end are character indices (exclusive end, like Python slices).

Returns:

A list of segmented sentences. When offsets is True, each word is a (word, (start, end)) tuple.

save(path: str | os.PathLike[str]) -> None

Save the model to a zstd-compressed FlatBuffers binary.

Parameters:

path – The path where the model will be saved. The .fb.zst file extension is recommended.

load(path: str | os.PathLike[str]) -> None

Load a model.

Parameters:

path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.

class rustling.wordseg.RandomSegmenter(*, prob: float)

A random segmenter.

Each potential word boundary (the position between two adjacent characters) is independently predicted to be an actual boundary with probability prob. No training is required.
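A minimal sketch of this idea, assuming a boundary is placed between two adjacent characters whenever a uniform random draw falls below the probability. The function name and the explicit random.Random argument are choices made for this example and are not part of the class's documented interface.

```python
import random


def random_segment(sentence, prob, rng):
    """Split the sentence at each inter-character position with probability prob."""
    if not sentence:
        return []
    words, current = [], sentence[0]
    for ch in sentence[1:]:
        if rng.random() < prob:
            # Boundary drawn: close the current word and start a new one.
            words.append(current)
            current = ch
        else:
            current += ch
    words.append(current)
    return words


rng = random.Random(0)  # Seeded only to make the example repeatable.
print(random_segment("abcdef", 0.5, rng))
```

With a probability of 0 the sentence comes back as one word, and with a probability of 1 every character becomes its own word, which makes this kind of segmenter useful mainly as a baseline.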

predict(sent_strs: Sequence[str], *, offsets: Literal[False] = False) -> list[list[str]]
predict(sent_strs: Sequence[str], *, offsets: Literal[True]) -> list[list[tuple[str, tuple[int, int]]]]
predict(sent_strs: Sequence[str], *, offsets: bool = False) -> list[list[str]] | list[list[tuple[str, tuple[int, int]]]]

Segment the given unsegmented sentences.

Parameters:
  • sent_strs – An iterable of unsegmented sentences.

  • offsets – If True, return each word as a tuple of (word, (start, end)) where start and end are character indices (exclusive end, like Python slices).

Returns:

A list of segmented sentences. When offsets is True, each word is a (word, (start, end)) tuple.