rustling.wordseg
Word segmentation.
Package Contents
- class rustling.wordseg.DAGHMMSegmenter(*, n_iter: int | None = None, tolerance: float | None = None, gamma: float | None = None, random_seed: int | None = None, features: Sequence[rustling.seq_feature.SeqFeatureTemplate] | None = None)
A DAG + HMM hybrid word segmenter (jieba-style).
Layer 1: Dictionary-based DAG with backward dynamic programming. Layer 2: HMM fallback (BMES tagger) for out-of-vocabulary spans.
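To make Layer 1 concrete, here is a minimal pure-Python sketch of a dictionary-based DAG with backward dynamic programming. It is an illustration under stated assumptions, not this class's implementation: the names build_dag and segment_by_dag are hypothetical, the toy frequency table is made up, and characters that start no dictionary word are given a pseudo-count of 1. The HMM fallback of Layer 2 is left out, so out-of-vocabulary spans simply fall apart into single characters.

```python
import math


def build_dag(sent, freq):
    """For each start index, list the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sent)):
        ends = [j for j in range(i + 1, len(sent) + 1) if sent[i:j] in freq]
        # Assumption: if no dictionary word starts here, fall back to one character.
        dag[i] = ends or [i + 1]
    return dag


def segment_by_dag(sent, freq):
    """Pick the highest log-probability path through the DAG, computed right to left."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sent, freq)
    n = len(sent)
    # route[i] = (best log-probability of sent[i:], end index of the first word on that path)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            # Assumption: unseen single characters get a pseudo-count of 1.
            (math.log(freq.get(sent[i:j], 1)) - log_total + route[j][0], j)
            for j in dag[i]
        )
    # Follow the stored end indices from the start of the sentence.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sent[i:j])
        i = j
    return words


# Hypothetical toy dictionary of word frequencies.
FREQ = {"the": 10, "cat": 5, "cats": 1, "sat": 4, "at": 2}
print(segment_by_dag("thecatsat", FREQ))
```

With this toy dictionary, the path "the", "cat", "sat" outscores "the", "cats", "at" because "cats" is rare. The backward pass lets that later evidence decide the earlier boundary.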
- fit_segmented(sents: Sequence[Sequence[str]]) -> None
Train the model with supervised segmented sentences.
Builds the dictionary from word frequencies and trains the HMM component on the same data.
- Parameters:
sents – An iterable of segmented sentences (each sentence is a sequence of words).
- fit_unsegmented(sent_strs: Sequence[str]) -> None
Refine the HMM component with unsupervised EM.
- Parameters:
sent_strs – An iterable of unsegmented sentences.
- predict(sent_strs: Sequence[str]) -> list[list[str]]
Segment the given unsegmented sentences.
- Parameters:
sent_strs – An iterable of unsegmented sentences.
- Returns:
A list of segmented sentences.
- save(path: str | os.PathLike[str], metadata: dict[str, str]) -> None
Save the model and metadata to a zstd-compressed FlatBuffers binary.
- Parameters:
path – The file path to save the model to. The file extension .fb.zst is recommended.
metadata – Arbitrary key-value metadata to store alongside the model (e.g., PUA character mappings).
- class rustling.wordseg.HiddenMarkovModelSegmenter(*, n_iter: int = 1, tolerance: float = 0.0, gamma: float = 1.0, random_seed: int | None = None, features: Sequence[rustling.seq_feature.SeqFeatureTemplate] | None = None)
An HMM-based word segmenter using supervised BMES tagging.
This model uses a Hidden Markov Model where the hidden states are BMES (Begin/Middle/End/Single) labels and the observations are characters. Training directly computes HMM parameters from supervised data. Decoding uses the Viterbi algorithm.
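The BMES scheme described above can be sketched in a few lines of pure Python. This only illustrates the tagging convention and how supervised counts could be collected from it; the helper names words_to_tags and tags_to_words are hypothetical and are not part of rustling, and the class's actual parameter estimation may differ (for example, in smoothing).

```python
from collections import Counter


def words_to_tags(words):
    """Convert a segmented sentence into one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, then End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


sent = ["i", "like", "nlp"]
chars = "".join(sent)
tags = words_to_tags(sent)

# Supervised training can count tag transitions and (tag, character) emissions
# directly; normalizing such counts gives the HMM probabilities.
transitions = Counter(zip(tags, tags[1:]))
emissions = Counter(zip(tags, chars))

print(tags)
print(tags_to_words(chars, tags))
```

Decoding then reduces to finding the most likely tag sequence for the characters (the Viterbi step) and cutting the sentence after every E or S tag, as tags_to_words does.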
- fit_segmented(sents: Sequence[Sequence[str]]) -> None
Train the model with supervised segmented sentences.
No cleaning or preprocessing (e.g., normalizing upper/lowercase, tokenization) is performed on the training data.
- Parameters:
sents – An iterable of segmented sentences (each sentence is a sequence of words).
- fit_unsegmented(sent_strs: Sequence[str]) -> None
Train the model with unsupervised unsegmented sentences.
Uses the Baum-Welch (EM) algorithm. If the model was previously fitted (e.g., via fit_segmented), the existing parameters serve as EM initialization (warm start).
- Parameters:
sent_strs – An iterable of unsegmented sentences.
- predict(sent_strs: Sequence[str]) -> list[list[str]]
Segment the given unsegmented sentences.
- Parameters:
sent_strs – An iterable of unsegmented sentences.
- Returns:
A list of segmented sentences.
- save(path: str | os.PathLike[str]) -> None
Save the model to a zstd-compressed FlatBuffers binary.
- Parameters:
path – The path where the model will be saved. The file extension .fb.zst is recommended.
- load(path: str | os.PathLike[str]) -> None
Load a model.
- Parameters:
path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.
- class rustling.wordseg.LongestStringMatching(*, max_word_length: int)
Longest string matching segmenter.
This model constructs predicted words by moving from left to right along an unsegmented sentence and finding the longest matching words, constrained by a maximum word length parameter.
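The left-to-right procedure described above can be sketched as follows. This is a pure-Python illustration rather than the library's code: the function name longest_string_matching and the toy vocabulary are hypothetical, and the fallback to a single character when nothing in the vocabulary matches is an assumption.

```python
def longest_string_matching(sent, vocab, max_word_length):
    """Greedy left-to-right segmentation using the longest known word at each position."""
    words = []
    i = 0
    while i < len(sent):
        # Try the longest allowed candidate first, then shrink one character at a time.
        longest = min(max_word_length, len(sent) - i)
        for length in range(longest, 0, -1):
            candidate = sent[i:i + length]
            # Assumption: an unmatched single character becomes its own word.
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical vocabulary, e.g. the set of words seen in training data.
VOCAB = {"the", "cat", "cats", "sat"}
print(longest_string_matching("thecatsat", VOCAB, max_word_length=3))
print(longest_string_matching("thecatsat", VOCAB, max_word_length=4))
```

The two calls show why the maximum word length matters. With a limit of 3 the result is "the", "cat", "sat". With a limit of 4 the greedy match takes "cats" and leaves "a" and "t" stranded, since this approach never revisits an earlier choice.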
- fit(sents: Sequence[Sequence[str]]) -> None
Train the model with the input segmented sentences.
No cleaning or preprocessing (e.g., normalizing upper/lowercase, tokenization) is performed on the training data.
- Parameters:
sents – An iterable of segmented sentences (each sentence is a sequence of words).
- predict(sent_strs: Sequence[str]) -> list[list[str]]
Segment the given unsegmented sentences.
- Parameters:
sent_strs – An iterable of unsegmented sentences.
- Returns:
A list of segmented sentences.
- save(path: str | os.PathLike[str]) -> None
Save the model to a zstd-compressed FlatBuffers binary.
- Parameters:
path – The path where the model will be saved. The file extension .fb.zst is recommended.
- load(path: str | os.PathLike[str]) -> None
Load a model.
- Parameters:
path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.