rustling.lm
Language models.
Package Contents
- class rustling.lm.MLE(*, order: int)
Maximum Likelihood Estimation language model.
An n-gram language model with no smoothing.
- fit(sents: Sequence[Sequence[str]]) -> None
Train the language model on tokenized sentences.
Each sentence is a list of tokens. The model extracts n-grams of all orders from 1 to the model order and counts their occurrences. Sentences are automatically padded with <s> and </s> tokens.
- Parameters:
sents – An iterable of tokenized sentences.
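The counting step described for fit can be sketched in plain Python. This is an illustration only, not rustling's implementation: `count_ngrams` is a hypothetical helper, and the amount of padding (order - 1 start symbols, one end symbol) is an assumption, since the text above does not state it.

```python
from collections import Counter
from typing import Sequence


def count_ngrams(sents: Sequence[Sequence[str]], order: int) -> Counter:
    """Count n-grams of every order from 1 to `order` (hypothetical helper)."""
    counts: Counter = Counter()
    for sent in sents:
        # Assumption: order - 1 start symbols and one end symbol per sentence.
        padded = ["<s>"] * (order - 1) + list(sent) + ["</s>"]
        for n in range(1, order + 1):
            # Slide a window of width n over the padded sentence.
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i : i + n])] += 1
    return counts


counts = count_ngrams([["the", "cat"], ["the", "dog"]], order=2)
print(counts[("the",)], counts[("<s>", "the")], counts[("the", "cat")])
```

Keys are tuples, so unigrams and bigrams can share a single table without colliding.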
- score(word: str, context: Sequence[str] | None = None) -> float
Return the probability of a word given a context.
Maps out-of-vocabulary words to <UNK> via the vocabulary.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
The probability P(word | context).
- Raises:
ValueError – If the model has not been fitted yet.
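For an unsmoothed model, P(word | context) is conventionally the relative frequency count(context, word) / count(context). A minimal sketch of that calculation follows. `mle_score` and the hand-written `counts` table are hypothetical and only illustrate the estimate; how rustling normalizes internally is not stated above.

```python
from collections import Counter
from typing import Sequence


def mle_score(counts: Counter, word: str, context: Sequence[str] = ()) -> float:
    """Relative frequency count(context + word) / count(context)."""
    context = tuple(context)
    if context:
        denom = counts[context]
    else:
        # Unigram case: normalize by the total number of unigram tokens.
        denom = sum(c for ngram, c in counts.items() if len(ngram) == 1)
    if denom == 0:
        return 0.0  # unseen context: no probability mass to distribute
    return counts[context + (word,)] / denom


# Toy counts keyed by n-gram tuples (illustrative data, not library output).
counts = Counter({
    ("the",): 2, ("cat",): 1, ("dog",): 1,
    ("the", "cat"): 1, ("the", "dog"): 1,
})
print(mle_score(counts, "cat", ["the"]))   # seen bigram
print(mle_score(counts, "bird", ["the"]))  # unseen bigram
```

Without smoothing, any n-gram absent from the training data receives probability 0.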
- unmasked_score(word: str, context: Sequence[str] | None = None) -> float
Return the probability of a word given a context, without OOV mapping.
Unlike score, this method does not map out-of-vocabulary words to <UNK>.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
The probability P(word | context).
- Raises:
ValueError – If the model has not been fitted yet.
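The difference between score and unmasked_score is the lookup step that replaces unknown words with <UNK>. A rough sketch of that step, using the hypothetical helpers `mask` and `mask_all`, is shown here. Whether rustling also masks the context words is an assumption; the text above only says that out-of-vocabulary words are mapped.

```python
from typing import Sequence

UNK = "<UNK>"


def mask(word: str, vocab: set) -> str:
    """Return the word if it is in the vocabulary, otherwise the <UNK> label."""
    return word if word in vocab else UNK


def mask_all(words: Sequence[str], vocab: set) -> tuple:
    """Apply `mask` to every word of a context (assumed behaviour)."""
    return tuple(mask(w, vocab) for w in words)


vocab = {"<s>", "</s>", UNK, "the", "cat"}
print(mask("cat", vocab), mask("zebra", vocab))
print(mask_all(["the", "zebra"], vocab))
```

Under this reading, a masked score is the unmasked score evaluated on the masked word and context.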
- logscore(word: str, context: Sequence[str] | None = None) -> float
Return the log (base 2) probability of a word given a context.
Maps out-of-vocabulary words to <UNK> via the vocabulary.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
log2(P(word | context)). Returns negative infinity if the probability is 0.
- Raises:
ValueError – If the model has not been fitted yet.
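The zero-probability case needs explicit handling, because Python's `math.log2` raises ValueError for 0. A small sketch of the documented return value, with `log2_score` as a hypothetical stand-in that takes an already-computed probability:

```python
import math


def log2_score(prob: float) -> float:
    """Base-2 log of a probability, with 0 mapped to negative infinity."""
    # math.log2(0.0) raises ValueError, so the zero case is handled explicitly.
    return math.log2(prob) if prob > 0 else float("-inf")


print(log2_score(0.5))
print(log2_score(0.0))
```

Working in log space lets the scores of a sequence be summed instead of multiplied, which avoids floating-point underflow on long inputs.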
- generate(*, num_words: int = 1, text_seed: Sequence[str] | None = None, random_seed: int | None = None) -> list[str]
Generate words from the language model.
Uses weighted random sampling from the conditional distribution. Generation stops early if </s> (end-of-sentence) is sampled or if no continuations are available for the current context.
- Parameters:
num_words – Number of words to generate.
text_seed – Seed text (context to start from). Defaults to beginning-of-sentence context.
random_seed – Random seed for reproducibility.
- Returns:
A list of generated words.
- Raises:
ValueError – If the model has not been fitted yet.
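The sampling loop described for generate might look roughly like the following. This is a sketch under stated assumptions, not rustling's code: `generate_words` and the `toy` distribution are hypothetical, the default start context is assumed to be order - 1 start symbols, and the end symbol is assumed not to appear in the returned list.

```python
import random
from typing import Mapping, Optional, Sequence


def generate_words(
    dist: Mapping[tuple, Mapping[str, float]],
    *,
    order: int,
    num_words: int = 1,
    text_seed: Optional[Sequence[str]] = None,
    random_seed: Optional[int] = None,
) -> list:
    """Sample up to `num_words` words from `dist` (hypothetical helper)."""
    rng = random.Random(random_seed)
    # Assumption: with no seed text, start from order - 1 start symbols.
    history = list(text_seed) if text_seed is not None else ["<s>"] * (order - 1)
    generated = []
    for _ in range(num_words):
        # Condition on the last order - 1 words (empty tuple for a unigram model).
        start = max(0, len(history) - (order - 1))
        context = tuple(history[start:])
        continuations = dist.get(context)
        if not continuations:
            break  # no continuations are available for this context
        words = list(continuations)
        weights = [continuations[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "</s>":
            break  # assumption: the end symbol itself is not returned
        generated.append(word)
        history.append(word)
    return generated


# Toy bigram distribution: context tuple -> {next word: probability}.
toy = {
    ("<s>",): {"the": 1.0},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("cat",): {"</s>": 1.0},
    ("dog",): {"</s>": 1.0},
}
sample = generate_words(toy, order=2, num_words=10, random_seed=0)
print(sample)
```

A private `random.Random` instance keeps a fixed seed reproducible without touching the global random state.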
- save(path: str | os.PathLike[str]) -> None
Save the model to a zstd-compressed FlatBuffers binary.
- Parameters:
path – The path where the model will be saved. The file extension .fb.zst is recommended.
- load(path: str | os.PathLike[str]) -> None
Load a model.
- Parameters:
path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.
- Raises:
FileNotFoundError – If the file does not exist.
EnvironmentError – If the file cannot be read as a language model, or if its smoothing method or order does not match this model.
- class rustling.lm.Lidstone(*, order: int, gamma: float)
Lidstone (additive) smoothing language model.
An n-gram language model with Lidstone smoothing, which adds a constant gamma to all counts.
- fit(sents: Sequence[Sequence[str]]) -> None
Train the language model on tokenized sentences.
Each sentence is a list of tokens. The model extracts n-grams of all orders from 1 to the model order and counts their occurrences. Sentences are automatically padded with <s> and </s> tokens.
- Parameters:
sents – An iterable of tokenized sentences.
- score(word: str, context: Sequence[str] | None = None) -> float
Return the probability of a word given a context.
Maps out-of-vocabulary words to <UNK> via the vocabulary.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
The probability P(word | context).
- Raises:
ValueError – If the model has not been fitted yet.
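With additive smoothing, the textbook estimate is (count(context, word) + gamma) / (count(context) + gamma * V), where V is the vocabulary size. The sketch below applies that formula. `lidstone_score` is a hypothetical helper, and exactly which symbols rustling counts toward V (for example <UNK> and the padding tokens) is an assumption left open here.

```python
def lidstone_score(
    ngram_count: int, context_count: int, vocab_size: int, gamma: float
) -> float:
    """Additively smoothed estimate of P(word | context)."""
    return (ngram_count + gamma) / (context_count + gamma * vocab_size)


# A context seen twice, followed once each by two of five vocabulary words.
probs = [lidstone_score(c, 2, 5, 0.5) for c in [1, 1, 0, 0, 0]]
print(probs)
print(sum(probs))
```

Setting gamma to 1 gives Laplace (add-one) smoothing, and gamma = 0 reduces to the unsmoothed relative frequency.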
- unmasked_score(word: str, context: Sequence[str] | None = None) -> float
Return the probability of a word given a context, without OOV mapping.
Unlike score, this method does not map out-of-vocabulary words to <UNK>.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
The probability P(word | context).
- Raises:
ValueError – If the model has not been fitted yet.
- logscore(word: str, context: Sequence[str] | None = None) -> float
Return the log (base 2) probability of a word given a context.
Maps out-of-vocabulary words to <UNK> via the vocabulary.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
log2(P(word | context)). Returns negative infinity if the probability is 0.
- Raises:
ValueError – If the model has not been fitted yet.
- generate(*, num_words: int = 1, text_seed: Sequence[str] | None = None, random_seed: int | None = None) -> list[str]
Generate words from the language model.
Uses weighted random sampling from the conditional distribution. Generation stops early if </s> (end-of-sentence) is sampled or if no continuations are available for the current context.
- Parameters:
num_words – Number of words to generate.
text_seed – Seed text (context to start from). Defaults to beginning-of-sentence context.
random_seed – Random seed for reproducibility.
- Returns:
A list of generated words.
- Raises:
ValueError – If the model has not been fitted yet.
- save(path: str | os.PathLike[str]) -> None
Save the model to a zstd-compressed FlatBuffers binary.
- Parameters:
path – The path where the model will be saved. The file extension .fb.zst is recommended.
- load(path: str | os.PathLike[str]) -> None
Load a model.
- Parameters:
path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.
- Raises:
FileNotFoundError – If the file does not exist.
EnvironmentError – If the file cannot be read as a language model, or if its smoothing method, order, or gamma does not match this model.
- class rustling.lm.Laplace(*, order: int)
Laplace (add-one) smoothing language model.
An n-gram language model with Laplace smoothing (Lidstone with gamma=1).
- fit(sents: Sequence[Sequence[str]]) -> None
Train the language model on tokenized sentences.
Each sentence is a list of tokens. The model extracts n-grams of all orders from 1 to the model order and counts their occurrences. Sentences are automatically padded with <s> and </s> tokens.
- Parameters:
sents – An iterable of tokenized sentences.
- score(word: str, context: Sequence[str] | None = None) -> float
Return the probability of a word given a context.
Maps out-of-vocabulary words to <UNK> via the vocabulary.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
The probability P(word | context).
- Raises:
ValueError – If the model has not been fitted yet.
- unmasked_score(word: str, context: Sequence[str] | None = None) -> float
Return the probability of a word given a context, without OOV mapping.
Unlike score, this method does not map out-of-vocabulary words to <UNK>.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
The probability P(word | context).
- Raises:
ValueError – If the model has not been fitted yet.
- logscore(word: str, context: Sequence[str] | None = None) -> float
Return the log (base 2) probability of a word given a context.
Maps out-of-vocabulary words to <UNK> via the vocabulary.
- Parameters:
word – The word to score.
context – The preceding context words.
- Returns:
log2(P(word | context)). Returns negative infinity if the probability is 0.
- Raises:
ValueError – If the model has not been fitted yet.
- generate(*, num_words: int = 1, text_seed: Sequence[str] | None = None, random_seed: int | None = None) -> list[str]
Generate words from the language model.
Uses weighted random sampling from the conditional distribution. Generation stops early if </s> (end-of-sentence) is sampled or if no continuations are available for the current context.
- Parameters:
num_words – Number of words to generate.
text_seed – Seed text (context to start from). Defaults to beginning-of-sentence context.
random_seed – Random seed for reproducibility.
- Returns:
A list of generated words.
- Raises:
ValueError – If the model has not been fitted yet.
- save(path: str | os.PathLike[str]) -> None
Save the model to a zstd-compressed FlatBuffers binary.
- Parameters:
path – The path where the model will be saved. The file extension .fb.zst is recommended.
- load(path: str | os.PathLike[str]) -> None
Load a model.
- Parameters:
path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.
- Raises:
FileNotFoundError – If the file does not exist.
EnvironmentError – If the file cannot be read as a language model, or if its smoothing method or order does not match this model.