rustling.lm

Language models.

Package Contents

class rustling.lm.MLE(*, order: int)

Maximum Likelihood Estimation language model.

An n-gram language model with no smoothing.

fit(sents: Sequence[Sequence[str]]) → None

Train the language model on tokenized sentences.

Each sentence is a list of tokens. The model extracts n-grams of all orders from 1 to the model order and counts their occurrences. Sentences are automatically padded with <s> and </s> tokens.

Parameters:

sents – A sequence of tokenized sentences.
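The counting step described for fit can be illustrated in plain Python. This is a minimal sketch, not the library's implementation: the helper name count_ngrams is hypothetical, and the padding scheme (order - 1 start symbols and a single end symbol) is an assumption, since the text above only says that sentences are padded with <s> and </s>.

```python
from collections import Counter
from typing import Dict, Sequence

BOS, EOS = "<s>", "</s>"


def count_ngrams(sents: Sequence[Sequence[str]], order: int) -> Dict[int, Counter]:
    """Count n-grams of every order from 1 to `order` over padded sentences."""
    counts = {n: Counter() for n in range(1, order + 1)}
    for sent in sents:
        # Assumed padding scheme: (order - 1) start symbols, one end symbol.
        padded = [BOS] * (order - 1) + list(sent) + [EOS]
        for n in range(1, order + 1):
            # Slide a window of width n over the padded sentence.
            for i in range(len(padded) - n + 1):
                counts[n][tuple(padded[i : i + n])] += 1
    return counts


counts = count_ngrams([["the", "cat", "sat"], ["the", "dog", "sat"]], order=2)
print(counts[1][("the",)])       # unigram count of "the"
print(counts[2][("the", "cat")])  # bigram count of "the cat"
```

Keeping one counter per order makes it straightforward to look up both an n-gram and its (n-1)-gram context when computing conditional probabilities.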

score(word: str, context: Sequence[str] | None = None) → float

Return the probability of a word given a context.

Maps out-of-vocabulary words to <UNK> via the vocabulary.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

The probability P(word | context).

Raises:

ValueError – If the model has not been fitted yet.

unmasked_score(word: str, context: Sequence[str] | None = None) → float

Return the probability of a word given a context, without OOV mapping.

Unlike score, this method does not map out-of-vocabulary words to <UNK>.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

The probability P(word | context).

Raises:

ValueError – If the model has not been fitted yet.
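The difference between score and unmasked_score can be sketched with toy bigram counts. The following is an illustration in plain Python rather than the library's code: the count tables are invented, and mapping context words (not only the scored word) to <UNK> is an assumption.

```python
from collections import Counter

UNK = "<UNK>"

# Toy bigram model; these counts are invented for illustration only.
bigram_counts = Counter({("the", "cat"): 1, ("the", "dog"): 1, ("the", UNK): 2})
context_counts = Counter({("the",): 4})
vocab = {"the", "cat", "dog", UNK}


def unmasked_score(word, context=None):
    """Relative frequency count(context + word) / count(context), no OOV mapping."""
    ctx = tuple(context or ())
    total = context_counts[ctx]
    if total == 0:
        return 0.0
    return bigram_counts[ctx + (word,)] / total


def score(word, context=None):
    """Map out-of-vocabulary tokens to <UNK>, then delegate to unmasked_score."""
    word = word if word in vocab else UNK
    ctx = [w if w in vocab else UNK for w in (context or ())]
    return unmasked_score(word, ctx)


print(score("cat", ["the"]))             # 0.25
print(score("zebra", ["the"]))           # 0.5, scored as <UNK>
print(unmasked_score("zebra", ["the"]))  # 0.0, "zebra" was never counted
```

With an unsmoothed (MLE) model, an unseen word scored without OOV mapping gets probability 0, which is why the masked variant is usually the one to call on arbitrary input.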

logscore(word: str, context: Sequence[str] | None = None) → float

Return the log (base 2) probability of a word given a context.

Maps out-of-vocabulary words to <UNK> via the vocabulary.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

log2(P(word | context)). Returns negative infinity if the probability is 0.

Raises:

ValueError – If the model has not been fitted yet.
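The relationship between a probability and its log score, including the zero-probability case, can be sketched as follows. The helper name log2_prob is hypothetical and is not part of rustling.lm.

```python
import math


def log2_prob(prob: float) -> float:
    """Return log2(prob), or negative infinity when prob is 0."""
    return math.log2(prob) if prob > 0 else float("-inf")


print(log2_prob(0.25))  # -2.0
print(log2_prob(0.0))   # -inf

# Log scores add where probabilities multiply, which avoids underflow
# when scoring long sequences.
sentence_logprob = sum(log2_prob(p) for p in [0.5, 0.25, 0.5])
print(sentence_logprob)  # -4.0
```

The explicit guard matters because math.log2(0) raises ValueError rather than returning negative infinity.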

generate(*, num_words: int = 1, text_seed: Sequence[str] | None = None, random_seed: int | None = None) → list[str]

Generate words from the language model.

Uses weighted random sampling from the conditional distribution. Generation stops early if </s> (end-of-sentence) is sampled or if no continuations are available for the current context.

Parameters:

num_words – Number of words to generate.
text_seed – Seed text (context to start from). Defaults to beginning-of-sentence context.
random_seed – Random seed for reproducibility.

Returns:

A list of generated words.

Raises:

ValueError – If the model has not been fitted yet.
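The sampling loop and its two early-stop conditions can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the toy distribution table is invented, the model is treated as a bigram model, and leaving </s> out of the returned list is an assumption.

```python
import random
from typing import List, Optional, Sequence

BOS, EOS = "<s>", "</s>"

# Toy conditional distributions keyed by a one-word context (invented data).
next_word_weights = {
    (BOS,): {"the": 1.0},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("cat",): {"sat": 1.0},
    ("dog",): {"sat": 1.0},
    ("sat",): {EOS: 1.0},
}


def generate(num_words: int = 1,
             text_seed: Optional[Sequence[str]] = None,
             random_seed: Optional[int] = None) -> List[str]:
    rng = random.Random(random_seed)  # seeded generator for reproducibility
    # Default to the beginning-of-sentence context when no seed text is given.
    context = tuple(text_seed[-1:]) if text_seed else (BOS,)
    words: List[str] = []
    for _ in range(num_words):
        dist = next_word_weights.get(context)
        if not dist:
            break  # no continuations available for this context
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == EOS:
            break  # end-of-sentence sampled
        words.append(word)
        context = (word,)
    return words


print(generate(num_words=10, random_seed=0))
```

Because of the early stops, the returned list can be shorter than num_words; here at most three words are produced before </s> is sampled.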

property order: int

The order of the n-gram model.

property vocab_size: int

The vocabulary size (including special tokens).

save(path: str | os.PathLike[str]) → None

Save the model to a zstd-compressed FlatBuffers binary.

Parameters:

path – The path where the model will be saved. The file extension .fb.zst is recommended.

load(path: str | os.PathLike[str]) → None

Load a model.

Parameters:

path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.

Raises:

FileNotFoundError – If the file does not exist.
EnvironmentError – If the file cannot be read as a language model or the smoothing/order does not match.

class rustling.lm.Lidstone(*, order: int, gamma: float)

Lidstone (additive) smoothing language model.

An n-gram language model with Lidstone smoothing, which adds a constant gamma to all counts.
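For reference, the textbook form of Lidstone smoothing is given below. The symbols are assumptions introduced here: C(c, w) is the count of context c followed by word w, C(c) is the count of the context, and |V| is the vocabulary size. Whether |V| corresponds exactly to the vocab_size property is not stated in this reference.

```latex
P(w \mid c) = \frac{C(c, w) + \gamma}{C(c) + \gamma \, |V|}
```

Setting \gamma = 0 recovers the unsmoothed relative-frequency estimate, and \gamma = 1 gives add-one (Laplace) smoothing.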

fit(sents: Sequence[Sequence[str]]) → None

Train the language model on tokenized sentences.

Each sentence is a list of tokens. The model extracts n-grams of all orders from 1 to the model order and counts their occurrences. Sentences are automatically padded with <s> and </s> tokens.

Parameters:

sents – A sequence of tokenized sentences.

score(word: str, context: Sequence[str] | None = None) → float

Return the probability of a word given a context.

Maps out-of-vocabulary words to <UNK> via the vocabulary.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

The probability P(word | context).

Raises:

ValueError – If the model has not been fitted yet.

unmasked_score(word: str, context: Sequence[str] | None = None) → float

Return the probability of a word given a context, without OOV mapping.

Unlike score, this method does not map out-of-vocabulary words to <UNK>.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

The probability P(word | context).

Raises:

ValueError – If the model has not been fitted yet.

logscore(word: str, context: Sequence[str] | None = None) → float

Return the log (base 2) probability of a word given a context.

Maps out-of-vocabulary words to <UNK> via the vocabulary.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

log2(P(word | context)). Returns negative infinity if the probability is 0.

Raises:

ValueError – If the model has not been fitted yet.

generate(*, num_words: int = 1, text_seed: Sequence[str] | None = None, random_seed: int | None = None) → list[str]

Generate words from the language model.

Uses weighted random sampling from the conditional distribution. Generation stops early if </s> (end-of-sentence) is sampled or if no continuations are available for the current context.

Parameters:

num_words – Number of words to generate.
text_seed – Seed text (context to start from). Defaults to beginning-of-sentence context.
random_seed – Random seed for reproducibility.

Returns:

A list of generated words.

Raises:

ValueError – If the model has not been fitted yet.

property order: int

The order of the n-gram model.

property vocab_size: int

The vocabulary size (including special tokens).

property gamma: float

The smoothing parameter.

save(path: str | os.PathLike[str]) → None

Save the model to a zstd-compressed FlatBuffers binary.

Parameters:

path – The path where the model will be saved. The file extension .fb.zst is recommended.

load(path: str | os.PathLike[str]) → None

Load a model.

Parameters:

path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.

Raises:

FileNotFoundError – If the file does not exist.
EnvironmentError – If the file cannot be read as a language model or the smoothing/order/gamma does not match.

class rustling.lm.Laplace(*, order: int)

Laplace (add-one) smoothing language model.

An n-gram language model with Laplace smoothing (Lidstone with gamma=1).
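The equivalence between Laplace smoothing and Lidstone smoothing with gamma=1 can be checked numerically. The sketch below assumes the textbook additive-smoothing formula, (count + gamma) / (context count + gamma * vocabulary size); the function names are hypothetical and the counts are invented.

```python
def lidstone_prob(ngram_count: int, context_count: int,
                  vocab_size: int, gamma: float) -> float:
    """Additive smoothing: add gamma to every count before normalizing."""
    return (ngram_count + gamma) / (context_count + gamma * vocab_size)


def laplace_prob(ngram_count: int, context_count: int, vocab_size: int) -> float:
    """Add-one smoothing, i.e. Lidstone with gamma fixed to 1."""
    return lidstone_prob(ngram_count, context_count, vocab_size, gamma=1.0)


# A context seen 4 times, with a vocabulary of 6 word types (toy numbers).
print(laplace_prob(2, 4, 6))  # seen twice after the context: (2 + 1) / (4 + 6)
print(laplace_prob(0, 4, 6))  # never seen after the context: (0 + 1) / (4 + 6)
```

Unlike the unsmoothed estimate, an n-gram with a zero count still receives a nonzero probability, so its log score stays finite.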

fit(sents: Sequence[Sequence[str]]) → None

Train the language model on tokenized sentences.

Each sentence is a list of tokens. The model extracts n-grams of all orders from 1 to the model order and counts their occurrences. Sentences are automatically padded with <s> and </s> tokens.

Parameters:

sents – A sequence of tokenized sentences.

score(word: str, context: Sequence[str] | None = None) → float

Return the probability of a word given a context.

Maps out-of-vocabulary words to <UNK> via the vocabulary.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

The probability P(word | context).

Raises:

ValueError – If the model has not been fitted yet.

unmasked_score(word: str, context: Sequence[str] | None = None) → float

Return the probability of a word given a context, without OOV mapping.

Unlike score, this method does not map out-of-vocabulary words to <UNK>.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

The probability P(word | context).

Raises:

ValueError – If the model has not been fitted yet.

logscore(word: str, context: Sequence[str] | None = None) → float

Return the log (base 2) probability of a word given a context.

Maps out-of-vocabulary words to <UNK> via the vocabulary.

Parameters:

word – The word to score.
context – The preceding context words.

Returns:

log2(P(word | context)). Returns negative infinity if the probability is 0.

Raises:

ValueError – If the model has not been fitted yet.

generate(*, num_words: int = 1, text_seed: Sequence[str] | None = None, random_seed: int | None = None) → list[str]

Generate words from the language model.

Uses weighted random sampling from the conditional distribution. Generation stops early if </s> (end-of-sentence) is sampled or if no continuations are available for the current context.

Parameters:

num_words – Number of words to generate.
text_seed – Seed text (context to start from). Defaults to beginning-of-sentence context.
random_seed – Random seed for reproducibility.

Returns:

A list of generated words.

Raises:

ValueError – If the model has not been fitted yet.

property order: int

The order of the n-gram model.

property vocab_size: int

The vocabulary size (including special tokens).

save(path: str | os.PathLike[str]) → None

Save the model to a zstd-compressed FlatBuffers binary.

Parameters:

path – The path where the model will be saved. The file extension .fb.zst is recommended.

load(path: str | os.PathLike[str]) → None

Load a model.

Parameters:

path – The path where the model, stored as a zstd-compressed FlatBuffers binary, is located.

Raises:

FileNotFoundError – If the file does not exist.
EnvironmentError – If the file cannot be read as a language model or the smoothing/order does not match.