Language Models

Rustling provides n-gram language models: an unsmoothed MLE model, plus Lidstone and Laplace models that apply additive smoothing.

MLE (Maximum Likelihood Estimation)

The MLE model estimates probabilities from raw counts with no smoothing, so any n-gram not seen in training gets a probability of zero.

from rustling.lm import MLE

model = MLE(order=2)
model.fit([
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
])

# Score a word given context
print(model.score("cat", ["the"]))   # 0.5
print(model.score("dog", ["the"]))   # 0.5

# Log probability (base 2)
print(model.logscore("cat", ["the"]))  # -1.0
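The scores above can be reproduced by hand from bigram counts. The following is a minimal pure-Python sketch of the relative-frequency estimate, not Rustling's implementation: the helper names `bigram_counts` and `mle_score` are hypothetical, and sentence padding is ignored for simplicity.

```python
import math
from collections import Counter, defaultdict

def bigram_counts(sentences):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, word in zip(sentence, sentence[1:]):
            counts[prev][word] += 1
    return counts

def mle_score(counts, word, context):
    """Return count(context, word) / count(context), or 0.0 for an unseen context."""
    following = counts.get(context)
    if not following:
        return 0.0
    return following[word] / sum(following.values())

counts = bigram_counts([
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
])

p = mle_score(counts, "cat", "the")
print(p)                                 # 0.5
print(math.log2(p))                      # -1.0
print(mle_score(counts, "bird", "the"))  # 0.0
```

The last line shows the main weakness of the unsmoothed estimate: a word never observed after "the" receives a probability of exactly zero.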

Lidstone Smoothing

The Lidstone model adds a constant gamma to every count before normalizing, so that unseen n-grams receive non-zero probabilities.

from rustling.lm import Lidstone

model = Lidstone(order=2, gamma=0.1)
model.fit([
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
])

# Unseen n-grams get non-zero probability
print(model.score("bird", ["the"]))  # > 0
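Additive smoothing is usually written as P(word | context) = (count(context, word) + gamma) / (count(context) + gamma * V), where V is the vocabulary size. The sketch below applies that formula directly. It is an illustration under stated assumptions rather than Rustling's code: `lidstone_score` is a hypothetical helper, and the vocabulary size of 6 is assumed (the five training words plus one slot for unknown words). Rustling's own handling of padding and unknown words may give different values.

```python
def lidstone_score(count_ngram, count_context, vocab_size, gamma):
    """Additive smoothing: (count + gamma) / (context count + gamma * V)."""
    return (count_ngram + gamma) / (count_context + gamma * vocab_size)

# Counts taken from the two training sentences: "the" occurs twice as a
# context, followed once by "cat" and once by "dog".
VOCAB_SIZE = 6  # assumption: five training words plus one unknown-word slot
GAMMA = 0.1

seen = lidstone_score(1, 2, VOCAB_SIZE, GAMMA)    # "cat" after "the"
unseen = lidstone_score(0, 2, VOCAB_SIZE, GAMMA)  # "bird" after "the"
print(seen, unseen)
```

Under these assumptions the unseen word gets a small positive probability, and the seen words give up a little mass to pay for it, so "cat" scores slightly below the 0.5 it had under MLE.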

Laplace Smoothing

The Laplace model is Lidstone smoothing with gamma=1, also known as add-one smoothing.

from rustling.lm import Laplace

model = Laplace(order=2)
model.fit([
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
])

print(model.score("cat", ["the"]))
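As a worked example of add-one smoothing, the sketch below computes exact fractions for every word following "the". It assumes a vocabulary of only the five training words, and `laplace_score` is a hypothetical helper. The value Rustling prints above depends on how the library builds its vocabulary, so it may not match.

```python
from fractions import Fraction

def laplace_score(count_ngram, count_context, vocab_size):
    """Add-one smoothing: Lidstone with gamma fixed at 1."""
    return Fraction(count_ngram + 1, count_context + vocab_size)

VOCAB = ["the", "cat", "sat", "dog", "ran"]  # assumption: training words only
after_the = {"cat": 1, "dog": 1}             # bigram counts for context "the"
context_total = sum(after_the.values())      # 2

scores = {
    word: laplace_score(after_the.get(word, 0), context_total, len(VOCAB))
    for word in VOCAB
}
print(scores["cat"])         # 2/7
print(scores["sat"])         # 1/7
print(sum(scores.values()))  # 1
```

The scores still sum to one over the vocabulary, which is what makes the smoothed values a valid probability distribution.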

Text Generation

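All models support text generation via weighted random sampling: each next word is drawn at random, with more probable continuations of the current context chosen more often.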

from rustling.lm import MLE

model = MLE(order=2)
model.fit([
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ran", "to", "the", "park"],
])

# Generate words with a random seed for reproducibility
words = model.generate(num_words=5, random_seed=42)
print(words)

# Generate with a text seed (starting context)
words = model.generate(num_words=3, text_seed=["the"], random_seed=42)
print(words)
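Weighted random sampling from a bigram model can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not Rustling's implementation: the standalone `generate` function and its `start` and `seed` parameters are hypothetical and only loosely mirror `text_seed` and `random_seed` above.

```python
import random
from collections import Counter, defaultdict

def generate(sentences, num_words, start, seed):
    """Sample up to num_words words, each drawn in proportion to its bigram count."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, word in zip(sentence, sentence[1:]):
            counts[prev][word] += 1

    rng = random.Random(seed)  # a private generator keeps runs reproducible
    context, output = start, []
    for _ in range(num_words):
        following = counts.get(context)
        if not following:  # no observed continuation, so stop early
            break
        words = list(following)
        weights = list(following.values())
        context = rng.choices(words, weights=weights, k=1)[0]
        output.append(context)
    return output

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ran", "to", "the", "park"],
]
print(generate(corpus, num_words=3, start="the", seed=42))
```

Fixing the seed makes repeated calls return the same sequence. In this sketch, generation ends early when the sampled word never appeared as a context in training, such as a sentence-final word.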