N-grams
Rustling provides an efficient n-gram counter that extracts n-grams from token sequences and counts how often each one occurs.
Basic Usage
The Ngrams class counts n-grams from sequences of strings. The n argument sets the n-gram order, so n=2 counts bigrams.
from rustling.ngram import Ngrams
ng = Ngrams(n=2)
ng.count(["the", "cat", "sat"])
ng.count(["the", "dog", "ran"])
print(ng[("the", "cat")]) # 1
print(ng[("the", "dog")]) # 1
# Most common bigrams
print(ng.most_common(2))
# [(('the', 'cat'), 1), (('the', 'dog'), 1)]
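To make the counts above concrete, here is a minimal pure-Python sketch of sliding-window n-gram counting. It is not Rustling's implementation: `count_ngrams` is a hypothetical helper built on the standard library's `collections.Counter`, shown only to illustrate what is being counted.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Hypothetical helper: count every contiguous run of n tokens."""
    # A sequence of length L has L - n + 1 windows (none if L < n).
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(windows)


bigrams = count_ngrams(["the", "cat", "sat"], 2)
bigrams.update(count_ngrams(["the", "dog", "ran"], 2))

print(bigrams[("the", "cat")])  # 1
print(bigrams[("the", "dog")])  # 1
```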
Counting from Multiple Sequences
Use count_seqs() to count n-grams from multiple sequences at once.
from rustling.ngram import Ngrams
ng = Ngrams(n=2)
ng.count_seqs([
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
])
print(ng[("the", "cat")]) # 2
print(ng.total()) # 6
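The total of 6 can be checked by hand. The example has only five distinct bigrams, so the figure is consistent with `total()` summing occurrences rather than counting distinct n-grams. A sequence of length L contributes L - n + 1 n-grams, which the sketch below computes with a hypothetical helper, `expected_total`, that does not depend on Rustling.

```python
def expected_total(seqs, n):
    """Hypothetical helper: number of n-gram occurrences across all sequences."""
    # Sequences shorter than n contribute nothing.
    return sum(max(len(seq) - n + 1, 0) for seq in seqs)


seqs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

print(expected_total(seqs, 2))  # 6  (3 sequences x 2 bigrams each)
```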
Mixed Orders
Set min_n to collect n-grams of every order from min_n through n at once.
from rustling.ngram import Ngrams
ng = Ngrams(n=3, min_n=1)
ng.count(["a", "b", "c"])
# Unigrams, bigrams, and trigrams are all counted
print(ng.most_common(order=1)) # unigrams
print(ng.most_common(order=2)) # bigrams
print(ng.most_common(order=3)) # trigrams
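As a rough illustration of what mixed-order counting produces for this input, the following pure-Python sketch keeps one `collections.Counter` per order. The helper `count_orders` is hypothetical, and representing unigrams as 1-tuples is an assumption of this sketch, not a statement about how Rustling stores them.

```python
from collections import Counter


def count_orders(tokens, min_n, max_n):
    """Hypothetical helper: one Counter per order from min_n through max_n."""
    by_order = {}
    for n in range(min_n, max_n + 1):
        by_order[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return by_order


by_order = count_orders(["a", "b", "c"], min_n=1, max_n=3)

print(sorted(by_order[1]))  # [('a',), ('b',), ('c',)]
print(sorted(by_order[2]))  # [('a', 'b'), ('b', 'c')]
print(sorted(by_order[3]))  # [('a', 'b', 'c')]
```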
Converting to Counter
Use to_counter() to get a standard collections.Counter.
from rustling.ngram import Ngrams
ng = Ngrams(n=2)
ng.count_seqs([
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
])
counter = ng.to_counter()
print(counter)
# Counter({('the', 'cat'): 1, ('cat', 'sat'): 1, ('the', 'dog'): 1, ('dog', 'ran'): 1})
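Once the counts are in a `collections.Counter`, the usual standard-library operations apply. The sketch below writes out the Counter printed above as a literal so that it runs without Rustling. It uses only documented `Counter` and `dict` behavior.

```python
from collections import Counter

# The Counter shown above, written out literally for illustration.
counter = Counter({
    ("the", "cat"): 1,
    ("cat", "sat"): 1,
    ("the", "dog"): 1,
    ("dog", "ran"): 1,
})

# Total number of bigram occurrences.
print(sum(counter.values()))  # 4

# Bigrams that start with "the".
starts_with_the = {gram: c for gram, c in counter.items() if gram[0] == "the"}
print(starts_with_the)  # {('the', 'cat'): 1, ('the', 'dog'): 1}

# A Counter reports 0 for missing keys instead of raising KeyError.
print(counter[("cat", "ran")])  # 0
```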
Combining Counters
Ngrams objects can be combined with + or +=.
from rustling.ngram import Ngrams
ng1 = Ngrams(n=2)
ng1.count(["the", "cat", "sat"])
ng2 = Ngrams(n=2)
ng2.count(["the", "dog", "ran"])
combined = ng1 + ng2
print(combined.total()) # 4
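The total of 4 is consistent with `+` summing the counts of each n-gram, which is how `collections.Counter` addition behaves. The following standard-library sketch shows that analogy. It assumes Ngrams follows the same semantics and says nothing about whether Rustling's `+=` modifies the object in place.

```python
from collections import Counter

c1 = Counter({("the", "cat"): 1, ("cat", "sat"): 1})
c2 = Counter({("the", "dog"): 1, ("dog", "ran"): 1})

# + builds a new Counter and leaves both operands unchanged.
combined = c1 + c2
print(sum(combined.values()))  # 4

# += merges the right-hand counts into the left-hand Counter.
c1 += c2
print(c1 == combined)  # True
```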