Word Frequencies and Ngrams
For word combinatorics, CHAT provides the
word_ngrams() method.
word_ngrams() takes a single required argument, n,
which sets the n-gram order.
It returns a rustling.ngram.Ngrams object, which stores n-grams efficiently and
provides useful methods such as most_common(), to_counter(),
items(), and count().
N-grams do not cross utterance boundaries.
To filter by participant, use filter()
before calling word_ngrams().
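To make the utterance-boundary rule concrete, here is a minimal pure-Python sketch of the same idea. It is not how rustling implements word_ngrams(); the helper name ngrams_within_utterances and the two sample utterances are made up for illustration.

```python
from collections import Counter


def ngrams_within_utterances(utterances, n):
    """Count word n-grams per utterance, never spanning two utterances."""
    counts = Counter()
    for words in utterances:
        # Slide a window of size n over this utterance only.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Two made-up utterances (hypothetical, not taken from the Brown corpus).
utterances = [
    ["more", "juice", "."],
    ["more", "juice", "please", "."],
]

bigrams = ngrams_within_utterances(utterances, 2)
```

Because each utterance is windowed on its own, the bigram ('.', 'more') never appears, even though the first utterance ends with '.' and the second begins with 'more'.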
For illustration, let’s check out some of the word trigrams in Eve’s data from Brown. To make it slightly more interesting, we are going to look at child speech and child-directed speech separately.
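import rustling
# Load the Brown corpus from a local ZIP archive.
brown = rustling.read_chat("path/to/your/local/Brown.zip")
eve = brown.filter(files="Eve")  # keep only Eve's files
eve_chi = eve.filter(participants="CHI")  # child speech
eve_cds = eve.filter(participants="^(?!CHI$)")  # child-directed speech: everyone except CHI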
trigrams_child_speech = eve_chi.word_ngrams(3)
trigrams_child_speech.most_common(10)
# [(('grape', 'juice', '.'), 74),
# (('what', 'that', '?'), 50),
# (('a@l', 'b@l', 'c@l'), 47),
# (('right', 'there', '.'), 45),
# (('another', 'one', '.'), 45),
# (('in', 'there', '.'), 43),
# (('b@l', 'c@l', '.'), 42),
# (('hi', 'Fraser', '.'), 39),
# (('I', 'want', 'some'), 39),
# (('a', 'minute', '.'), 35)]
trigrams_child_directed_speech = eve_cds.word_ngrams(3)
trigrams_child_directed_speech.most_common(10)
# [(("that's", 'right', '.'), 178),
# (('what', 'are', 'you'), 149),
# (('is', 'that', '?'), 124),
# (('what', 'is', 'that'), 99),
# ((',', 'Eve', '.'), 95),
# (('are', 'you', 'doing'), 94),
# (("what's", 'that', '?'), 92),
# (('would', 'you', 'like'), 89),
# (('is', 'it', '?'), 89),
# (('what', 'did', 'you'), 77)]
Even this brief look at word trigrams appears to show a contrast. Requests such as ('I', 'want', 'some') are frequent in child speech, whereas child-directed speech is dominated by confirmations such as ("that's", 'right', '.') and by questions that try to get the child’s attention, such as ('what', 'are', 'you') and ('what', 'is', 'that').
The rustling.ngram.Ngrams object returned by word_ngrams()
has the most_common() method used above.
If you need a collections.Counter instead,
call to_counter() on the rustling.ngram.Ngrams object:
counter = trigrams_child_speech.to_counter()
type(counter)
# <class 'collections.Counter'>
counter.most_common(5)
# [(('grape', 'juice', '.'), 74),
# (('what', 'that', '?'), 50),
# (('a@l', 'b@l', 'c@l'), 47),
# (('right', 'there', '.'), 45),
# (('another', 'one', '.'), 45)]
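Once the counts are in a Counter, ordinary Python tools apply. As a rough sketch, the snippet below drops trigrams that contain a punctuation token. To stay self-contained it rebuilds the Counter from the five counts shown above rather than reading the corpus; in practice counter would come from to_counter(). The PUNCTUATION set is an assumption, not something rustling defines.

```python
from collections import Counter

# Counts copied from the child-speech output above (top five only).
counter = Counter({
    ("grape", "juice", "."): 74,
    ("what", "that", "?"): 50,
    ("a@l", "b@l", "c@l"): 47,
    ("right", "there", "."): 45,
    ("another", "one", "."): 45,
})

# Hypothetical punctuation set; extend it to match your transcripts.
PUNCTUATION = {".", "?", "!", ","}

# Keep only trigrams in which no token is a punctuation mark.
words_only = Counter({
    ngram: count
    for ngram, count in counter.items()
    if not any(token in PUNCTUATION for token in ngram)
})
```

Of these five trigrams, only ('a@l', 'b@l', 'c@l') contains no punctuation token, so it is the only one that survives the filter.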
For word frequencies, use word_ngrams(1) to get unigrams:
word_freq_chi = eve_chi.word_ngrams(1)
word_freq_chi.most_common(5)
# [(('.',), 10389),
# (('?',), 1449),
# (('I',), 1197),
# (('that',), 1047),
# (('a',), 883)]
Note that unigrams are represented as single-element tuples, so all n-grams are tuples regardless of n.
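If plain word strings are more convenient as keys, the single-element tuples can be unpacked with standard Python. The sketch below rebuilds a Counter from the five unigram counts shown above so that it runs on its own; with real data, unigram_counts would instead be the result of calling to_counter() on the unigram object. The variable names are illustrative.

```python
from collections import Counter

# Counts copied from the unigram output above (top five only).
unigram_counts = Counter({
    (".",): 10389,
    ("?",): 1449,
    ("I",): 1197,
    ("that",): 1047,
    ("a",): 883,
})

# Unpack each single-element tuple into the one word it holds.
word_counts = Counter(
    {word: count for (word,), count in unigram_counts.items()}
)
```

After this step, word_counts["I"] looks up a count directly by word instead of by the tuple ("I",).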