Tokenize a set of texts and tabulate the term occurrence statistics.
term_stats(x, filter = NULL, ngrams = NULL, min_count = NULL, max_count = NULL, min_support = NULL, max_support = NULL, types = FALSE, subset, ...)
Argument | Description |
---|---|
x | a text vector to tokenize. |
filter | if non-`NULL`, a text filter to use instead of the default text filter for `x`. |
ngrams | an integer vector of n-gram lengths to include, or `NULL` for length-1 n-grams only. |
min_count | a numeric scalar giving the minimum term count to include in the output, or `NULL` for no minimum. |
max_count | a numeric scalar giving the maximum term count to include in the output, or `NULL` for no maximum. |
min_support | a numeric scalar giving the minimum term support to include in the output, or `NULL` for no minimum. |
max_support | a numeric scalar giving the maximum term support to include in the output, or `NULL` for no maximum. |
types | a logical value indicating whether to include columns for the types that make up the terms. |
subset | logical expression indicating elements or rows to keep: missing values are taken as false. |
… | additional properties to set on the text filter. |
`term_stats` tokenizes a set of texts and computes the occurrence counts and supports for each term. The ‘count’ is the number of occurrences of the term across all texts; the ‘support’ is the number of texts containing the term. Each appearance of a term increments its count by one. Likewise, an appearance of a term in text `i` increments its support once, not for each occurrence in the text.
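The difference between count and support can be illustrated with a small base R sketch. This is not how `term_stats` is implemented; the pre-tokenized `texts` list and the variable names are assumptions made only for illustration.

```r
# Hypothetical, already-tokenized texts (term_stats does its own tokenization).
texts <- list(
  c("a", "rose", "is", "a", "rose"),
  c("a", "daisy")
)

# count: every occurrence across all texts adds one
count <- table(unlist(texts))

# support: each text contributes at most one per term, so deduplicate per text
support <- table(unlist(lapply(texts, unique)))

stats <- data.frame(
  term = names(count),
  count = as.integer(count),
  support = as.integer(support[names(count)]),
  stringsAsFactors = FALSE
)
print(stats)
```

Here "a" occurs three times but in only two texts, so its count is 3 and its support is 2.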
To include multi-type terms, specify the desired term lengths using the `ngrams` argument.
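As a rough sketch of what including several n-gram lengths means, the base R code below enumerates all contiguous n-grams of lengths 1 to 3 from a hand-written token vector. The helper `ngrams_of` and the token vector are hypothetical and only approximate what the package does internally.

```r
# Hypothetical helper: all contiguous n-grams of the requested lengths.
ngrams_of <- function(tokens, lengths) {
  out <- character(0)
  for (n in lengths) {
    if (n > length(tokens)) next
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Assumed tokenization of "A rose is a rose is a rose." after case folding.
tokens <- c("a", "rose", "is", "a", "rose", "is", "a", "rose", ".")
counts <- table(ngrams_of(tokens, 1:3))
print(counts)
```

With these tokens the sketch yields 12 distinct terms, for example "a rose" with count 3 and "a rose is" with count 2.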
A data frame with columns named `term`, `count`, and `support`, with one row for each appearing term. Rows are sorted in descending order according to `support` and then `count`, with ties broken lexicographically by `term`, using the character ordering determined by the current locale (see `Comparison` for details).
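The row ordering described above can be reproduced on an ordinary data frame with base R's `order`. The `stats` data frame below is made-up illustrative input, and this is only a sketch of the ordering rule, not the package's own code.

```r
# Illustrative input in arbitrary order.
stats <- data.frame(
  term = c("is", "rose", ".", "a"),
  count = c(2L, 3L, 1L, 3L),
  support = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Descending support, then descending count, then term as the tie-breaker.
# With a character key, order() compares strings using the locale's collation.
ord <- order(-stats$support, -stats$count, stats$term)
sorted <- stats[ord, ]
print(sorted)
```

Both "a" and "rose" have count 3 and support 1, so the tie is broken by term and "a" comes first.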
If `types = TRUE`, then the result also includes columns named `type1`, `type2`, etc. for the types that make up the term.
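The shape of the extra type columns can be sketched in base R by splitting each term and padding shorter terms with `NA`. Splitting on a single space is a simplifying assumption made for this illustration; the package tracks the component types itself.

```r
# Illustrative terms of different lengths.
terms <- c("a", "a rose", "a rose is")

# Assumption: the types of a term are its space-separated pieces.
parts <- strsplit(terms, " ", fixed = TRUE)
width <- max(lengths(parts))

# Indexing past the end of a vector yields NA, which pads shorter terms.
types <- do.call(rbind, lapply(parts, function(p) p[seq_len(width)]))
colnames(types) <- paste0("type", seq_len(width))

result <- data.frame(term = terms, types, stringsAsFactors = FALSE)
print(result)
```

A unigram fills only `type1`; the remaining type columns are `NA`.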
    term_stats("A rose is a rose is a rose.")
    #>   term count support
    #> 1 a        3       1
    #> 2 rose     3       1
    #> 3 is       2       1
    #> 4 .        1       1

    # drop symbols and English stop words (note that the "." token remains)
    term_stats("A rose is a rose is a rose.",
               text_filter(drop_symbol = TRUE, drop = stopwords_en))
    #>   term count support
    #> 1 rose     3       1
    #> 2 .        1       1

    # unigrams, bigrams, and trigrams
    term_stats("A rose is a rose is a rose.", ngrams = 1:3)
    #>    term      count support
    #> 1  a             3       1
    #> 2  a rose        3       1
    #> 3  rose          3       1
    #> 4  a rose is     2       1
    #> 5  is            2       1
    #> 6  is a          2       1
    #> 7  is a rose     2       1
    #> 8  rose is       2       1
    #> 9  rose is a     2       1
    #> 10 .             1       1
    #> 11 a rose .      1       1
    #> 12 rose .        1       1

    # also include the type information
    term_stats("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)
    #>    term      type1 type2 type3 count support
    #> 1  a         a     <NA>  <NA>      3       1
    #> 2  a rose    a     rose  <NA>      3       1
    #> 3  rose      rose  <NA>  <NA>      3       1
    #> 4  a rose is a     rose  is        2       1
    #> 5  is        is    <NA>  <NA>      2       1
    #> 6  is a      is    a     <NA>      2       1
    #> 7  is a rose is    a     rose      2       1
    #> 8  rose is   rose  is    <NA>      2       1
    #> 9  rose is a rose  is    a         2       1
    #> 10 .         .     <NA>  <NA>      1       1
    #> 11 a rose .  a     rose  .         1       1
    #> 12 rose .    rose  .     <NA>      1       1