Term Statistics

Tokenize a set of texts and tabulate the term occurrence statistics.

term_stats(x, filter = NULL, ngrams = NULL,
           min_count = NULL, max_count = NULL,
           min_support = NULL, max_support = NULL, types = FALSE,
           subset, ...)

Arguments

x	a text vector to tokenize.
filter	if non-`NULL`, a text filter to use instead of the default text filter for `x`.
ngrams	an integer vector of n-gram lengths to include, or `NULL` for length-1 n-grams only.
min_count	a numeric scalar giving the minimum term count to include in the output, or `NULL` for no minimum count.
max_count	a numeric scalar giving the maximum term count to include in the output, or `NULL` for no maximum count.
min_support	a numeric scalar giving the minimum term support to include in the output, or `NULL` for no minimum support.
max_support	a numeric scalar giving the maximum term support to include in the output, or `NULL` for no maximum support.
types	a logical value indicating whether to include columns for the types that make up the terms.
subset	logical expression indicating elements or rows to keep: missing values are taken as false.
…	additional properties to set on the text filter.
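
The count and support thresholds can be pictured as row filters on the finished table. The following is a minimal base-R sketch, not the package's implementation: the helper name `filter_stats` is hypothetical, and it assumes the bounds are inclusive and that `NULL` means "no bound", as the argument descriptions above suggest.

```r
# Hypothetical helper: applies min_*/max_* bounds to a precomputed table.
# Assumes inclusive bounds; NULL disables the corresponding bound.
filter_stats <- function(stats, min_count = NULL, max_count = NULL,
                         min_support = NULL, max_support = NULL) {
  keep <- rep(TRUE, nrow(stats))
  if (!is.null(min_count))   keep <- keep & stats$count >= min_count
  if (!is.null(max_count))   keep <- keep & stats$count <= max_count
  if (!is.null(min_support)) keep <- keep & stats$support >= min_support
  if (!is.null(max_support)) keep <- keep & stats$support <= max_support
  stats[keep, , drop = FALSE]
}

# Counts and supports copied from the first example on this page.
stats <- data.frame(term = c("a", "rose", "is", "."),
                    count = c(3L, 3L, 2L, 1L),
                    support = c(1L, 1L, 1L, 1L),
                    stringsAsFactors = FALSE)

kept <- filter_stats(stats, min_count = 2)
kept$term  # "a" "rose" "is"
```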

Details

term_stats tokenizes a set of texts and computes the occurrence counts and supports for each term. The ‘count’ is the number of occurrences of the term across all texts; the ‘support’ is the number of texts containing the term. Each appearance of a term increments its count by one. Likewise, an appearance of a term in text i increments its support once, not for each occurrence in the text.
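
The difference between count and support can be illustrated with a minimal base-R sketch. It assumes a naive tokenizer (lowercase, split on whitespace), which is not the text filter that term_stats uses, and the helper name `naive_term_stats` is hypothetical.

```r
# Hypothetical helper: a naive stand-in for term_stats, for illustration only.
naive_term_stats <- function(texts) {
  # Lowercase and split on whitespace (an assumption; not the corpus tokenizer).
  tokens <- strsplit(tolower(texts), "[[:space:]]+")
  tokens <- lapply(tokens, function(tok) tok[nzchar(tok)])

  # count: every occurrence across all texts adds one.
  count <- table(unlist(tokens))

  # support: each text contributes at most one per term.
  support <- table(unlist(lapply(tokens, unique)))

  terms <- names(count)
  data.frame(term = terms,
             count = as.integer(count),
             support = as.integer(support[terms]),
             stringsAsFactors = FALSE)
}

stats <- naive_term_stats(c("a rose is a rose", "a daisy"))
stats[stats$term == "a", ]     # count 3, support 2
stats[stats$term == "rose", ]  # count 2, support 1
```

Here "rose" occurs twice but only in the first text, so its count is 2 while its support is 1.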

To include multi-type terms, specify the desired term lengths using the ngrams argument.
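
As a rough picture of what multi-type terms are, the base-R sketch below enumerates every contiguous run of n types for each requested length. The helper name `naive_ngrams` is hypothetical and the token vector is typed in by hand; this is not how the package builds its terms internally.

```r
# Hypothetical helper: all contiguous n-grams of the requested lengths.
naive_ngrams <- function(tokens, ngrams = 1L) {
  out <- character(0)
  for (n in ngrams) {
    if (n > length(tokens)) next  # no n-gram of this length fits
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# Types of "A rose is a rose is a rose." written out by hand (an assumption).
tokens <- c("a", "rose", "is", "a", "rose", "is", "a", "rose", ".")
grams <- naive_ngrams(tokens, ngrams = 1:3)
counts <- table(grams)
counts[["a rose"]]     # 3
counts[["is a rose"]]  # 2
```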

Value

A data frame with columns named term, count, and support, with one row for each appearing term. Rows are sorted in descending order according to support and then count, with ties broken lexicographically by term, using the character ordering determined by the current locale (see Comparison for details).
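
The row ordering described above can be reproduced on an ordinary data frame with `order()`. This is only a sketch of the sorting rule applied to hand-entered values; base R's `order()` also compares character vectors using the current locale's collation.

```r
# Hand-entered counts and supports, deliberately out of order.
stats <- data.frame(term = c("is", "rose", ".", "a"),
                    count = c(2L, 3L, 1L, 3L),
                    support = c(1L, 1L, 1L, 1L),
                    stringsAsFactors = FALSE)

# Descending support, then descending count, then term as the tie-breaker.
sorted <- stats[order(-stats$support, -stats$count, stats$term), ]
sorted$term  # "a" "rose" "is" "."
```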

If types = TRUE, then the result also includes columns named type1, type2, etc. for the types that make up the term.

Examples

term_stats("A rose is a rose is a rose.")
#>   term count support
#> 1 a        3       1
#> 2 rose     3       1
#> 3 is       2       1
#> 4 .        1       1

# drop symbols and English stop words (the period, a punctuation token, is kept)
term_stats("A rose is a rose is a rose.",
           text_filter(drop_symbol = TRUE, drop = stopwords_en))
#>   term count support
#> 1 rose     3       1
#> 2 .        1       1

# unigrams, bigrams, and trigrams
term_stats("A rose is a rose is a rose.", ngrams = 1:3)
#>    term      count support
#> 1  a             3       1
#> 2  a rose        3       1
#> 3  rose          3       1
#> 4  a rose is     2       1
#> 5  is            2       1
#> 6  is a          2       1
#> 7  is a rose     2       1
#> 8  rose is       2       1
#> 9  rose is a     2       1
#> 10 .             1       1
#> 11 a rose .      1       1
#> 12 rose .        1       1

# also include the type information
term_stats("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)
#>    term      type1 type2 type3 count support
#> 1  a         a     <NA>  <NA>      3       1
#> 2  a rose    a     rose  <NA>      3       1
#> 3  rose      rose  <NA>  <NA>      3       1
#> 4  a rose is a     rose  is        2       1
#> 5  is        is    <NA>  <NA>      2       1
#> 6  is a      is    a     <NA>      2       1
#> 7  is a rose is    a     rose      2       1
#> 8  rose is   rose  is    <NA>      2       1
#> 9  rose is a rose  is    a         2       1
#> 10 .         .     <NA>  <NA>      1       1
#> 11 a rose .  a     rose  .         1       1
#> 12 rose .    rose  .     <NA>      1       1
