Tokenize a set of texts and compute a term frequency matrix.

term_matrix(x, filter = NULL, ngrams = NULL, select = NULL,
            group = NULL, transpose = FALSE, ...)

term_counts(x, filter = NULL, ngrams = NULL, select = NULL,
            group = NULL, ...)

Arguments

x

a text vector to tokenize.

filter

if non-NULL, a text filter to use instead of the default text filter for x.

ngrams

an integer vector of n-gram lengths to include, or NULL to use the select argument to determine the n-gram lengths.

select

a character vector of terms to count, or NULL to count all terms that appear in x.

group

if non-NULL, a factor, character string, or integer vector the same length as x specifying the grouping behavior.

transpose

a logical value indicating whether to transpose the result, putting terms as rows instead of columns.

...

additional properties to set on the text filter.

Details

term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.

If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
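To make "multi-type n-grams" concrete, here is a minimal base R sketch that enumerates the n-grams of a token sequence. The helper ngrams_of is hypothetical and is not part of the package; it assumes tokens are joined with a single space, as the term names in the examples suggest, and it does not reproduce the package's tokenizer.

```r
# Hypothetical helper: all n-grams of length n from a character vector of tokens.
# Illustrative only; this is not the package's implementation.
ngrams_of <- function(tokens, n) {
  if (length(tokens) < n) return(character())
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("a", "rose", "by", "any", "other", "name")

# Bigrams only, analogous to asking for n-gram length 2.
bigrams <- ngrams_of(tokens, 2)

# Unigrams through trigrams, analogous to asking for lengths 1:3.
all_terms <- unlist(lapply(1:3, function(n) ngrams_of(tokens, n)))
```

Requesting lengths 1:3 yields every unigram, bigram, and trigram, which is why the term set grows quickly when longer n-grams are included.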

If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert group to a factor and compute one set of term counts for each level. Texts with NA values for group get skipped.
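The grouping rule can be sketched in base R on a small dense count matrix. This is an illustration of the documented behavior, not the package's implementation: the helper group_counts is hypothetical, and the counts are copied from the select = c("rose", "red", "violet", "sweet") example in the Examples section.

```r
# Per-text counts for four selected terms (values taken from the Examples section).
counts <- matrix(c(3, 0, 0, 0,
                   1, 1, 1, 0,
                   1, 0, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("rose", "red", "violet", "sweet")))

# Hypothetical helper: convert group to a factor, skip texts whose group is NA,
# and sum the remaining rows within each level.
group_counts <- function(counts, group) {
  group <- as.factor(group)
  keep <- !is.na(group)
  rowsum(counts[keep, , drop = FALSE], group[keep])
}

by_group <- group_counts(counts, c("Good", "Bad", "Good"))

# The second text has an NA group, so it contributes to no row.
with_na <- group_counts(counts, c("Good", NA, "Good"))
```

Here by_group has one row per level ("Bad", then "Good"), and the "Good" row sums the first and third texts, giving a count of 4 for "rose".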

Value

term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in arbitrary order.

term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.

term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with levels equal to the selected terms. The "text" column is a factor with levels equal to names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.

term_counts with group non-NULL behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
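The relationship between the data frame and the matrix can be sketched in base R. The data frame below is built by hand to mimic the shape of a term_counts result (values taken from the last example in the Examples section); the factor levels are written out explicitly as an assumption, and a dense matrix stands in for the sparse one.

```r
# Hand-built stand-in for a term_counts() result; levels are assumed for illustration.
counts <- data.frame(
  text  = factor(c("c", "a", "b", "a", "a"), levels = c("a", "b", "c")),
  term  = factor(c("!", ".", "another", "one", "sentence"),
                 levels = c("!", ".", "another", "one", "sentence")),
  count = c(2, 1, 1, 1, 1)
)

# as.integer() on each factor gives the row and column index of every entry.
rows <- as.integer(counts$text)
cols <- as.integer(counts$term)

# Fill a dense texts-by-terms matrix from the (row, column, count) triplets.
m <- matrix(0, nrow = nlevels(counts$text), ncol = nlevels(counts$term),
            dimnames = list(levels(counts$text), levels(counts$term)))
m[cbind(rows, cols)] <- counts$count
```

Each data frame row becomes one nonzero entry of m, so m["c", "!"] is 2 and every entry not listed in the data frame stays 0.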

See also

text_tokens, term_stats.

Examples

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)
#> 3 x 17 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 17 column names ‘!’, ‘,’, ‘.’ ... ]]
#>
#> [1,] . . 1 3 . . . . 2 . . . 3 . . . .
#> [2,] 1 1 . 2 . . 1 . 2 . . 1 1 . . 1 .
#> [3,] . . 1 1 1 1 . 1 . 1 1 . 1 1 1 . 1

# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))
#> 3 x 4 sparse Matrix of class "dgCMatrix"
#>      rose red violet sweet
#> [1,]    3   .      .     .
#> [2,]    1   1      1     .
#> [3,]    1   .      .     1

# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))
#> 2 x 17 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 17 column names ‘!’, ‘,’, ‘.’ ... ]]
#>
#> Bad  1 1 . 2 . . 1 . 2 . . 1 1 . . 1 .
#> Good . . 2 4 1 1 . 1 2 1 1 . 4 1 1 . 1

# include higher-order n-grams
term_matrix(text, ngrams = 1:3)
#> 3 x 57 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 57 column names ‘!’, ‘,’, ‘, a’ ... ]]
#>
#> [1,] . . . . 1 3 3 1 . 2 . . . . . . . . . . . . . 2 2 2 . . . . . . . . . . .
#> [2,] 1 1 1 1 . 2 1 . . 1 1 1 . . . . . . 1 1 . . . 2 . . 1 1 1 1 . . . . . . 1
#> [3,] . . . . 1 1 1 . 1 . . . 1 1 1 1 1 1 . . 1 1 1 . . . . . . . 1 1 1 1 1 1 .
#>
#> [1,] . . 3 1 . . 2 2 . . . . . . . . . . . .
#> [2,] 1 1 1 . . . 1 . 1 . . . . . 1 1 1 . . .
#> [3,] . . 1 . 1 1 . . . 1 1 1 1 1 . . . 1 1 1

# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))
#> 3 x 4 sparse Matrix of class "dgCMatrix"
#>      a rose a violet sweet smell
#> [1,]      3        .     .     .
#> [2,]      1        1     .     .
#> [3,]      1        .     1     1

# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows
#> 10 x 3 sparse Matrix of class "dgCMatrix"
#>
#> !         . 1 .
#> ,         . 1 .
#> , a       . 1 .
#> .         1 . 1
#> a         3 2 1
#> a rose    3 1 1
#> a violet  . 1 .
#> any       . . 1
#> any other . . 1
#> as        . . 1

# data frame
head(term_counts(text), n = 10) # first 10 rows
#>    text term count
#> 1     2    !     1
#> 2     2    ,     1
#> 3     1    .     1
#> 4     3    .     1
#> 5     1    a     3
#> 6     2    a     2
#> 7     3    a     1
#> 8     3  any     1
#> 9     3   as     1
#> 10    2 blue     1

# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))
#>    group   term count
#> 1    Bad      !     1
#> 2    Bad      ,     1
#> 3   Good      .     2
#> 4    Bad      a     2
#> 5   Good      a     4
#> 6   Good    any     1
#> 7   Good     as     1
#> 8    Bad   blue     1
#> 9   Good     by     1
#> 10   Bad     is     2
#> 11  Good     is     2
#> 12  Good   name     1
#> 13  Good  other     1
#> 14   Bad    red     1
#> 15   Bad   rose     1
#> 16  Good   rose     4
#> 17  Good  smell     1
#> 18  Good  sweet     1
#> 19   Bad violet     1
#> 20  Good  would     1

# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))
#>   text     term count
#> 1    c        !     2
#> 2    a        .     1
#> 3    b  another     1
#> 4    a      one     1
#> 5    a sentence     1