Tokenize a set of texts and compute a term frequency matrix.
term_matrix(x, filter = NULL, ngrams = NULL, select = NULL,
            group = NULL, transpose = FALSE, ...)

term_counts(x, filter = NULL, ngrams = NULL, select = NULL,
            group = NULL, ...)
Argument | Description |
---|---|
x | a text vector to tokenize. |
filter | if non-NULL … |
ngrams | an integer vector of n-gram lengths to include, or NULL … |
select | a character vector of terms to count, or NULL … |
group | if non-NULL … |
transpose | a logical value indicating whether to transpose the result, putting terms as rows instead of columns. |
… | additional properties to set on the text filter. |
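
To illustrate the … argument, here is a minimal sketch of passing a text filter property directly in the call. It assumes the corpus package is installed and that drop_punct is an available filter property; that property name does not appear on this page and is an assumption here.

```r
library(corpus)  # assumption: the corpus package is installed

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!")

# Default text filter: punctuation tokens are counted as terms
m_default <- term_matrix(text)

# Assumed filter property (not documented on this page):
# drop_punct = TRUE is expected to exclude punctuation terms
m_nopunct <- term_matrix(text, drop_punct = TRUE)

colnames(m_default)
colnames(m_nopunct)
```

Comparing the two sets of column names is a quick way to check what a given filter property actually does before relying on it.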
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
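
The three cases above can be sketched as follows, assuming the corpus package is installed. The variable names are illustrative only.

```r
library(corpus)  # assumption: the corpus package is installed

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")

# ngrams non-NULL: only the requested lengths are included (here, bigrams)
bigrams <- colnames(term_matrix(text, ngrams = 2))

# ngrams NULL, select non-NULL: the n-grams listed in select, of any length
mixed <- colnames(term_matrix(text, select = c("a rose", "sweet")))

# both NULL: unigrams only
unigrams <- colnames(term_matrix(text))

"a rose" %in% bigrams
"a rose" %in% unigrams
mixed
```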
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert group to a factor and compute one set of term counts for each level. Texts with NA values for group get skipped.
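
A small sketch of the NA behavior described above, assuming the corpus package is installed. The second text has no group, so its terms should not contribute to any row.

```r
library(corpus)  # assumption: the corpus package is installed

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")

# The second text has an NA group and is expected to be skipped
grp <- c("Good", NA, "Good")
m <- term_matrix(text, group = grp)

rownames(m)        # one row per (non-NA) level of the grouping factor
m["Good", "rose"]  # counts pooled over the first and third texts
```

To count the ungrouped texts as well, one option is to replace the NA values with an explicit label before calling the function.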
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If select is non-NULL, then the column names will be equal to select. Otherwise, the columns are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
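
As a quick check of the relationship between the two orientations, the sketch below compares the transposed result against t() applied to the default result. It assumes the corpus and Matrix packages are installed. Rows are matched by term name so the comparison does not depend on term order.

```r
library(corpus)  # assumption: the corpus package is installed
library(Matrix)  # provides t() and subsetting methods for sparse matrices

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")

m  <- term_matrix(text)                    # texts-by-terms
mt <- term_matrix(text, transpose = TRUE)  # terms-by-texts

dim(m)
dim(mt)

# Align rows of mt with the columns of m by name, then compare entries
same <- all(as.matrix(t(m)) == as.matrix(mt[colnames(m), , drop = FALSE]))
same
```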
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with levels equal to the selected terms. The "text" column is a factor with levels equal to names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.
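
Because both "text" and "term" are factors, the data frame can be treated as a list of (row, column, value) triplets. The sketch below rebuilds a sparse matrix from it. It assumes the corpus and Matrix packages are installed and uses the factor level order for the columns, which need not match the column order term_matrix itself would use.

```r
library(corpus)  # assumption: the corpus package is installed
library(Matrix)  # provides sparseMatrix()

text <- c(a = "One sentence.", b = "Another", c = "!!")
tc <- term_counts(text)

# as.integer() on the factor columns recovers integer row/column indices
m <- sparseMatrix(i = as.integer(tc$text),
                  j = as.integer(tc$term),
                  x = as.numeric(tc$count),
                  dims = c(nlevels(tc$text), nlevels(tc$term)),
                  dimnames = list(levels(tc$text), levels(tc$term)))

m["c", "!"]    # count of "!" in the text named "c"
m["a", "one"]  # count of "one" in the text named "a"
```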
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)
#> 3 x 17 sparse Matrix of class "dgCMatrix"
#>
#>
#> [1,] . . 1 3 . . . . 2 . . . 3 . . . .
#> [2,] 1 1 . 2 . . 1 . 2 . . 1 1 . . 1 .
#> [3,] . . 1 1 1 1 . 1 . 1 1 . 1 1 1 . 1

# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))
#> 3 x 4 sparse Matrix of class "dgCMatrix"
#>      rose red violet sweet
#> [1,]    3   .      .     .
#> [2,]    1   1      1     .
#> [3,]    1   .      .     1

# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))
#> 2 x 17 sparse Matrix of class "dgCMatrix"
#>
#>
#> Bad  1 1 . 2 . . 1 . 2 . . 1 1 . . 1 .
#> Good . . 2 4 1 1 . 1 2 1 1 . 4 1 1 . 1

# include higher-order n-grams
term_matrix(text, ngrams = 1:3)
#> 3 x 57 sparse Matrix of class "dgCMatrix"
#>
#>
#> [1,] . . . . 1 3 3 1 . 2 . . . . . . . . . . . . . 2 2 2 . . . . . . . . . . .
#> [2,] 1 1 1 1 . 2 1 . . 1 1 1 . . . . . . 1 1 . . . 2 . . 1 1 1 1 . . . . . . 1
#> [3,] . . . . 1 1 1 . 1 . . . 1 1 1 1 1 1 . . 1 1 1 . . . . . . . 1 1 1 1 1 1 .
#>
#> [1,] . . 3 1 . . 2 2 . . . . . . . . . . . .
#> [2,] 1 1 1 . . . 1 . 1 . . . . . 1 1 1 . . .
#> [3,] . . 1 . 1 1 . . . 1 1 1 1 1 . . . 1 1 1

# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))
#> 3 x 4 sparse Matrix of class "dgCMatrix"
#>      a rose a violet sweet smell
#> [1,]      3        .     .     .
#> [2,]      1        1     .     .
#> [3,]      1        .     1     1

# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows
#> 10 x 3 sparse Matrix of class "dgCMatrix"
#>
#> !         . 1 .
#> ,         . 1 .
#> , a       . 1 .
#> .         1 . 1
#> a         3 2 1
#> a rose    3 1 1
#> a violet  . 1 .
#> any       . . 1
#> any other . . 1
#> as        . . 1

# data frame
head(term_counts(text), n = 10) # first 10 rows
#>    text term count
#> 1     2    !     1
#> 2     2    ,     1
#> 3     1    .     1
#> 4     3    .     1
#> 5     1    a     3
#> 6     2    a     2
#> 7     3    a     1
#> 8     3  any     1
#> 9     3   as     1
#> 10    2 blue     1

# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))
#>    group   term count
#> 1    Bad      !     1
#> 2    Bad      ,     1
#> 3   Good      .     2
#> 4    Bad      a     2
#> 5   Good      a     4
#> 6   Good    any     1
#> 7   Good     as     1
#> 8    Bad   blue     1
#> 9   Good     by     1
#> 10   Bad     is     2
#> 11  Good     is     2
#> 12  Good   name     1
#> 13  Good  other     1
#> 14   Bad    red     1
#> 15   Bad   rose     1
#> 16  Good   rose     4
#> 17  Good  smell     1
#> 18  Good  sweet     1
#> 19   Bad violet     1
#> 20  Good  would     1

# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))
#>   text     term count
#> 1    c        !     2
#> 2    a        .     1
#> 3    b  another     1
#> 4    a      one     1
#> 5    a sentence     1