Make a stemmer from a set of (term, stem) pairs.
new_stemmer(term, stem, default = NULL, duplicates = "first", vectorize = TRUE)
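| Argument | Description |
|---|---|
| term | character vector of terms to stem. |
| stem | character vector the same length as term, giving the corresponding stems. |
| default | if non-NULL, the value to return for terms absent from the term argument; if NULL, such terms are left unchanged. |
| duplicates | action to take for duplicates in the term argument: "first", "last", "omit", or "fail". |
| vectorize | whether to produce a vectorized stemmer that accepts and returns vector arguments. |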
Given a list of terms and a corresponding list of stems, this produces a
function that maps each term to its corresponding stem. If
default = NULL, then inputs absent from the term argument
are left as-is; otherwise, they are replaced by the default value.
The duplicates argument indicates the action to take if
there are duplicate entries in the term argument:

- duplicates = "first": take the first matching entry in the stem list.
- duplicates = "last": take the last matching entry in the stem list.
- duplicates = "omit": use the default value for duplicated terms.
- duplicates = "fail": raise an error if there are duplicated terms.
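
As a rough illustration of the duplicates options, the sketch below builds stemmers from a small made-up term/stem table in which "running" appears twice. The vectors term and stem are hypothetical example data, and the comments describe the behavior documented above rather than captured output. It assumes the corpus package is installed.

```r
library(corpus)

# Hypothetical lookup table: "running" is listed twice with different stems.
term <- c("running", "ran", "running")
stem <- c("run",     "run", "runn")

# "first" keeps the first entry for a duplicated term.
first <- new_stemmer(term, stem, duplicates = "first")
first("running")   # expected per the description above: "run"

# "last" keeps the last entry for a duplicated term.
last <- new_stemmer(term, stem, duplicates = "last")
last("running")    # expected per the description above: "runn"

# "omit" falls back to the default (here NA) for duplicated terms only;
# the non-duplicated term "ran" should still map to its stem.
omit <- new_stemmer(term, stem, default = NA, duplicates = "omit")
omit(c("running", "ran"))

# "fail" should raise an error at construction time; record whether it did.
failed <- tryCatch({
    new_stemmer(term, stem, duplicates = "fail")
    FALSE
}, error = function(e) TRUE)
```

Choosing "fail" can be useful when the table is supposed to be a clean dictionary and a duplicate signals a data problem, whereas "first" or "last" suit tables built by appending overrides.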
By default, with vectorize = TRUE, the resulting stemmer accepts a
character vector as input and returns a character vector of the same length
with entries giving the stems of the corresponding input entries.
Setting vectorize = FALSE gives a function that accepts a single input
and returns a single output. This can be more efficient when used as part of
a text_filter.
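
The following sketch shows one way a non-vectorized stemmer might be plugged into a text_filter. The term/stem table is made-up example data, and passing a function through the stemmer argument of text_filter is assumed to be supported by the installed version of corpus.

```r
library(corpus)

# Hypothetical lookup table for illustration only.
term <- c("running", "ran")
stem <- c("run",     "run")

# Scalar stemmer: takes one term, returns one stem.
scalar_stemmer <- new_stemmer(term, stem, vectorize = FALSE)
scalar_stemmer("running")

# Assumption: text_filter accepts a stemming function via `stemmer`.
f <- text_filter(stemmer = scalar_stemmer)
tokens <- text_tokens("She ran while running", f)
tokens
```

The scalar form avoids the overhead of vectorizing when the caller already supplies terms one at a time.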
See also: stem_snowball, text_filter, text_tokens.
```r
# map uppercase to lowercase, leave others unchanged
stemmer <- new_stemmer(LETTERS, letters)
stemmer(c("A", "E", "I", "O", "U", "1", "2", "3"))
#> [1] "a" "e" "i" "o" "u" "1" "2" "3"

# map uppercase to lowercase, drop others
stemmer <- new_stemmer(LETTERS, letters, default = NA)
stemmer(c("A", "E", "I", "O", "U", "1", "2", "3"))
#> [1] "a" "e" "i" "o" "u" NA NA NA
```