Create or test for text objects.
as_corpus_text(x, filter = NULL, ..., names = NULL)
is_corpus_text(x)
x: object to be coerced or tested.
...: text filter properties to set on the result.
The corpus_text type is a new data type provided by the corpus package, suitable for processing international (Unicode) text. Text vectors
behave like character vectors (and can be converted to them with the
as.character function). They can be created using the
read_ndjson function or by converting another object using the
as_corpus_text function.
All text objects have a text_filter property that specifies how to
transform the text into tokens or segment it into sentences.
The default behavior for as_corpus_text is to proceed as follows:
1. If x is a character vector, then we create a text vector from x.
2. If x is a data frame, then we call as_corpus_text on x$text if a column named "text" exists in the data frame. If the data frame does not have a column named "text", then we fail with an error message.
3. If x is a corpus_text object, then we drop all attributes and we set the class to "corpus_text".
4. If none of the above conditions are true, then we call as.character on the object first, preserving the names, and then call as_corpus_text on the returned character object.
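The dispatch rules above can be sketched as follows. This is a minimal illustration that assumes the corpus package is installed; the variable names and sample strings are made up for the example.

```r
library(corpus)  # assumption: the corpus package is installed

# Character vector: converted directly to a text vector.
from_chr <- as_corpus_text(c("One fish.", "Two fish."))

# Data frame: converted via its "text" column.
df <- data.frame(id = 1:2, text = c("Red fish.", "Blue fish."),
                 stringsAsFactors = FALSE)
from_df <- as_corpus_text(df)

# Data frame without a "text" column: the call fails with an error.
no_text <- data.frame(id = 1:2)
failed <- inherits(try(as_corpus_text(no_text), silent = TRUE), "try-error")

# Any other object: passed through as.character() first.
from_num <- as_corpus_text(c(1.5, 2))

is_corpus_text(from_chr)
is_corpus_text(from_df)
as.character(from_num)
```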
In all cases, when names is NULL, we set the result names to names(x) (or rownames(x) for a data frame argument). When names is a character vector, we set the result names to this vector of names.
Similarly, when filter is NULL, we set the result text filter to text_filter(x). When filter is non-NULL, we set the result text filter to this value. In either case,
if there are additional named arguments, then we override the filter
properties specified by the names of these arguments with the new values.
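As a rough illustration of the names and filter handling, the sketch below assumes the corpus package is installed and that map_case is one of the available text filter properties. The sample vector and variable names are invented for the example.

```r
library(corpus)  # assumption: the corpus package is installed

x <- c(a = "Goodnight", b = "Moon")

# With names = NULL (the default), names are taken from the input.
kept <- as_corpus_text(x)

# An explicit character vector of names replaces them.
renamed <- as_corpus_text(x, names = c("p1", "p2"))

# A named argument in ... overrides the filter property of the same name
# (map_case is assumed here to be a valid filter property).
unfolded <- as_corpus_text(x, map_case = FALSE)

names(kept)
names(renamed)
text_filter(unfolded)$map_case
```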
Note that the special handling for the names of the object is different
from the other R conversion functions (as.character, etc.), which drop the names.
as_corpus_text is generic: you can write methods to handle specific
classes of objects.
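One way such a method might look is sketched below. The my_note class, its body and source fields, and the sample text are hypothetical, chosen only for illustration. The method simply delegates to the existing conversion of a character vector and forwards the filter, names, and any extra filter properties.

```r
library(corpus)  # assumption: the corpus package is installed

# Hypothetical class: a list holding a named character vector and a label.
note <- structure(
  list(body = c(n1 = "Call me Ishmael."), source = "demo"),
  class = "my_note"
)

# Method for the hypothetical class: convert the character payload and
# pass along the filter, extra filter properties, and names unchanged.
as_corpus_text.my_note <- function(x, filter = NULL, ..., names = NULL) {
  as_corpus_text(x$body, filter = filter, ..., names = names)
}

txt <- as_corpus_text(note)
is_corpus_text(txt)
names(txt)
```

Inside a package, a method like this would normally be registered with an S3method() directive in the NAMESPACE file rather than defined at top level.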
as_corpus_text attempts to coerce its argument to text type and set its names and text_filter properties; it strips
all other attributes.
is_corpus_text returns TRUE or FALSE depending on
whether its argument is of text type or not.
as_corpus_text("hello, world!")
#> [1] "hello, world!"

as_corpus_text(c(a = "goodnight", b = "moon")) # keeps names
#> a           b
#> "goodnight" "moon"

# set a filter property
as_corpus_text(c(a = "goodnight", b = "moon"), stemmer = "english")
#> a           b
#> "goodnight" "moon"

is_corpus_text("hello") # FALSE, "hello" is character, not text
#> [1] FALSE