Create or test for text objects.

as_corpus_text(x, filter = NULL, ..., names = NULL)

is_corpus_text(x)

Arguments

x

object to be coerced or tested.

filter

if non-NULL, a text filter for the converted result.

...

text filter properties to set on the result.

names

if non-NULL, a character vector of names for the converted result.

Details

The corpus_text type is a new data type provided by the corpus package suitable for processing international (Unicode) text. Text vectors behave like character vectors (and can be converted to them with the as.character function). They can be created using the read_ndjson function or by converting another object using the as_corpus_text function.
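As a minimal sketch of the round trip described above (assuming the corpus package is installed and attached; the sample strings are made up for illustration), a character vector can be converted to text and back:

```r
library(corpus)

# Unicode escapes keep the script portable across source encodings.
chr <- c("na\u00efve caf\u00e9", "hello, world!")

x <- as_corpus_text(chr)   # character -> corpus_text
is_corpus_text(x)          # should be TRUE
length(x)                  # one element per input string

back <- as.character(x)    # corpus_text -> character
is.character(back)
```

The text vector has the same length as its input, so it can usually be dropped in wherever a character vector was used.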

All text objects have a text_filter property specifying how to transform the text into tokens or segment it into sentences.

The default behavior for as_corpus_text is to proceed as follows:

  1. If x is a character vector, then we create a new text vector from x.

  2. If x is a data frame, then we call as_corpus_text on x$text if a column named "text" exists in the data frame. If the data frame does not have a column named "text", then we fail with an error message.

  3. If x is a corpus_text object, then we drop all attributes and we set the class to "corpus_text".

  4. If none of the above conditions holds, then we call as.character on the object first, preserving the names, and then call as_corpus_text on the returned character object.
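The conversion rules above can be exercised with a short sketch. The data frames and their row names here are invented for illustration, and the exact wording of the error in the second case is not assumed, only that an error is raised:

```r
library(corpus)

# Rule 1: a character vector becomes a text vector.
from_chr <- as_corpus_text(c("one", "two"))

# Rule 2: a data frame is converted through its "text" column.
df <- data.frame(text = c("one", "two"),
                 row.names = c("r1", "r2"),
                 stringsAsFactors = FALSE)
from_df <- as_corpus_text(df)

# Rule 2, failure case: no column named "text", so the call errors.
bad <- data.frame(body = "one", stringsAsFactors = FALSE)
failed <- tryCatch({
    as_corpus_text(bad)
    FALSE
}, error = function(e) TRUE)

# Rule 4: other objects go through as.character first.
from_num <- as_corpus_text(c(1.5, 2))
as.character(from_num)
```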

In all cases, when names is NULL, we set the result names to names(x) (or rownames(x) for a data frame argument). When names is a character vector, we set the result names to this vector of names.
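A small sketch of the two naming cases, assuming the corpus package is attached; the replacement names "first" and "second" are arbitrary:

```r
library(corpus)

x <- c(a = "goodnight", b = "moon")

# names = NULL (the default): names are taken from names(x).
kept <- as_corpus_text(x)
names(kept)

# names given explicitly: they replace names(x) on the result.
renamed <- as_corpus_text(x, names = c("first", "second"))
names(renamed)

# For contrast, base as.character drops the names entirely.
is.null(names(as.character(x)))
```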

Similarly, when filter is NULL, we set the result text filter to text_filter(x). When filter is non-NULL, we set the result text filter to this value. In either case, if there are additional named arguments, then we override the filter properties specified by the names of these arguments with the new values given.
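The filter handling can be sketched as follows. This assumes the corpus package is attached, that map_case and drop_punct are valid text filter properties, and that text_filter() called with only named arguments returns a filter with those properties set; check the text_filter help page before relying on these details:

```r
library(corpus)

# Named arguments set individual filter properties on the result.
x <- as_corpus_text("Hello, World!", map_case = FALSE)
text_filter(x)$map_case

# filter = NULL: the result keeps text_filter(x).
y <- as_corpus_text(x)
text_filter(y)$map_case

# A named argument overrides just that property.
z <- as_corpus_text(x, map_case = TRUE)
text_filter(z)$map_case

# filter non-NULL: the supplied filter is used, then overrides apply.
f <- text_filter(drop_punct = TRUE)
w <- as_corpus_text("Hello, World!", filter = f, map_case = FALSE)
text_filter(w)$drop_punct
text_filter(w)$map_case
```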

Note that the special handling for the names of the object is different from the other R conversion functions (as.numeric, as.character, etc.), which drop the names. as_corpus_text is generic: you can write methods to handle specific classes of objects.
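Because as_corpus_text is generic, a method for a custom class might look like the sketch below. The "memo" class and its body field are hypothetical, invented only to illustrate S3 dispatch; they are not part of the corpus package:

```r
library(corpus)

# Hypothetical class: a list with a character field named "body".
memo <- structure(list(body = c(m1 = "call me", m2 = "later")),
                  class = "memo")

# Method for the hypothetical "memo" class: extract the text and
# delegate to the existing methods, passing all arguments through.
as_corpus_text.memo <- function(x, filter = NULL, ..., names = NULL)
{
    as_corpus_text(x$body, filter = filter, ..., names = names)
}

txt <- as_corpus_text(memo)
is_corpus_text(txt)
names(txt)
```

Delegating to the character method keeps the names and filter handling consistent with the built-in conversions.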

Value

as_corpus_text attempts to coerce its argument to text type and set its names and text_filter properties; it strips all other attributes. is_corpus_text returns TRUE or FALSE depending on whether its argument is of text type or not.

See also

as_utf8, text_filter, read_ndjson.

Examples

as_corpus_text("hello, world!")
#> [1] "hello, world!"
as_corpus_text(c(a = "goodnight", b = "moon")) # keeps names
#> a           b
#> "goodnight" "moon"
# set a filter property
as_corpus_text(c(a = "goodnight", b = "moon"), stemmer = "english")
#> a           b
#> "goodnight" "moon"
is_corpus_text("hello") # FALSE, "hello" is character, not text
#> [1] FALSE