Get or specify the process by which text gets transformed into a sequence of tokens or sentences.
text_filter(x = NULL, ...)
text_filter(x) <- value

# S3 method for corpus_text
text_filter(x = NULL, ...)

# S3 method for data.frame
text_filter(x = NULL, ...)

# S3 method for default
text_filter(x = NULL, ..., map_case = TRUE, map_quote = TRUE,
            remove_ignorable = TRUE, combine = NULL, stemmer = NULL,
            stem_dropped = FALSE, stem_except = NULL, drop_letter = FALSE,
            drop_number = FALSE, drop_punct = FALSE, drop_symbol = FALSE,
            drop = NULL, drop_except = NULL, connector = "_",
            sent_crlf = FALSE, sent_suppress = corpus::abbreviations_en)
x | text or corpus object. |
---|---|
value | text filter object, or NULL for the default. |
... | further arguments passed to or from other methods. |
map_case | a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents. |
map_quote | a logical value indicating whether to replace curly single quotes and other Unicode apostrophe characters with ASCII apostrophe (U+0027). |
remove_ignorable | a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens. |
combine | a character vector of multi-word phrases to combine, or NULL. |
stemmer | a character value giving the name of a Snowball stemming algorithm, or NULL to leave words unchanged. |
stem_dropped | a logical value indicating whether to stem words in the drop list. |
stem_except | a character vector of exception words to exempt from stemming, or NULL. |
drop_letter | a logical value indicating whether to replace "letter" tokens with NA. |
drop_number | a logical value indicating whether to replace "number" tokens with NA. |
drop_punct | a logical value indicating whether to replace "punct" tokens with NA. |
drop_symbol | a logical value indicating whether to replace "symbol" tokens with NA. |
drop | a character vector of types to replace with NA, or NULL. |
drop_except | a character vector of types to exempt from the drop rules specified by the drop_letter, drop_number, drop_punct, drop_symbol, and drop arguments, or NULL. |
connector | a character to use as a connector in lieu of white space for types that stem to multi-word phrases. |
sent_crlf | a logical value indicating whether to break sentences on carriage returns or line feeds. |
sent_suppress | a character vector of sentence break suppressions. |
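
As a minimal sketch of how several of these arguments combine, the following builds a filter that preserves case, drops punctuation and English stop words, and exempts one word from the drop rules. The sample sentence and the choice of "not" as the exempted word are assumptions made for illustration; the printed tokens are not shown here because they depend on the installed version of the package.

```r
library(corpus)

# Keep the original case, drop punctuation and English stop words,
# but exempt "not" from the drop rules (illustrative choice).
f <- text_filter(map_case = FALSE,
                 drop_punct = TRUE,
                 drop = stopwords_en,
                 drop_except = "not")

# Filter properties can be read back with `$`.
f$map_case
f$drop_punct
f$drop_except

# Tokenize a sample sentence with the filter.
text_tokens("This is not the end.", f)
```

Unspecified properties, such as map_quote or connector, keep their default values.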
The set of properties in a text filter determines the tokenization and sentence breaking rules. See the documentation for text_tokens and text_split for details on the tokenization process.
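
The sketch below illustrates that one filter governs both sentence breaking and tokenization. The sample text and the particular settings (breaking on line feeds, disabling the default abbreviation suppressions) are assumptions chosen for illustration, and the resulting splits are printed rather than stated, since they depend on the installed version.

```r
library(corpus)

x <- as_corpus_text("I saw Mr. Jones today.\nHe waved. Then he left.")

# Illustrative settings: break sentences on line feeds and
# turn off the default English abbreviation suppressions.
text_filter(x) <- text_filter(sent_crlf = TRUE, sent_suppress = NULL)

# Sentence breaking follows the filter stored on x.
sents <- text_split(x, "sentences")
print(sents)

# Tokenization consults the same filter.
text_tokens(x)
```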
text_filter retrieves an object's text filter, optionally with modifications to some of its properties.

text_filter<- sets an object's text filter. Setting the text filter on a character object is not allowed; the object must have type "corpus_text" or be a data frame with a "text" column of type "corpus_text".
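
One way to handle the restriction on character objects, sketched here with hypothetical variable names (y, z, g, status), is to catch the error and convert the input with as_corpus_text before assigning. The last lines also illustrate that text_filter(x, ...) returns a modified copy and leaves the filter stored on the object unchanged.

```r
library(corpus)

f <- text_filter(drop_punct = TRUE)

# Assigning a filter to a plain character vector fails, so catch the error.
y <- "hello world"
status <- tryCatch({
    text_filter(y) <- f
    "set"
}, error = function(e) "error")
status

# Converting to corpus_text first makes the assignment valid.
z <- as_corpus_text(y)
text_filter(z) <- f
text_filter(z)$drop_punct

# Retrieving with modifications returns a copy; z's own filter is untouched.
g <- text_filter(z, map_case = FALSE)
g$map_case
text_filter(z)$map_case
```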
as_corpus_text, text_tokens, text_split, abbreviations, stopwords.
# text filter with default options set
text_filter()
#> Text filter with the following options:
#>
#> map_case: TRUE
#> map_quote: TRUE
#> remove_ignorable: TRUE
#> combine: NULL
#> stemmer: NULL
#> stem_dropped: FALSE
#> stem_except: NULL
#> drop_letter: FALSE
#> drop_number: FALSE
#> drop_punct: FALSE
#> drop_symbol: FALSE
#> drop: NULL
#> drop_except: NULL
#> connector: _
#> sent_crlf: FALSE
#> sent_suppress: chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." ...

# specify some options but leave others unchanged
f <- text_filter(map_case = FALSE, drop = stopwords_en)

# set the text filter property
x <- as_corpus_text(c("Marnie the Dog is #1 on the internet."))
text_filter(x) <- f
text_tokens(x) # by default, uses x's text_filter to tokenize
#> [[1]]
#> [1] "Marnie" "Dog" "#" "1" "internet" "."
#>

# change a filter property
f2 <- text_filter(x, map_case = TRUE)
# equivalent to:
# f2 <- text_filter(x)
# f2$map_case <- TRUE

text_tokens(x, f2) # override text_filter(x)
#> [[1]]
#> [1] "marnie" "dog" "#" "1" "internet" "."
#>

# setting text_filter on a data frame is allowed if it has a
# column named "text" of type "corpus_text"
d <- data.frame(text = x)
text_filter(d) <- f2
text_tokens(d)
#> [[1]]
#> [1] "marnie" "dog" "#" "1" "internet" "."
#>

# but you can't set text filters on character objects
y <- "hello world"
# NOT RUN {
text_filter(y) <- f2 # gives an error
# }

d2 <- data.frame(text = "hello world", stringsAsFactors = FALSE)
# NOT RUN {
text_filter(d2) <- f2 # gives an error
# }