Segment text into tokens, each of which is an instance of a particular ‘type’.
text_tokens(x, filter = NULL, ...)
text_ntoken(x, filter = NULL, ...)
Argument | Description |
---|---|
x | object to be tokenized. |
filter | if non-NULL, a text filter to use instead of the default text filter for x. |
... | additional properties to set on the text filter. |
text_tokens splits texts into token sequences. Each token is an instance of a particular type. This operation proceeds in a series of stages, controlled by the filter argument:
First, we segment the text into words and spaces using the boundaries defined by Unicode Standard Annex #29, Section 4, with special handling for @mentions, #hashtags, and URLs.
Next, we normalize the words by applying the character mappings indicated by the map_case, map_quote, and remove_ignorable properties. We replace each sequence of spaces with a single space (U+0020). At the end of the second stage, we have segmented the text into a sequence of normalized words and spaces, in Unicode composed normal form (NFC).
In the third stage, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that term; otherwise, we leave the word as-is. We drop spaces at this point, unless they are part of a multi-word term. See the ‘Combining words’ section below for more details.
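The left-to-right, longest-match scan can be sketched in base R as follows. This is only an illustration, not the package's implementation: combine_words is a hypothetical helper, it assumes the words are already normalized and that multi-word terms are separated by single spaces, and it applies the connector immediately rather than in the final stage.

```r
# Hypothetical helper (not part of the package): greedy longest-match
# combination over a vector of already-normalized words.
combine_words <- function(words, terms, connector = "_") {
  # The longest term, measured in words, bounds how far ahead we look.
  max_len <- max(c(1L, lengths(strsplit(terms, " ", fixed = TRUE))))
  out <- character(0)
  n <- length(words)
  i <- 1L
  while (i <= n) {
    matched <- FALSE
    # Try the longest candidate first, then progressively shorter ones.
    for (len in seq(min(max_len, n - i + 1L), 1L)) {
      candidate <- paste(words[i:(i + len - 1L)], collapse = " ")
      if (candidate %in% terms) {
        out <- c(out, gsub(" ", connector, candidate, fixed = TRUE))
        i <- i + len
        matched <- TRUE
        break
      }
    }
    if (!matched) {
      out <- c(out, words[i])  # no term starts here; keep the word as-is
      i <- i + 1L
    }
  }
  out
}

words <- c("from", "new", "york", "city", ",", "new", "york", ".")
combined <- combine_words(words, c("new york", "new york city"))
print(combined)
```

Because the longer candidate is tried first, "new york city" wins over "new york" wherever both could match at the same position.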
Next, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector. Terms that stem to NA get dropped from the sequence.
After stemming, we categorize each remaining token as "letter", "number", "punct", or "symbol" according to the first character in the word. For words that start with extenders like underscore (_), we use the first non-extender character to classify them.
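A rough base R approximation of this classification rule is shown below. It is a sketch under stated assumptions, not the package's code: classify_token is a hypothetical helper, underscore is the only extender it handles, and it relies on the Unicode general categories L, N, and P via Perl-style regular expressions, treating everything else as a symbol.

```r
# Hypothetical helper (not part of the package): classify a token by its
# first non-extender character.
classify_token <- function(token) {
  # Skip leading underscores, the only extender this sketch handles.
  core <- sub("^_+", "", token)
  if (!nzchar(core)) {
    core <- token  # the token consists solely of underscores
  }
  first <- substr(core, 1L, 1L)
  if (grepl("^\\p{L}", first, perl = TRUE)) {
    "letter"
  } else if (grepl("^\\p{N}", first, perl = TRUE)) {
    "number"
  } else if (grepl("^\\p{P}", first, perl = TRUE)) {
    "punct"
  } else {
    "symbol"  # fallback: anything that is not a letter, number, or punctuation
  }
}

categories <- vapply(c("fox", "32.3", "?", "_private", "$"),
                     classify_token, character(1), USE.NAMES = FALSE)
print(categories)
```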
If any of drop_letter, drop_number, drop_punct, or drop_symbol are TRUE, then we drop the tokens in the corresponding categories. We also drop any terms that match an element of the drop character vector. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore tokens that match elements of that vector to their values prior to dropping.
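The interaction between the drop rules and their exceptions can be illustrated with a small base R sketch. The helper apply_drop and its arguments are hypothetical and only mimic the behavior described above; the token categories are supplied by hand rather than computed.

```r
# Hypothetical helper (not part of the package): apply category-based and
# term-based drop rules, then honor the exceptions.
apply_drop <- function(tokens, category, drop_category = character(0),
                       drop = character(0), drop_except = character(0)) {
  dropped <- category %in% drop_category | tokens %in% drop
  # Exceptions restore tokens that would otherwise be dropped.
  dropped <- dropped & !(tokens %in% drop_except)
  tokens[!dropped]
}

tokens   <- c("0", ",", "1", ",", "2", ",", "3", ",", "4", ",", "5")
category <- ifelse(tokens == ",", "punct", "number")

kept <- apply_drop(tokens, category, drop_category = "number",
                   drop_except = c("0", "2", "4"))
print(kept)
```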
Finally, we replace sequences of white-space in the terms with the specified connector, which defaults to a low line character (_, U+005F).
Multi-word terms specified by the combine property can be specified as tokens, prior to normalization. Terms specified by the stem_except, drop, and drop_except properties need to be normalized and stemmed (if stemmer is non-NULL). Thus, for example, if map_case = TRUE, then a token filter with combine = "Mx." produces the same results as a token filter with combine = "mx.". However, drop = "Mx." behaves differently from drop = "mx.".
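The asymmetry between combine and drop can be mimicked in base R. This is a simplified model rather than the package's logic: normalize is a hypothetical stand-in for the map_case mapping, and the sketch assumes that drop terms are compared against the normalized tokens exactly as given.

```r
# Stand-in for the map_case normalization stage (an assumption of this sketch).
normalize <- function(x) tolower(x)

tokens <- normalize(c("Mx.", "Smith"))  # tokens after normalization

# combine terms are normalized before matching, so both spellings match:
combine_upper <- normalize("Mx.") %in% tokens
combine_lower <- normalize("mx.") %in% tokens

# drop terms are compared as given, so only the normalized spelling matches:
drop_upper <- "Mx." %in% tokens
drop_lower <- "mx." %in% tokens

print(c(combine_upper = combine_upper, combine_lower = combine_lower,
        drop_upper = drop_upper, drop_lower = drop_lower))
```

In this model, only the un-normalized drop term fails to match, which is why drop terms should be written in their normalized (and, where applicable, stemmed) form.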
The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.
text_tokens returns a list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
text_ntoken returns a numeric vector the same length as x, with each element giving the number of tokens in the corresponding text.
stopwords, text_filter, text_types.
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?")
#> [[1]]
#>  [1] "the"   "quick" "("     "'"     "brown" "'"     ")"     "fox"   "can't"
#> [10] "jump"  "32.3"  "feet"  ","     "right" "?"
#>

# count tokens:
text_ntoken("The quick ('brown') fox can't jump 32.3 feet, right?")
#> [1] 15

# don't change case or quotes:
f <- text_filter(map_case = FALSE, map_quote = FALSE)
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?", f)
#> [[1]]
#>  [1] "The"   "quick" "("     "'"     "brown" "'"     ")"     "fox"   "can't"
#> [10] "jump"  "32.3"  "feet"  ","     "right" "?"
#>

# drop common function words ('stop' words):
text_tokens("Able was I ere I saw Elba.", text_filter(drop = stopwords_en))
#> [[1]]
#> [1] "able" "ere"  "saw"  "elba" "."
#>

# drop numbers, with some exceptions:
text_tokens("0, 1, 2, 3, 4, 5",
            text_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))
#> [[1]]
#> [1] "0" "," "," "2" "," "," "4" ","
#>

# apply stemming...
text_tokens("Mary is running", text_filter(stemmer = "english"))
#> [[1]]
#> [1] "mari" "is"   "run"
#>

# ...except for certain words
text_tokens("Mary is running",
            text_filter(stemmer = "english", stem_except = "mary"))
#> [[1]]
#> [1] "mary" "is"   "run"
#>

# default tokenization
text_tokens("Ms. Jones")
#> [[1]]
#> [1] "ms"    "."     "jones"
#>

# combine abbreviations
text_tokens("Ms. Jones", text_filter(combine = abbreviations_en))
#> [[1]]
#> [1] "ms."   "jones"
#>

# add custom combinations
text_tokens("Ms. Jones is from New York City, New York.",
            text_filter(combine = c(abbreviations_en, "new york", "new york city")))
#> [[1]]
#> [1] "ms."           "jones"         "is"            "from"
#> [5] "new_york_city" ","             "new_york"      "."
#>