Segment text into smaller units.
text_split(x, units = "sentences", size = 1, filter = NULL, ...)
text_nsentence(x, filter = NULL, ...)
Argument | Description |
---|---|
x | a text or character vector. |
units | the block size units, either "sentences" or "tokens". |
size | the block size, a positive integer giving the maximum number of units per block. |
filter | if non-NULL, a text filter to use instead of the default text filter for x. |
... | additional properties to set on the text filter. |
text_split splits text into roughly evenly-sized blocks, measured in the specified units. When units = "sentences", units are sentences; when units = "tokens", units are non-NA tokens. The size parameter specifies the maximum block size.
When the units in a text cannot be divided evenly among its blocks, the block sizes will not be exactly equal. However, no block will have more than one unit more than any other block. The extra units are allocated to the first blocks in the split.
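The allocation rule above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name block_sizes is hypothetical, and the assumption that a text with n units is split into ceiling(n / size) blocks is inferred from the examples on this page rather than stated by the documentation.

```r
# Hypothetical helper (not part of the package): sizes of the blocks produced
# when `n` units are split with maximum block size `size`.
# Assumption: the number of blocks is the smallest count that respects `size`.
block_sizes <- function(n, size = 1L) {
  stopifnot(n >= 0, size >= 1)
  if (n == 0) return(integer(0))
  nblock <- as.integer(ceiling(n / size))  # fewest blocks with none over `size`
  base   <- n %/% nblock                   # units that every block receives
  extra  <- n %% nblock                    # leftover units, given to the first blocks
  base + as.integer(seq_len(nblock) <= extra)
}

block_sizes(26, 4)   # 4 4 4 4 4 3 3
block_sizes(26, 16)  # 13 13
```

Under that assumption, the sizes agree with the 26-token examples at the end of this page: seven blocks for size = 4 and two blocks of 13 for size = 16.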
Sentences and tokens are defined by the filter argument. The documentation for text_tokens describes the tokenization rules. For sentence boundaries, see the ‘Sentences’ section below.
Sentences are defined according to a tailored version of the boundaries specified by Unicode Standard Annex #29, Section 5.
The UAX 29 sentence boundaries handle Unicode correctly and they give
reasonable behavior across a variety of languages, but they do not
handle abbreviations correctly and by default they treat carriage
returns and line feeds as paragraph separators, often leading to
incorrect breaks. To get around these shortcomings, the text filter
allows tailoring the UAX 29 rules using the sent_crlf and the
sent_suppress properties.
The UAX 29 rules break after full stops (periods) whenever they are
followed by uppercase letters. Under these rules, the text
"I saw Mr. Jones today." gets split into two sentences. To get around
this, we allow a sent_suppress property, a list of sentence break
suppressions which, when followed by uppercase characters, do not
signal the end of a sentence.
The UAX 29 rules also specify that a carriage return (CR) or line
feed (LF) indicates the end of a sentence, so that
"A split\nsentence." gets split into two sentences. This often leads
to incorrect breaks, so by default, with sent_crlf = FALSE, we deviate
from the UAX 29 rules and treat CR and LF like spaces. To break
sentences on CRLF, CR, and LF, specify sent_crlf = TRUE.
text_split returns a data frame with three columns named parent, index, and text, and one row for each text block. The columns are as follows:
The parent column is a factor. The levels of this factor are the names of as_corpus_text(x). Calling as.integer on the parent column gives the index of the parent text for each block.
The index column gives the integer index of the block in its parent.
The text value is the text of the block, a value of type corpus_text (not a character vector).
text_nsentence returns a numeric vector with the same length as x, with each element giving the number of sentences in the corresponding text.
text <- c("I saw Mr. Jones today.",
          "Split across\na line.",
          "What. Are. You. Doing????",
          "She asked 'do you really mean that?' and I said 'yes.'")

# split text into sentences
text_split(text, units = "sentences")
#>   parent index text
#> 1 1          1 I saw Mr. Jones today.
#> 2 2          1 Split across\na line.
#> 3 3          1 What.
#> 4 3          2 Are.
#> 5 3          3 You.
#> 6 3          4 Doing????
#> 7 4          1 She asked 'do you really mean that?'
#> 8 4          2 and I said 'yes.'

# get the number of sentences
text_nsentence(text)
#> [1] 1 1 4 2

# filter = NULL keeps the default text filter, so the default
# sentence suppressions still apply and "Mr." does not end a sentence
text_split("I saw Mr. Jones today.", units = "sentences", filter = NULL)
#>   parent index text
#> 1 1          1 I saw Mr. Jones today.

# break on CR and LF
text_split("Split across\na line.", units = "sentences",
           filter = text_filter(sent_crlf = TRUE))
#>   parent index text
#> 1 1          1 Split across\n
#> 2 1          2 a line.

# 2-sentence blocks
text_split(c("What. Are. You. Doing????",
             "She asked 'do you really mean that?' and I said 'yes.'"),
           units = "sentences", size = 2)
#>   parent index text
#> 1 1          1 What. Are.
#> 2 1          2 You. Doing????
#> 3 2          1 She asked 'do you really mean that?' and I said 'yes.'

# 4-token blocks
text_split(c("What. Are. You. Doing????",
             "She asked 'do you really mean that?' and I said 'yes.'"),
           units = "tokens", size = 4)
#>   parent index text
#> 1 1          1 What. Are.
#> 2 1          2 You. Doing?
#> 3 1          3 ???
#> 4 2          1 She asked 'do
#> 5 2          2 you really mean that
#> 6 2          3 ?' and
#> 7 2          4 I said '
#> 8 2          5 yes.'

# blocks are approximately evenly sized; 'size' gives maximum size
text_split(paste(letters, collapse = " "), "tokens", 4)
#>   parent index text
#> 1 1          1 a b c d
#> 2 1          2 e f g h
#> 3 1          3 i j k l
#> 4 1          4 m n o p
#> 5 1          5 q r s t
#> 6 1          6 u v w
#> 7 1          7 x y z

text_split(paste(letters, collapse = " "), "tokens", 16)
#>   parent index text
#> 1 1          1 a b c d e f g h i j k l m
#> 2 1          2 n o p q r s t u v w x y z