Segment text into smaller units.
text_split(x, units = "sentences", size = 1, filter = NULL, ...)
text_nsentence(x, filter = NULL, ...)
| Argument | Description |
|---|---|
| x | a text or character vector. |
| units | the block size units, either "sentences" or "tokens". |
| size | the block size, a positive integer giving the maximum number of units per block. |
| filter | if non-NULL, a text filter that defines the sentence and token boundaries. |
| ... | additional properties to set on the text filter. |
text_split splits text into roughly evenly-sized blocks,
measured in the specified units. When units = "sentences",
units are sentences; when units = "tokens", units are
non-NA tokens. The size parameter specifies the
maximum block size.
When the maximum block size does not evenly divide the total number of units in a text, the block sizes may not be exactly equal. However, no block will have more than one unit more than any other block. Any extra units are allocated to the first blocks in the split.
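The allocation rule can be sketched in base R. This is only an illustration of the rule as described here, not the package's implementation: `block_sizes` is a hypothetical helper, and it assumes the number of blocks is the smallest one that respects the maximum size. Under that assumption it reproduces the block sizes shown in the 26-token examples on this page.

```r
# Hypothetical helper (not part of the package): sizes of the blocks
# obtained when n units are split with a maximum block size of `size`.
block_sizes <- function(n, size) {
  stopifnot(n >= 0, size >= 1)
  if (n == 0) return(integer(0))
  nblock <- ceiling(n / size)  # assumed: fewest blocks that respect the maximum
  base   <- n %/% nblock       # every block gets at least this many units
  extra  <- n %% nblock        # leftover units go to the first blocks
  as.integer(base + (seq_len(nblock) <= extra))
}

block_sizes(26, 4)   # 4 4 4 4 4 3 3
block_sizes(26, 16)  # 13 13
```

Note that with `size = 16` neither block reaches the maximum: the units are spread evenly rather than filling the first block.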
Sentences and tokens are defined by the filter argument.
The documentation for text_tokens describes the
tokenization rules. For sentence boundaries, see the
‘Sentences’ section below.
Sentences are defined according to a tailored version of the boundaries specified by Unicode Standard Annex #29, Section 5.
The UAX 29 sentence boundaries handle Unicode correctly and they give
reasonable behavior across a variety of languages, but they do not
handle abbreviations correctly and by default they treat carriage
returns and line feeds as paragraph separators, often leading to
incorrect breaks. To get around these shortcomings, the
text filter allows tailoring the UAX 29 rules using the
sent_crlf and the sent_suppress properties.
The UAX 29 rules break after full stops (periods) whenever they are
followed by uppercase letters. Under these rules, the text
"I saw Mr. Jones today." gets split into two sentences. To get
around this, we allow a sent_suppress property, a list of sentence
break suppressions which, when followed by uppercase characters, do
not signal the end of a sentence.
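To make the idea of a suppression list concrete, here is a deliberately naive base R sketch. It is not the UAX 29 algorithm and not how the package segments text: `naive_sentences` is a hypothetical function that breaks only on whitespace between a full stop and an uppercase letter, then undoes a break when the preceding word is in the suppression list.

```r
# Toy illustration only: a drastic simplification of the rule described above.
naive_sentences <- function(x, suppress = character()) {
  # candidate breaks: whitespace preceded by "." and followed by an uppercase letter
  pieces <- strsplit(x, "(?<=\\.)\\s+(?=[A-Z])", perl = TRUE)[[1]]
  out <- character()
  for (p in pieces) {
    n <- length(out)
    # last word of the previous piece, e.g. "Mr."
    if (n > 0 && sub(".*\\s", "", out[n]) %in% suppress) {
      out[n] <- paste(out[n], p)  # suppressed: merge back into one sentence
    } else {
      out <- c(out, p)
    }
  }
  out
}

naive_sentences("I saw Mr. Jones today.")                    # two pieces
naive_sentences("I saw Mr. Jones today.", suppress = "Mr.")  # one piece
```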
The UAX 29 rules also specify that a carriage return (CR) or line
feed (LF) indicates the end of a sentence, so that
"A split\nsentence." gets split into two sentences. This often
leads to incorrect breaks, so by default, with sent_crlf = FALSE,
we deviate from the UAX 29 rules and we treat CR and LF like spaces.
To break sentences on CRLF, CR, and LF, specify sent_crlf = TRUE.
text_split returns a data frame with three columns named
parent, index, and text, and one row for each
text block. The columns are as follows:
The parent column is a factor. The levels of this
factor are the names of as_corpus_text(x). Calling
as.integer on the parent column gives the index of
the parent text for each block.
The index column gives the integer index of the
block in its parent.
The text column gives the text of the block, a value of
type corpus_text (not a character vector).
text_nsentence returns a numeric vector with the same length
as x with each element giving the number of sentences in the
corresponding text.
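As a sketch of how the returned columns can be used, the following base R code works on a hand-built stand-in for the result of the first example on this page. The stand-in is an assumption made so the code runs without the package: its `text` column is a plain character vector, whereas text_split returns a corpus_text column.

```r
# Hand-built stand-in mirroring the first example's output (not a real
# text_split() result; `text` is plain character here, not corpus_text).
blocks <- data.frame(
  parent = factor(c(1, 2, 3, 3, 3, 3, 4, 4)),
  index  = c(1L, 1L, 1L, 2L, 3L, 4L, 1L, 2L),
  text   = c("I saw Mr. Jones today.", "Split across\na line.",
             "What.", "Are.", "You.", "Doing????",
             "She asked 'do you really mean that?'", "and I said 'yes.'"),
  stringsAsFactors = FALSE
)

parent_id <- as.integer(blocks$parent)          # index of each block's parent text
n_blocks  <- tabulate(parent_id)                # number of blocks per parent text
by_parent <- split(blocks$text, blocks$parent)  # block texts grouped by parent
```

With one sentence per block, `n_blocks` is 1 1 4 2, which agrees with the text_nsentence output shown in the examples.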
text <- c("I saw Mr. Jones today.",
          "Split across\na line.",
          "What. Are. You. Doing????",
          "She asked 'do you really mean that?' and I said 'yes.'")

# split text into sentences
text_split(text, units = "sentences")
#>   parent index text
#> 1 1          1 I saw Mr. Jones today.
#> 2 2          1 Split across\na line.
#> 3 3          1 What.
#> 4 3          2 Are.
#> 5 3          3 You.
#> 6 3          4 Doing????
#> 7 4          1 She asked 'do you really mean that?'
#> 8 4          2 and I said 'yes.'

# get the number of sentences
text_nsentence(text)
#> [1] 1 1 4 2

# disable the default sentence suppressions
text_split("I saw Mr. Jones today.", units = "sentences", filter = NULL)
#>   parent index text
#> 1 1          1 I saw Mr. Jones today.

# break on CR and LF
text_split("Split across\na line.", units = "sentences",
           filter = text_filter(sent_crlf = TRUE))
#>   parent index text
#> 1 1          1 Split across\n
#> 2 1          2 a line.

# 2-sentence blocks
text_split(c("What. Are. You. Doing????",
             "She asked 'do you really mean that?' and I said 'yes.'"),
           units = "sentences", size = 2)
#>   parent index text
#> 1 1          1 What. Are.
#> 2 1          2 You. Doing????
#> 3 2          1 She asked 'do you really mean that?' and I said 'yes.'

# 4-token blocks
text_split(c("What. Are. You. Doing????",
             "She asked 'do you really mean that?' and I said 'yes.'"),
           units = "tokens", size = 4)
#>   parent index text
#> 1 1          1 What. Are.
#> 2 1          2 You. Doing?
#> 3 1          3 ???
#> 4 2          1 She asked 'do
#> 5 2          2 you really mean that
#> 6 2          3 ?' and
#> 7 2          4 I said '
#> 8 2          5 yes.'

# blocks are approximately evenly sized; 'size' gives maximum size
text_split(paste(letters, collapse = " "), "tokens", 4)
#>   parent index text
#> 1 1          1 a b c d
#> 2 1          2 e f g h
#> 3 1          3 i j k l
#> 4 1          4 m n o p
#> 5 1          5 q r s t
#> 6 1          6 u v w
#> 7 1          7 x y z

text_split(paste(letters, collapse = " "), "tokens", 16)
#>   parent index text
#> 1 1          1 a b c d e f g h i j k l m
#> 2 1          2 n o p q r s t u v w x y z