UTF-8 text validation, normalization, formatting, and printing.
as_utf8(x, normalize = FALSE)

utf8_valid(x)

utf8_normalize(x, map_case = FALSE, map_compat = FALSE,
               map_quote = FALSE, remove_ignorable = FALSE)

utf8_encode(x, display = FALSE)

utf8_format(x, trim = FALSE, chars = NULL, justify = "left",
            width = NULL, na.encode = TRUE, quote = FALSE,
            na.print = NULL, print.gap = NULL, ...)

utf8_print(x, chars = NULL, quote = TRUE, na.print = NULL,
           print.gap = NULL, right = FALSE, max = NULL,
           display = TRUE, ...)

utf8_width(x, encode = TRUE, quote = FALSE)
Argument | Description |
---|---|
x | character object. |
normalize | a logical value indicating whether to convert to Unicode composed normal form (NFC). |
map_case | a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents. |
map_compat | a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for NFKC and NFKD normal forms. |
map_quote | a logical value indicating whether to replace curly single quotes and Unicode apostrophe characters with ASCII apostrophe (U+0027). |
remove_ignorable | a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens. |
display | logical scalar indicating whether to optimize the encoding for display, not byte-for-byte data transmission. |
trim | logical scalar indicating whether to suppress padding spaces around elements. |
chars | integer scalar indicating the maximum number of character units to display. Wide characters like emoji take two character units; combining marks and default ignorables take none. Longer strings get truncated and suffixed or prefixed with an ellipsis ("…", or "..." in the C locale). |
justify | justification, for example "left" (the default), "right", or "centre". |
width | the minimum field width; set to NULL (the default) for no minimum. |
na.encode | logical scalar indicating whether to encode NA values as character strings. |
quote | logical scalar indicating whether to put surrounding double-quotes (") around character strings and escape internal double-quotes. |
na.print | character string (or NULL) indicating the encoding for NA values. |
print.gap | non-negative integer (or NULL) giving the number of spaces in gaps between columns. |
right | logical scalar indicating whether to right-justify character strings. |
max | non-negative integer (or NULL) indicating the maximum number of elements to print. |
encode | whether to encode the object before measuring its width. |
... | further arguments passed from other methods. Ignored. |
as_utf8 converts a character object from its declared encoding to a valid UTF-8 character object, or throws an error if no conversion is possible. If normalize = TRUE, then the text gets transformed to Unicode composed normal form (NFC) after conversion to UTF-8.
utf8_valid tests whether the elements of a character object can be translated to valid UTF-8 strings.
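The following is a minimal sketch of how utf8_valid and as_utf8 might be combined to detect and repair a mis-declared encoding. It assumes the utf8 package is installed; the variable names (ok, converted, fixed) are illustrative only, and latin-1 is assumed to be the true encoding of the invalid element.

```r
library(utf8)  # assumption: the utf8 package is installed

# the second element holds latin-1 bytes but is declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile")
Encoding(x) <- c("UTF-8", "UTF-8")

# one logical value per element; FALSE flags the mis-declared string
ok <- utf8_valid(x)

# as_utf8 signals an error when conversion is impossible,
# so guard the call and fall back to NULL
converted <- tryCatch(as_utf8(x), error = function(e) NULL)

# declare the (assumed) true encoding of the invalid elements, then retry
Encoding(x[!ok]) <- "latin1"
fixed <- as_utf8(x)
```

Checking utf8_valid first makes it possible to repair only the offending elements rather than re-declaring the encoding of the whole vector.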
utf8_normalize converts the elements of a character object to Unicode normalized composed form (NFC) while applying the character maps specified by the map_case, map_compat, map_quote, and remove_ignorable arguments.
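As a rough illustration of the four character maps, the sketch below applies each one in isolation to a small input. It assumes the utf8 package is installed; the sample strings and variable names are made up for this example.

```r
library(utf8)  # assumption: the utf8 package is installed

# NFC composition: "A" followed by a combining ring becomes U+00C5
nfc <- utf8_normalize("\u0041\u030a")

# case mapping: uppercase letters map to their lowercase equivalents
folded <- utf8_normalize("HELLO", map_case = TRUE)

# compatibility mapping: the "fi" ligature (U+FB01) becomes the letters f, i
compat <- utf8_normalize("\ufb01", map_compat = TRUE)

# quote mapping: a curly apostrophe (U+2019) becomes ASCII U+0027
quoted <- utf8_normalize("don\u2019t", map_quote = TRUE)

# default ignorables: the soft hyphen (U+00AD) is removed
stripped <- utf8_normalize("co\u00adop", remove_ignorable = TRUE)
```

The maps can also be combined in a single call, for example when preparing text for case-insensitive matching.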
utf8_encode encodes a character object for printing on a UTF-8 device by escaping control characters and other non-printable characters. When display = TRUE, the function optimizes the encoding for display by removing default ignorable characters (soft hyphens, zero-width spaces, etc.) and placing zero-width spaces after wide emoji. When LC_CTYPE = "C", the function escapes all non-ASCII characters and gives the same results on all platforms.
utf8_format formats a character object for printing, optionally truncating long character strings.
utf8_print prints a character object after formatting it with utf8_format.
utf8_width returns the printed widths of the elements of a character object on a UTF-8 device or, when LC_CTYPE = "C", on an ASCII device. If the string is not printable on the device, for example if it contains a control code like "\n", then the result is NA. If encode = TRUE, the default, then the function returns the widths of the encoded elements (via utf8_encode); otherwise, the function returns the widths of the original elements. If quote = TRUE, then utf8_width returns the widths of the quoted values (enclosing the argument in double quotes, and replacing internal quotes with \").
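The interaction between encode and quote can be sketched with an ASCII-only input, so that the result should not depend on the locale. This assumes the utf8 package is installed and that utf8_encode escapes a newline as the two characters \n; the variable names are illustrative.

```r
library(utf8)  # assumption: the utf8 package is installed

s <- "a\nb"  # contains a control code, so it is not printable as-is

# width of the original element: NA because of the newline
w_raw <- utf8_width(s, encode = FALSE)

# width of the encoded element (the default), where the newline
# has been replaced by an escape sequence
w_enc <- utf8_width(s)

# width including the surrounding double quotes
w_quoted <- utf8_width("abc", quote = TRUE)
```

Using encode = FALSE is a quick way to find elements that would need escaping before they can be laid out in fixed-width columns.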
For as_utf8, utf8_normalize, and utf8_encode, the result is a character object with the same attributes as x but with Encoding set to "UTF-8".
For utf8_print, the function returns x invisibly.
For utf8_valid or utf8_width, the result is a logical or integer object, respectively, with the same names, dim, and dimnames as x.
See also: as_corpus_text, iconv.
# the second element is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# attempt to convert to UTF-8 (fails)
# NOT RUN {
as_utf8(x)
# }

y <- x
Encoding(y[2]) <- "latin1" # mark the correct encoding
as_utf8(y) # succeeds
#> [1] "façile" "façile" "façile"

# test for valid UTF-8
utf8_valid(x)
#> [1] TRUE FALSE TRUE

# normalize text
angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE

# encoding
utf8_encode(x)
#> [1] "façile" "fa\\xe7ile" "fa\\xc3\\xa7ile"

# formatting
utf8_format(x, chars = 3)
#> [1] "faç…" "fa… " "fa\\xe2\\x80\\xa6 "
utf8_format(x, chars = 3, justify = "centre", width = 10)
#> [1] "faç…" "fa… " "fa\\xe2\\x80\\xa6 "
utf8_format(x, chars = 3, justify = "right")
#> [1] "…ile" "…ile" "\\xe2\\x80\\xa6ile"

# get widths
utf8_width(x)
#> [1] 6 9 13
utf8_width(x, encode = FALSE)
#> [1] 6 NA 6
utf8_width('"')
#> [1] 1
utf8_width('"', quote = TRUE)
#> [1] 4

# printing (assumes that output is capable of displaying Unicode 10.0.0)
print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606\U0001f607\U0001f608\U0001f609\U0001f60a\U0001f60b\U0001f60c\U0001f60d\U0001f60e\U0001f60f\U0001f610\U0001f611\U0001f612\U0001f613\U0001f614\U0001f615\U0001f616\U0001f617\U0001f618\U0001f619\U0001f61a\U0001f61b\U0001f61c\U0001f61d\U0001f61e\U0001f61f\U0001f620\U0001f621\U0001f622\U0001f623\U0001f624\U0001f625\U0001f626\U0001f627\U0001f628\U0001f629\U0001f62a\U0001f62b\U0001f62c\U0001f62d\U0001f62e\U0001f62f\U0001f630\U0001f631\U0001f632\U0001f633\U0001f634\U0001f635\U0001f636\U0001f637\U0001f638\U0001f639\U0001f63a\U0001f63b\U0001f63c\U0001f63d\U0001f63e\U0001f63f\U0001f640\U0001f641\U0001f642\U0001f643\U0001f644\U0001f645\U0001f646\U0001f647\U0001f648\U0001f649\U0001f64a\U0001f64b\U0001f64c\U0001f64d\U0001f64e\U0001f64f"
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣…"
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

# in C locale, output ASCII (same results on all platforms)
oldlocale <- Sys.getlocale("LC_CTYPE")
invisible(Sys.setlocale("LC_CTYPE", "C")) # switch to C locale
utf8_print(intToUtf8(0x1F600 + 0:79))
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606..."
invisible(Sys.setlocale("LC_CTYPE", oldlocale)) # switch back to old locale