UTF-8 Character Handling

UTF-8 text validation, normalization, formatting, and printing.

as_utf8(x, normalize = FALSE)

utf8_valid(x)

utf8_normalize(x, map_case = FALSE, map_compat = FALSE,
               map_quote = FALSE, remove_ignorable = FALSE)

utf8_encode(x, display = FALSE)

utf8_format(x, trim = FALSE, chars = NULL, justify = "left",
            width = NULL, na.encode = TRUE, quote = FALSE,
            na.print = NULL, print.gap = NULL, ...)

utf8_print(x, chars = NULL, quote = TRUE, na.print = NULL,
           print.gap = NULL, right = FALSE, max = NULL,
           display = TRUE, ...)

utf8_width(x, encode = TRUE, quote = FALSE)

Arguments

x	character object.
normalize	a logical value indicating whether to convert to Unicode composed normal form (NFC).
map_case	a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
map_compat	a logical value indicating whether to apply Unicode compatibility mappings to the characters, that is, the mappings required for the NFKC and NFKD normal forms.
map_quote	a logical value indicating whether to replace curly single quotes and Unicode apostrophe characters with ASCII apostrophe (U+0027).
remove_ignorable	a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
display	logical scalar indicating whether to optimize the encoding for display, not byte-for-byte data transmission.
trim	logical scalar indicating whether to suppress padding spaces around elements.
chars	integer scalar indicating the maximum number of character units to display. Wide characters like emoji take two character units; combining marks and default ignorables take none. Longer strings get truncated and suffixed or prefixed with an ellipsis (`"..."` in C locale, `"\u2026"` in others). Set to `NULL` to limit output to the line width as determined by `getOption("width")`.
justify	justification; one of `"left"`, `"right"`, `"centre"`, or `"none"`. Can be abbreviated.
width	the minimum field width; set to `NULL` or `0` for no restriction.
na.encode	logical scalar indicating whether to encode `NA` values as character strings.
quote	logical scalar indicating whether to put surrounding double-quotes (`'"'`) around character strings and escape internal double-quotes.
na.print	character string (or `NULL`) indicating the encoding for `NA` values. Ignored when `na.encode` is `FALSE`.
print.gap	non-negative integer (or `NULL`) giving the number of spaces in gaps between columns; set to `NULL` or `1` for a single space.
right	logical scalar indicating whether to right-justify character strings.
max	non-negative integer (or `NULL`) indicating the maximum number of elements to print; set to `getOption("max.print")` if argument is `NULL`.
encode	logical scalar indicating whether to encode the object before measuring its width.
...	further arguments passed from other methods. Ignored.

Details

as_utf8 converts a character object from its declared encoding to a valid UTF-8 character object, or throws an error if no conversion is possible. If normalize = TRUE, then the text gets transformed to Unicode composed normal form (NFC) after conversion to UTF-8.

utf8_valid tests whether the elements of a character object can be translated to valid UTF-8 strings.
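One pattern that follows from the two descriptions above is to test validity first and convert only once the declared encoding is right. The sketch below assumes the functions are loaded from the utf8 package (this page does not name the package) and that the mis-declared string is really Latin-1, as in the Examples section.

```r
library(utf8)  # assumption: the functions on this page are exported by utf8

# a Latin-1 byte sequence wrongly declared as UTF-8
x <- "fa\xE7ile"
Encoding(x) <- "UTF-8"

if (utf8_valid(x)) {
  y <- as_utf8(x)
} else {
  # as_utf8(x) would throw an error here; fix the declared encoding first
  Encoding(x) <- "latin1"
  y <- as_utf8(x)
}

Encoding(y)  # the converted string is marked as UTF-8
```

Checking with utf8_valid avoids wrapping as_utf8 in an error handler, but the right fallback encoding is something only the caller can know.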

utf8_normalize converts the elements of a character object to Unicode composed normal form (NFC) while applying the character maps specified by the map_case, map_compat, map_quote, and remove_ignorable arguments.
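As a rough illustration of the four character maps, the sketch below applies each one to a short string chosen for this purpose (the inputs are not from the original page). It assumes the functions are loaded from the utf8 package.

```r
library(utf8)  # assumption: the functions on this page are exported by utf8

# case mapping: uppercase letters become lowercase
lower <- utf8_normalize("HELLO", map_case = TRUE)

# compatibility mapping: the "fi" ligature (U+FB01) becomes the letters f, i
plain <- utf8_normalize("\ufb01le", map_compat = TRUE)

# quote mapping: the curly apostrophe (U+2019) becomes ASCII U+0027
ascii_quote <- utf8_normalize("don\u2019t", map_quote = TRUE)

# ignorable removal: the soft hyphen (U+00AD) is dropped
no_shy <- utf8_normalize("co\u00adop", remove_ignorable = TRUE)

c(lower, plain, ascii_quote, no_shy)
```

The maps can be combined in a single call, which is useful when preparing text for case-insensitive matching.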

utf8_encode encodes a character object for printing on a UTF-8 device by escaping control characters and other non-printable characters. When display = TRUE, the function optimizes the encoding for display by removing default ignorable characters (soft hyphens, zero-width spaces, etc.) and placing zero-width spaces after wide emoji. When LC_CTYPE = "C", the function escapes all non-ASCII characters and gives the same results on all platforms.
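The two behaviors described above, escaping control characters and dropping default ignorables for display, can be observed with a minimal sketch like the following. The input strings are illustrative, and the functions are assumed to come from the utf8 package.

```r
library(utf8)  # assumption: the functions on this page are exported by utf8

# a tab is a control character, so the encoded string holds an escape
# sequence rather than a raw tab
enc <- utf8_encode("a\tb")
has_raw_tab <- grepl("\t", enc, fixed = TRUE)

# with display = TRUE, the soft hyphen (U+00AD) is removed
shy <- utf8_encode("co\u00adop", display = TRUE)
has_soft_hyphen <- grepl("\u00ad", shy, fixed = TRUE)

c(has_raw_tab, has_soft_hyphen)
```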

utf8_format formats a character object for printing, optionally truncating long character strings.

utf8_print prints a character object after formatting it with utf8_format.
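The sketch below shows padding, trimming, justification, and truncation on a few made-up inputs, and checks that utf8_print hands back its argument. It assumes the functions are loaded from the utf8 package.

```r
library(utf8)  # assumption: the functions on this page are exported by utf8

x <- c("a", "bbb")
padded  <- utf8_format(x)               # elements padded to a common width
trimmed <- utf8_format(x, trim = TRUE)  # no padding spaces

# minimum field width of 5, text pushed to the right
right <- utf8_format("ab", width = 5, justify = "right")

# at most 5 character units are kept, followed by an ellipsis
long  <- "abcdefghijklmnopqrstuvwxyz"
short <- utf8_format(long, chars = 5)

# prints the truncated form and returns the original object invisibly
out <- utf8_print(long, chars = 5)
```

The exact ellipsis depends on the locale, as noted for the chars argument, so code that compares formatted output across platforms should not hard-code it.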

utf8_width returns the printed widths of the elements of a character object on a UTF-8 device or, when LC_CTYPE = "C", on an ASCII device. If the string is not printable on the device, for example if it contains a control code like "\n", then the result is NA. If encode = TRUE, the default, then the function returns the widths of the encoded elements (via utf8_encode); otherwise, the function returns the widths of the original elements. If quote = TRUE, then utf8_width returns the widths of the quoted values (enclosing the argument in double quotes, and replacing internal quotes with \").
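To make the interaction between encode and quote concrete, here is a small sketch on illustrative inputs, assuming the functions are loaded from the utf8 package.

```r
library(utf8)  # assumption: the functions on this page are exported by utf8

w_plain  <- utf8_width("abc")                  # three narrow characters
w_quoted <- utf8_width("abc", quote = TRUE)    # plus the two enclosing quotes

# a newline is not printable, so the unencoded width is NA
w_raw_newline <- utf8_width("\n", encode = FALSE)

# with encode = TRUE (the default), the escaped form is measured instead
w_enc_newline <- utf8_width("\n")

c(w_plain, w_quoted, w_raw_newline, w_enc_newline)
```

Measuring with encode = FALSE is a quick way to detect elements that would need escaping before they can be printed.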

Value

For as_utf8, utf8_normalize, and utf8_encode, the result is a character object with the same attributes as x but with Encoding set to "UTF-8".

For utf8_print, the function returns x invisibly.

For utf8_valid or utf8_width, a logical or integer object, respectively, with the same names, dim, and dimnames as x.

Examples

# the second element is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# attempt to convert to UTF-8 (fails)
# NOT RUN {
as_utf8(x)
# }
y <- x
Encoding(y[2]) <- "latin1" # mark the correct encoding
as_utf8(y) # succeeds
#> [1] "façile" "façile" "façile"

# test for valid UTF-8
utf8_valid(x)
#> [1]  TRUE FALSE  TRUE

# normalize text
angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE

# encoding
utf8_encode(x)
#> [1] "façile"          "fa\\xe7ile"      "fa\\xc3\\xa7ile"

# formatting
utf8_format(x, chars = 3)
#> [1] "faç…"            "fa… "            "fa\\xe2\\x80\\xa6 "
utf8_format(x, chars = 3, justify = "centre", width = 10)
#> [1] "faç…"            "fa… "            "fa\\xe2\\x80\\xa6 "
utf8_format(x, chars = 3, justify = "right")
#> [1] "…ile"            "…ile"            "\\xe2\\x80\\xa6ile"

# get widths
utf8_width(x)
#> [1]  6  9 13
utf8_width(x, encode = FALSE)
#> [1]  6 NA  6
utf8_width('"')
#> [1] 1
utf8_width('"', quote = TRUE)
#> [1] 4

# printing (assumes that output is capable of displaying Unicode 10.0.0)
print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606\U0001f607\U0001f608\U0001f609\U0001f60a\U0001f60b\U0001f60c\U0001f60d\U0001f60e\U0001f60f\U0001f610\U0001f611\U0001f612\U0001f613\U0001f614\U0001f615\U0001f616\U0001f617\U0001f618\U0001f619\U0001f61a\U0001f61b\U0001f61c\U0001f61d\U0001f61e\U0001f61f\U0001f620\U0001f621\U0001f622\U0001f623\U0001f624\U0001f625\U0001f626\U0001f627\U0001f628\U0001f629\U0001f62a\U0001f62b\U0001f62c\U0001f62d\U0001f62e\U0001f62f\U0001f630\U0001f631\U0001f632\U0001f633\U0001f634\U0001f635\U0001f636\U0001f637\U0001f638\U0001f639\U0001f63a\U0001f63b\U0001f63c\U0001f63d\U0001f63e\U0001f63f\U0001f640\U0001f641\U0001f642\U0001f643\U0001f644\U0001f645\U0001f646\U0001f647\U0001f648\U0001f649\U0001f64a\U0001f64b\U0001f64c\U0001f64d\U0001f64e\U0001f64f"
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣…"
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

# in C locale, output ASCII (same results on all platforms)
oldlocale <- Sys.getlocale("LC_CTYPE")
invisible(Sys.setlocale("LC_CTYPE", "C")) # switch to C locale
utf8_print(intToUtf8(0x1F600 + 0:79))
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606..."
invisible(Sys.setlocale("LC_CTYPE", oldlocale)) # switch back to old locale
