UTF-8 text validation, normalization, formatting, and printing.

as_utf8(x, normalize = FALSE)

utf8_valid(x)

utf8_normalize(x, map_case = FALSE, map_compat = FALSE,
               map_quote = FALSE, remove_ignorable = FALSE)

utf8_encode(x, display = FALSE)

utf8_format(x, trim = FALSE, chars = NULL, justify = "left",
            width = NULL, na.encode = TRUE, quote = FALSE,
            na.print = NULL, print.gap = NULL, ...)

utf8_print(x, chars = NULL, quote = TRUE, na.print = NULL,
           print.gap = NULL, right = FALSE, max = NULL,
           display = TRUE, ...)

utf8_width(x, encode = TRUE, quote = FALSE)

Arguments

x

character object.

normalize

a logical value indicating whether to convert to Unicode composed normal form (NFC).

map_case

a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_compat

a logical value indicating whether to apply the Unicode compatibility mappings, that is, the mappings required for the NFKC and NFKD normal forms.

map_quote

a logical value indicating whether to replace curly single quotes and Unicode apostrophe characters with ASCII apostrophe (U+0027).

remove_ignorable

a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.

display

logical scalar indicating whether to optimize the encoding for display, not byte-for-byte data transmission.

trim

logical scalar indicating whether to suppress padding spaces around elements.

chars

integer scalar indicating the maximum number of character units to display. Wide characters like emoji take two character units; combining marks and default ignorables take none. Longer strings get truncated and suffixed or prefixed with an ellipsis ("..." in C locale, "\u2026" in others). Set to NULL to limit output to the line width as determined by getOption("width").

justify

justification; one of "left", "right", "centre", or "none". Can be abbreviated.

width

the minimum field width; set to NULL or 0 for no restriction.

na.encode

logical scalar indicating whether to encode NA values as character strings.

quote

logical scalar indicating whether to put surrounding double-quotes ('"') around character strings and escape internal double-quotes.

na.print

character string (or NULL) indicating the encoding for NA values. Ignored when na.encode is FALSE.

print.gap

non-negative integer (or NULL) giving the number of spaces in gaps between columns; set to NULL or 1 for a single space.

right

logical scalar indicating whether to right-justify character strings.

max

non-negative integer (or NULL) indicating the maximum number of elements to print; when NULL, the value of getOption("max.print") is used.

encode

logical scalar indicating whether to encode the object (via utf8_encode) before measuring its width.

...

further arguments passed from other methods. Ignored.

Details

as_utf8 converts a character object from its declared encoding to a valid UTF-8 character object, or throws an error if no conversion is possible. If normalize = TRUE, then the text gets transformed to Unicode composed normal form (NFC) after conversion to UTF-8.
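In a script, the error raised by as_utf8 can be anticipated by checking utf8_valid first, or captured with tryCatch. The following is a minimal sketch, not a prescribed workflow; the library(utf8) call is an assumption (attach whichever package provides this help page), and the variable names are illustrative.

```r
library(utf8)  # assumption: the package that exports as_utf8 and utf8_valid

# the second element holds a latin-1 byte but is declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile")
Encoding(x) <- c("UTF-8", "UTF-8")

# find out which elements can be translated to valid UTF-8
ok <- utf8_valid(x)

# as_utf8 throws an error on invalid input; capture it instead of stopping
converted <- tryCatch(as_utf8(x), error = function(e) NULL)

# once the true encoding is declared, the conversion goes through
Encoding(x[2]) <- "latin1"
fixed <- as_utf8(x)
```

Here converted is NULL because the first attempt fails, while fixed holds two valid UTF-8 strings.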

utf8_valid tests whether the elements of a character object can be translated to valid UTF-8 strings.

utf8_normalize converts the elements of a character object to Unicode composed normal form (NFC) while applying the character maps specified by the map_case, map_compat, map_quote, and remove_ignorable arguments.
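The effect of each character map is easiest to see on short inputs. The following is a minimal sketch with one call per map; the library(utf8) call is an assumption (attach whichever package provides this help page), and the input strings are illustrative choices rather than examples from the package.

```r
library(utf8)  # assumption: the package that exports utf8_normalize

# map_case: uppercase letters are mapped to their lowercase equivalents
folded <- utf8_normalize("ABC", map_case = TRUE)

# map_compat: compatibility mapping, e.g. the ligature U+FB01 becomes "f" + "i"
compat <- utf8_normalize("\ufb01", map_compat = TRUE)

# map_quote: the curly single quote U+2019 becomes ASCII apostrophe (U+0027)
quoted <- utf8_normalize("don\u2019t", map_quote = TRUE)

# remove_ignorable: the soft hyphen U+00AD (a default ignorable) is dropped
cleaned <- utf8_normalize("co\u00adop", remove_ignorable = TRUE)

print(c(folded, compat, quoted, cleaned))
```

The maps can be combined in a single call; they are shown separately here only to isolate what each one does.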

utf8_encode encodes a character object for printing on a UTF-8 device by escaping control characters and other non-printable characters. When display = TRUE, the function optimizes the encoding for display by removing default ignorable characters (soft hyphens, zero-width spaces, etc.) and placing zero-width spaces after wide emoji. When LC_CTYPE = "C", the function escapes all non-ASCII characters and gives the same results on all platforms.

utf8_format formats a character object for printing, optionally truncating long character strings.

utf8_print prints a character object after formatting it with utf8_format.

utf8_width returns the printed widths of the elements of a character object on a UTF-8 device or, when LC_CTYPE = "C", on an ASCII device. If the string is not printable on the device, for example if it contains a control code like "\n", then the result is NA. If encode = TRUE, the default, then the function returns the widths of the encoded elements (via utf8_encode); otherwise, the function returns the widths of the original elements. If quote = TRUE, then utf8_width returns the widths of the quoted values (enclosing the argument in double quotes, and replacing internal quotes with \").

Value

For as_utf8, utf8_normalize, and utf8_encode, the result is a character object with the same attributes as x but with Encoding set to "UTF-8".

For utf8_print, the function returns x invisibly.

For utf8_valid and utf8_width, the result is a logical or integer object, respectively, with the same names, dim, and dimnames as x.

See also

as_corpus_text, iconv.

Examples

# the second element is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
# attempt to convert to UTF-8 (fails)
# NOT RUN {
as_utf8(x)
# }
y <- x
Encoding(y[2]) <- "latin1" # mark the correct encoding
as_utf8(y) # succeeds
#> [1] "façile" "façile" "façile"
# test for valid UTF-8
utf8_valid(x)
#> [1] TRUE FALSE TRUE
# normalize text
angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE
# encoding
utf8_encode(x)
#> [1] "façile" "fa\\xe7ile" "fa\\xc3\\xa7ile"
# formatting
utf8_format(x, chars = 3)
#> [1] "faç…" "fa… " "fa\\xe2\\x80\\xa6 "
utf8_format(x, chars = 3, justify = "centre", width = 10)
#> [1] "faç…" "fa… " "fa\\xe2\\x80\\xa6 "
utf8_format(x, chars = 3, justify = "right")
#> [1] "…ile" "…ile" "\\xe2\\x80\\xa6ile"
# get widths
utf8_width(x)
#> [1] 6 9 13
utf8_width(x, encode = FALSE)
#> [1] 6 NA 6
utf8_width('"')
#> [1] 1
utf8_width('"', quote = TRUE)
#> [1] 4
# printing (assumes that output is capable of displaying Unicode 10.0.0)
print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606\U0001f607\U0001f608\U0001f609\U0001f60a\U0001f60b\U0001f60c\U0001f60d\U0001f60e\U0001f60f\U0001f610\U0001f611\U0001f612\U0001f613\U0001f614\U0001f615\U0001f616\U0001f617\U0001f618\U0001f619\U0001f61a\U0001f61b\U0001f61c\U0001f61d\U0001f61e\U0001f61f\U0001f620\U0001f621\U0001f622\U0001f623\U0001f624\U0001f625\U0001f626\U0001f627\U0001f628\U0001f629\U0001f62a\U0001f62b\U0001f62c\U0001f62d\U0001f62e\U0001f62f\U0001f630\U0001f631\U0001f632\U0001f633\U0001f634\U0001f635\U0001f636\U0001f637\U0001f638\U0001f639\U0001f63a\U0001f63b\U0001f63c\U0001f63d\U0001f63e\U0001f63f\U0001f640\U0001f641\U0001f642\U0001f643\U0001f644\U0001f645\U0001f646\U0001f647\U0001f648\U0001f649\U0001f64a\U0001f64b\U0001f64c\U0001f64d\U0001f64e\U0001f64f"
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​…"
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​😬​😭​😮​😯​😰​😱​😲​😳​😴​😵​😶​😷​😸​😹​😺​😻​😼​😽​😾​😿​🙀​🙁​🙂​🙃​🙄​🙅​🙆​🙇​🙈​🙉​🙊​🙋​🙌​🙍​🙎​🙏​"
# in C locale, output ASCII (same results on all platforms)
oldlocale <- Sys.getlocale("LC_CTYPE")
invisible(Sys.setlocale("LC_CTYPE", "C")) # switch to C locale
utf8_print(intToUtf8(0x1F600 + 0:79))
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606..."
invisible(Sys.setlocale("LC_CTYPE", oldlocale)) # switch back to old locale