Read data from a file in newline-delimited JavaScript Object Notation (NDJSON) format.
read_ndjson(file, mmap = FALSE, simplify = TRUE, text = NULL)
| Argument | Description |
|---|---|
| file | the name of the file which the data are to be read from, or a connection (unless mmap is TRUE). |
| mmap | whether to memory-map the file instead of reading all of its data into memory simultaneously. See the ‘Memory mapping’ section. |
| simplify | whether to attempt to simplify the type of the return value. For example, if each line of the file stores an integer, if simplify is TRUE then the return value is an integer vector rather than a corpus_json object. |
| text | a character vector of string fields to interpret as text instead of character. |
This function is the recommended means of reading data for processing by the corpus package.
When the text argument is non-NULL, string data
fields with names indicated by this argument are decoded as
text values, not as character values.
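The effect of the text argument can be sketched with a small temporary file. The field names "title" and "text" below are illustrative assumptions, not names required by the package; the class check relies on the corpus_text class named elsewhere on this page.

```r
library(corpus)

# Hypothetical two-record file; "title" and "text" are illustrative field names.
file <- tempfile()
writeLines(c('{"title": "First", "text": "A short document."}',
             '{"title": "Second", "text": "Another short document."}'),
           file)

# Only the field named in the text argument is decoded as a text value;
# the other string field stays an ordinary character vector.
data <- read_ndjson(file, text = "text")
inherits(data$text, "corpus_text")
is.character(data$title)

# mmap is FALSE here, so the file can be removed right away.
invisible(file.remove(file))
```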
When you specify mmap = TRUE, the function memory-maps the file
instead of reading it into memory directly. In this case, the file
argument must be a character string giving the path to the file, not
a connection object. When you memory-map the file, the operating
system reads data into memory only when it is needed, enabling
you to transparently process large data sets that do not fit into
memory.
In terms of memory usage, setting mmap = TRUE reduces the
footprint for corpus_json and corpus_text objects;
native R objects (character, integer, list,
logical, and numeric) get fully deserialized to
memory and produce identical results regardless of whether
mmap is TRUE or FALSE. To process a large
text corpus with a text field named "text", you should set
text = "text" and mmap = TRUE. Or, to reduce the memory
footprint even further, set simplify = FALSE and
mmap = TRUE.
One danger in memory-mapping is that if you delete the file
after calling read_ndjson but before processing the data, then
the results will be undefined, and your computer may crash. (On
POSIX-compliant systems like Mac OS and Linux, there should be no
ill effects to deleting the file. On recent versions of Windows,
the system will not allow you to delete the file as long as the data
is active.)
Another danger in memory-mapping is that if you serialize a
corpus_json object or derived corpus_text object using
saveRDS or another similar function, and then you
deserialize the object, R will attempt to create a new memory-map
using the file argument passed to the original read_ndjson
call. If file is a relative path, then your working directory
at the time of deserialization must agree with your working directory
at the time of the read_ndjson call. You can avoid this
situation by specifying an absolute path as the file argument
(the normalizePath function will convert a relative path
to an absolute path).
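A minimal sketch of the absolute-path approach follows, assuming a small temporary file with an illustrative "text" field. The variable names (path, rds, restored, restored_text) are hypothetical; the calls to normalizePath, saveRDS, and readRDS are base R.

```r
library(corpus)

# Hypothetical input file with an illustrative "text" field.
file <- tempfile()
writeLines(c('{"text": "A first document."}',
             '{"text": "A second document."}'),
           file)

# Resolve the path before reading, so that a later readRDS() does not
# depend on the working directory at the time of deserialization.
path <- normalizePath(file)
data <- read_ndjson(path, mmap = TRUE, text = "text")

# Round-trip through serialization; the restored object maps the same file.
rds <- tempfile(fileext = ".rds")
saveRDS(data, rds)
restored <- readRDS(rds)
restored_text <- as.character(restored$text)

# Release the memory-maps before deleting the underlying file.
rm("data", "restored")
invisible(gc())
invisible(file.remove(file, rds))
```

The serialized object still depends on the original file, so the file must exist at the same path when the object is restored.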
In the default usage, with argument simplify = TRUE, when
the lines of the file are records (JSON object literals), the
return value from read_ndjson is a data frame with class
c("corpus_frame", "data.frame"). With simplify = FALSE,
the result is a corpus_json object.
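The two return types described above can be compared directly. This is a short sketch on a hypothetical two-record file; the variable names frame and raw are illustrative.

```r
library(corpus)

# Hypothetical file in which every line is a JSON record.
file <- tempfile()
writeLines(c('{"a": 1, "b": true}',
             '{"a": 2, "b": false}'),
           file)

# simplify = TRUE is the default: records become a data frame.
frame <- read_ndjson(file)
class(frame)

# simplify = FALSE keeps the data as a corpus_json object.
raw <- read_ndjson(file, simplify = FALSE)
class(raw)

# mmap is FALSE here, so the file can be removed right away.
invisible(file.remove(file))
```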
See also: as_corpus_text, as_utf8.
# Memory mapping
lines <- c('{ "a": 1, "b": true }',
           '{ "b": false, "nested": { "c": 100, "d": false }}',
           '{ "a": 3.14, "nested": { "d": true }}')
file <- tempfile()
writeLines(lines, file)
(data <- read_ndjson(file, mmap = TRUE))
#>      a     b nested.c nested.d
#> 1 1.00  TRUE       NA       NA
#> 2   NA FALSE      100    FALSE
#> 3 3.14    NA       NA     TRUE
data$a
#> [1] 1.00   NA 3.14
data$b
#> [1]  TRUE FALSE    NA
data$nested.c
#> [1]  NA 100  NA
data$nested.d
#> [1]    NA FALSE  TRUE
rm("data")
invisible(gc()) # force the garbage collector to release the memory-map
file.remove(file)
#> [1] TRUE