Read data from a file in newline-delimited JavaScript Object Notation (NDJSON) format.
read_ndjson(file, mmap = FALSE, simplify = TRUE, text = NULL)
file | the name of the file which the data are to be read from, or a connection (unless mmap is TRUE). |
---|---|
mmap | whether to memory-map the file instead of reading all of its data into memory simultaneously. See the ‘Memory mapping’ section. |
simplify | whether to attempt to simplify the type of the return value. For example, if each line of the file stores an integer, then with simplify = TRUE the return value is an integer vector rather than a corpus_json object. |
text | a character vector of string fields to interpret as text instead of character. |
This function is the recommended means of reading data for processing by the corpus package.
When the text argument is non-NULL, string data fields with names indicated by this argument are decoded as text values, not as character values.
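As an illustration of the text argument, here is a minimal sketch. It assumes the corpus package is installed, and the sample records and the field names "id" and "text" are made up for the example. The class name corpus_text is the one this page uses for text values.

```r
# Hypothetical sample data: two records with a string field named "text".
lines <- c('{ "id": 1, "text": "A rose is a rose." }',
           '{ "id": 2, "text": "To be or not to be." }')
path <- tempfile(fileext = ".ndjson")
writeLines(lines, path)

# Assumption: the corpus package is available; otherwise this part is skipped.
if (requireNamespace("corpus", quietly = TRUE)) {
    # Default (text = NULL): string fields are decoded as character values.
    plain <- corpus::read_ndjson(path)
    is_chr <- is.character(plain$text)

    # With text = "text", the field named "text" is decoded as a text value.
    typed <- corpus::read_ndjson(path, text = "text")
    is_txt <- inherits(typed$text, "corpus_text")
}

unlink(path)
```

Other string fields, if the records had any, would not be named in text and so would still be decoded as character values.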
When you specify mmap = TRUE, the function memory-maps the file instead of reading it into memory directly. In this case, the file argument must be a character string giving the path to the file, not a connection object. When you memory-map the file, the operating system reads data into memory only when it is needed, enabling you to transparently process large data sets that do not fit into memory.
In terms of memory usage, setting mmap = TRUE reduces the footprint for corpus_json and corpus_text objects; native R objects (character, integer, list, logical, and numeric) get fully deserialized to memory and produce identical results regardless of whether mmap is TRUE or FALSE. To process a large text corpus with a text field named "text", you should set text = "text" and mmap = TRUE. Or, to reduce the memory footprint even further, set simplify = FALSE and mmap = TRUE.
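The recommended settings for a large corpus might be used as in the following sketch. It assumes the corpus package is installed, and a tiny temporary file with made-up records stands in for a large corpus on disk.

```r
# Hypothetical stand-in for a large corpus file with a "text" field.
path <- tempfile(fileext = ".ndjson")
writeLines(c('{ "title": "One", "text": "First document." }',
             '{ "title": "Two", "text": "Second document." }'), path)

# Assumption: the corpus package is available; otherwise this part is skipped.
if (requireNamespace("corpus", quietly = TRUE)) {
    # Memory-map the file and decode the "text" field as a text value.
    # With mmap = TRUE, the file argument must be a path, not a connection.
    data <- corpus::read_ndjson(path, mmap = TRUE, text = "text")
    n <- nrow(data)

    # Drop the object and run the garbage collector so that the
    # memory map is released before the file is deleted.
    rm(data)
    invisible(gc())
}

file.remove(path)
```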
One danger in memory-mapping is that if you delete the file after calling read_ndjson but before processing the data, then the results will be undefined, and your computer may crash. (On POSIX-compliant systems like macOS and Linux, there should be no ill effects to deleting the file. On recent versions of Windows, the system will not allow you to delete the file as long as the data is active.)
Another danger in memory-mapping is that if you serialize a corpus_json object or derived corpus_text object using saveRDS or another similar function, and then you deserialize the object, R will attempt to create a new memory-map using the file argument passed to the original read_ndjson call. If file is a relative path, then your working directory at the time of deserialization must agree with your working directory at the time of the read_ndjson call. You can avoid this situation by specifying an absolute path as the file argument (the normalizePath function will convert a relative path to an absolute path).
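The absolute-path advice can be sketched as follows. The file name example.ndjson is hypothetical, and the read_ndjson part assumes the corpus package is installed. The path handling itself uses only base R.

```r
# Create a small file and refer to it by a relative path.
old_wd <- setwd(tempdir())
rel_path <- "example.ndjson"   # hypothetical file name
writeLines('{ "text": "Hello." }', rel_path)

# Resolve the relative path while the working directory still matches.
abs_path <- normalizePath(rel_path)
setwd(old_wd)

# The absolute path stays valid after the working directory changes.
still_found <- file.exists(abs_path)

# Assumption: the corpus package is available; otherwise this part is skipped.
if (requireNamespace("corpus", quietly = TRUE)) {
    data <- corpus::read_ndjson(abs_path, mmap = TRUE, simplify = FALSE)

    # Serialize and deserialize; the restored object refers to abs_path,
    # so it does not depend on the current working directory.
    rds <- tempfile(fileext = ".rds")
    saveRDS(data, rds)
    restored <- readRDS(rds)
    restored_ok <- inherits(restored, "corpus_json")

    # Release the memory maps before deleting the files.
    rm(data, restored)
    invisible(gc())
    unlink(rds)
}

unlink(abs_path)
```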
In the default usage, with argument simplify = TRUE, when the lines of the file are records (JSON object literals), the return value from read_ndjson is a data frame with class c("corpus_frame", "data.frame"). With simplify = FALSE, the result is a corpus_json object.
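The two kinds of return value can be compared with a short sketch. It assumes the corpus package is installed, and the sample records are made up for the example.

```r
# Hypothetical sample data: two JSON records.
path <- tempfile(fileext = ".ndjson")
writeLines(c('{ "a": 1, "b": true }',
             '{ "a": 2, "b": false }'), path)

# Assumption: the corpus package is available; otherwise this part is skipped.
if (requireNamespace("corpus", quietly = TRUE)) {
    # simplify = TRUE (the default): records are returned as a data frame.
    df <- corpus::read_ndjson(path)
    df_class <- class(df)

    # simplify = FALSE: the result is kept as a corpus_json object.
    js <- corpus::read_ndjson(path, simplify = FALSE)
    js_is_json <- inherits(js, "corpus_json")
}

unlink(path)
```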
See also: as_corpus_text, as_utf8.
# Memory mapping
lines <- c('{ "a": 1, "b": true }',
           '{ "b": false, "nested": { "c": 100, "d": false }}',
           '{ "a": 3.14, "nested": { "d": true }}')
file <- tempfile()
writeLines(lines, file)
(data <- read_ndjson(file, mmap = TRUE))
#>      a     b nested.c nested.d
#> 1 1.00  TRUE       NA       NA
#> 2   NA FALSE      100    FALSE
#> 3 3.14    NA       NA     TRUE
data$a
#> [1] 1.00   NA 3.14
data$b
#> [1]  TRUE FALSE    NA
data$nested.c
#> [1]  NA 100  NA
data$nested.d
#> [1]    NA FALSE  TRUE
rm("data")
invisible(gc()) # force the garbage collector to release the memory-map
file.remove(file)
#> [1] TRUE