- Implement `length<-` for `corpus_text` objects.
- Added `str()` method for `corpus_text` objects. Currently just a minimal implementation; this may change in the future.
- Add ANSI styling to `corpus_frame` objects.
- Remove `text_length()`. Use `text_ntoken()` instead.
- Remove `as_utf8()`, `utf8_valid()`, `utf8_normalize()`, `utf8_encode()`, `utf8_format()`, `utf8_print()`, and `utf8_width()`; these functions are in the utf8 package now.
- Fix bug in `print.corpus_frame()` when `row.names = FALSE`.
- Fix failing test on R-devel.
- Fix failing tests on testthat 2.0.0.
- Remove `weights` argument from `term_stats()` and `term_matrix()`.
- Allow user-supplied stemming functions in `text_filter()`.
- Add `new_stemmer()` function to make a stemming function from a set of (term, stem) pairs.
- Add `stem_snowball()` function for the Snowball stemming algorithms (similar to `SnowballC::wordStem()`, but only stemming “letter” tokens, not “number”, “punct”, or “symbol”).
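
  For illustration, a minimal sketch of both stemming entry points (not part of the original notes; the example words are invented, and the documented `new_stemmer(term, stem)` and `stem_snowball(x, algorithm)` signatures are assumed):

  ```r
  library(corpus)

  # Build a stemmer from explicit (term, stem) pairs and apply it via
  # the 'stemmer' filter property.
  stem <- new_stemmer(c("running", "ran", "runs"), c("run", "run", "run"))
  text_tokens("She runs while he ran.", stemmer = stem)

  # stem_snowball() stems a character vector of terms; non-letter
  # tokens like "100" or ":)" pass through unchanged.
  stem_snowball(c("winning", "100", ":)"), algorithm = "en")
  ```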
- Apply filter combine rules before stemming rather than after.
- Remove dropped tokens rather than replace them with `NA`.
- Replace white-space in types with the connector (`_`).
- Switch to `"radix"` sort algorithm for consistent, fast term ordering on all platforms, regardless of locale.
- Set `combine = NULL` by default for text filters.
- Make `map_quote` only change apostrophe and single quote characters, not double quote.
- Deprecate `text_length()` function in favor of `text_ntoken()`.
- Removed deprecated functions `abbreviations()`, `as_corpus()`, `as_text()`, `corpus()`, `is_corpus()`, `is_text()`, `stopwords()`, `term_frame()`.
- Removed deprecated `random` argument from `text_locate()`.
- New package website, http://corpustext.com
- Add support for tm `Corpus` and quanteda `corpus` objects; all functions expecting text (`text_tokens()`, `term_matrix()`, etc.) should work seamlessly on these objects.
- Add `gutenberg_corpus()` for downloading a corpus from Project Gutenberg.
- Add `...` arguments to all text functions, for overriding individual `text_filter()` properties.
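
  For illustration, a minimal sketch (not part of the original notes; example text invented, `map_case` and `drop_punct` are documented filter properties):

  ```r
  library(corpus)

  # Override individual filter properties for a single call instead of
  # constructing a full text_filter object.
  text_tokens("The QUICK brown fox!", map_case = FALSE)
  text_ntoken("The quick brown fox!", drop_punct = TRUE)
  ```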
- Add `sentiment_afinn`, the AFINN sentiment lexicon.
- Add `text_sample()` for getting a random sample of term instances.
- Add `na.omit()`, `na.exclude()`, `na.fail()` implementations for `corpus_frame` and `corpus_text`.
- Switch `as_utf8()` default argument to `normalize = FALSE`.
- Re-order `as_corpus_text()` and `as_corpus_frame()` arguments; make both accept `...` arguments to override individual text filter properties.
- Add missing single-letter initials to English abbreviation list.
- Adaptively increase buffer size for `read_ndjson()` so that large files can be read quickly.
- Make `summary()` on a `corpus_text` object report statistics for the number of tokens and types.
- Switch to 2-letter language codes for stemming algorithms.
- Fix bug in `utf8_normalize()` when the input contains a backslash (`\`).
- Fix bug in `term_matrix()` column names (non-ASCII names were getting replaced by Unicode escapes).
- Work around R Windows bug in converting native to UTF-8; described at https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 .
- Make comparison operations on text vectors keep names if arguments have them.
- Renamed `corpus()`, `as_corpus()`, and `is_corpus()` to `corpus_frame()`, `as_corpus_frame()`, and `is_corpus_frame()` to avoid name clashes with other packages.
- Renamed `as_text()` and `is_text()` to `as_corpus_text()` and `is_corpus_text()` to avoid name clashes with other packages.
- Rename `term_frame()` to `term_counts()`.
- Deprecate `text_locate()` `random` argument; use `text_sample()` instead.
- Remove old deprecated `term_counts()` function; use `term_stats()` instead.
- Deprecate `abbreviations()` and `stopwords()` functions in favor of data objects: `abbreviations_en`, `stopwords_en`, `stopwords_fr`, etc.
- Fix buffer overrun for case folding some Greek letters.
- Fix memory leak in `read_ndjson()`.
- Fix memory leak in JSON object deserialization.
- Fix memory leak in `term_stats()`.
- Add `corpus()`, `as_corpus()`, `is_corpus()` functions.
- Make `text_split()` split into evenly-sized blocks, with the `size` argument specifying the maximum block size.
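
  For illustration, a minimal sketch (not part of the original notes; example text invented):

  ```r
  library(corpus)

  # Split into blocks of at most two sentences; text_split() balances
  # the blocks so none is much larger than the others.
  text_split("I came. I saw. I conquered.", units = "sentences", size = 2)
  ```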
- Added `text_stats()` function.
- Added `text_match()` function to return matching terms as a factor.
- Implemented text subset assignment operators `[<-` and `[[<-`.
- Added `utf8_normalize()` function for translating to NFC normal form, applying case and compatibility maps.
- Added `text_sub()` for getting token sub-sequences.
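
  For illustration, a minimal sketch (not part of the original notes; example text invented, documented `text_sub(x, start, end)` interface assumed):

  ```r
  library(corpus)

  # Extract the sub-sequence running from token 2 through token 3.
  text_sub("one two three four", start = 2, end = 3)
  ```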
- Added `text_length()` for text length, including `NA` tokens.
- Add new vignette, “Introduction to corpus”.
- Add `random` argument to `text_locate()` for random order.
- Change `format.corpus_frame()` to use elastic column widths for text.
- Allow `rows = -1` for `print.corpus_frame()` to print all rows.
- Following quanteda, add “will” to the English stop word list.
- Add special handling for hyphens so that, for example, “world-wide” is a single token (but “-world-wide-” is three tokens).
- Merged “url” and “symbol” word categories. Removed “other” word category (ignore these characters).
- Change stemmer so that it only modifies tokens of kind “letter”, preserving “number”, “symbol”, “url”, etc.
- Switched to more efficient `c.corpus_text()` function.
- Make `text_locate()` return “text” column as a factor.
- Constrain text `names()` to be unique, non-missing.
- Added “names” argument to `as_text()` for overriding default names.
- Added checks for underflow in `read_ndjson()` double deserialization.
- Fixed bug in `text_filter<-` where assignment did not make a deep copy of the object.
- Fixed bug in `utf8_format()`, `utf8_print()`, `utf8_width()` where internal double quotes were not escaped.
- Fixed rchk, UBSAN warnings.
- Renamed `term_counts()` to `term_stats()`.
- Removed deprecated functions `token_filter()` and `sentence_filter()`.
- Removed `term` column from `text_locate()` output.
- Removed `map_compat` option from `text_filter()`; use `utf8_normalize()` instead if you need to apply compatibility maps.
- Added `text_filter()` generic.
- Added `text_filter<-()` setter for text vectors.
- Use a text’s `text_filter()` attribute as a default in all `text_*` functions expecting a filter argument.
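
  For illustration, a minimal sketch using the package’s current function names (not part of the original notes; filter settings invented):

  ```r
  library(corpus)

  # Attach a filter to the text object itself; text_* functions then
  # use it as the default when no 'filter' argument is supplied.
  x <- as_corpus_text("The QUICK brown fox!")
  text_filter(x) <- text_filter(map_case = FALSE, drop_punct = TRUE)
  text_tokens(x)  # picks up the filter stored on x
  ```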
- Added the Federalist Papers dataset (`federalist`).
- Added functions for validating and converting to UTF-8: `as_utf8()`, `utf8_valid()`.
- Added functions for formatting and printing UTF-8 text: `utf8_encode()`, `utf8_format()`, `utf8_print()`, `utf8_valid()`, and `utf8_width()`.
- Handle @mentions, #hashtags, and URLs in word tokenizer.
- `term_counts()` now reports the `support` for each term (the number of texts containing the term), and has options for restricting output by the minimum and maximum support.
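
  For illustration, a minimal sketch using the function’s later name, `term_stats()` (not part of the original notes; example texts invented):

  ```r
  library(corpus)

  # 'support' is the number of texts containing a term; keep only
  # terms appearing in at least two texts.
  term_stats(c("a rose is a rose", "a daisy"), min_support = 2)
  ```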
- Added new class `corpus_frame` to support better printing of data frame objects: left-align text data, truncate output to screen width, display emoji on Mac OS. Use this class for all data frame return values.
- Added a “unicode” vignette.
- Converted the “chinese” demo to a vignette. Thanks to Will Lowe for contributing.
- Make `text_split()` and `term_frame()` return parent text as a factor.
- Remove `stringsAsFactors` option from `read_ndjson()`; deserialize all JSON string fields as character by default.
- `read_ndjson()` de-serializes zero-length arrays as `integer()`, `logical()`, etc. instead of as `NULL`.
- Allow user interrupts (control-C) in all long-running C computations.
- Fix bug where `read_ndjson()` would de-serialize a boolean `null` as `FALSE` instead of `NA`.
- Add `text_locate()`, for searching for specific terms in text, reporting the contexts of the occurrences (“key words in context”).
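
  For illustration, a minimal sketch (not part of the original notes; example texts and search term invented):

  ```r
  library(corpus)

  # Report each occurrence of the term along with the text it came
  # from and its left and right context.
  text_locate(c("The cat sat.", "A cat and a dog."), "cat")
  ```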
- Add `text_count()` and `text_detect()` for counting term occurrences or checking for a term’s existence in a text.
- Add `text_types()` and `text_ntype()` for returning the unique types in a text, or counting types.
- Add `text_nsentence()` for counting sentences.
- Add `term_frame()`, reporting term frequencies as a data frame with columns `"text"`, `"term"`, and `"count"`.
- Add `transpose` argument to `term_matrix()`.
- Add new version of `format.corpus_text()` that is faster and aware of character widths, in particular Emoji and East Asian character widths.
- Normalize token filter `combine`, `drop`, `drop_except`, `stem_except` arguments, to allow passing cased versions of these arguments.
- Set `combine = abbreviations("english")` by default.
- Rename `tokens()` to `text_tokens()` for consistency; add `text_ntoken()`.
- Rename `term_counts()` `min` and `max` arguments to `min_count` and `max_count`.
"u.s"
(a unigram) stems to "u.s"
(a bigram), and then causes for term_matrix()
select. Thanks to Dmitriy Selivanov for reporting: https://github.com/patperry/r-corpus/issues/3 .Add ngrams
options for term_counts()
and term_matrix()
.
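
  For illustration, a minimal sketch using the function’s later name, `term_stats()` (not part of the original notes; example text invented):

  ```r
  library(corpus)

  # Tabulate bigram (2-gram) frequencies.
  term_stats("one two three two three", ngrams = 2)
  ```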
- Add sentence break suppressions (special handling for abbreviations); the default behavior for `text_split()` with `units = "sentences"` is to use a set of English abbreviations as suppressions.
- Add option to treat CR and LF like spaces when determining sentence boundaries; this is now the default.
- Add `term_counts()` `min` and `max` options for excluding terms with counts below or above specified limits.
- Add `term_counts()` `limit` option to limit the number of reported terms.
- Add `term_counts()` `types` option for reporting the types that make up a term.
- Add `abbreviations()` function with abbreviation lists for English, French, German, Italian, Portuguese, and Russian (from the Unicode Common Locale Data Repository).
- Add more refined control over `token_filter()` drop categories: merged `"kana"` and `"ideo"` into `"letter"`; split off `"punct"`, `"mark"`, and `"other"` from `"symbol"`.
- Rename `text_filter()` to `token_filter()`.
- Remove `select` argument from `token_filter()`, but add `select` to `term_matrix()` arguments.
- Replace `sentences()` function with `text_split()`, which has options for breaking into multi-sentence blocks or multi-token blocks.
- Remove `remove_control`, `map_dash`, and `remove_space` type normalization options from `text_filter()`.
- Remove `ignore_empty` token filter option.
Rename "text"
class to "corpus_text"
to avoid name classes with grid. Thanks to Jeroen Ooms for reporting: https://github.com/patperry/corpus/issues/1
Rename "jsondata"
to "corpus_json"
for consistency.
- Fix bug in `read_ndjson()` for reading factors with missing values.
- Add `term_counts()` function to tabulate term frequencies.
- Add `term_matrix()` function to compute a term frequency matrix.
- Add `text_filter()` option (`stem_except`) to exempt specific terms from stemming.
- Add `text_filter()` option (`drop`) to drop specific terms, along with option (`drop_except`) to exempt specific terms from dropping.
- Add `text_filter()` option (`combine`) to combine multi-word phrases like “new york city” into a single term.
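
  For illustration, a minimal sketch (not part of the original notes; example phrase and text invented):

  ```r
  library(corpus)

  # Treat the phrase "new york city" as a single term when tokenizing.
  f <- text_filter(combine = "new york city")
  text_tokens("I visited New York City last week.", filter = f)
  ```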
- Add `text_filter()` option (`select`) to select specific terms (excluding all words that are not on this list).
- Add `stopwords()` function.
- Make `read_ndjson()` decode JSON strings as character or factor (according to whether `stringsAsFactors` is `TRUE`) except for fields named `"text"`, which get decoded as text objects.
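
  For illustration, a minimal sketch (not part of the original notes; file contents invented):

  ```r
  library(corpus)

  # Write two NDJSON records to a temporary file and read them back;
  # the field named "text" is decoded as a corpus text object.
  file <- tempfile(fileext = ".jsonl")
  writeLines(c('{"text": "Hello!", "lang": "en"}',
               '{"text": "Bonjour!", "lang": "fr"}'), file)
  data <- read_ndjson(file)
  data$text
  ```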
- Rename `text_filter()` options `fold_case`, `fold_dash`, `fold_quote` to `map_case`, `map_dash`, `map_quote`.
- Allow `read_ndjson()` to read from connections, not just files, by reading the file contents into memory first. Use this by default instead of memory mapping.
- Add `text_filter()` options `drop_symbol`, `drop_number`, `drop_letter`, `drop_kana`, and `drop_ideo`; these options replace the matched tokens with `NA`.
- Rename `text_filter()` option `drop_empty` to `ignore_empty`.
- Add support for serializing dataset and text objects via `readRDS()` and other native routines. Unfortunately, this support doesn’t come for free, and the objects take a little bit more memory.
- Add support for stemming via the Snowball library.
- More convenient interface for accessing JSON arrays.
- Make `read_ndjson()` return a data frame by default, not a `"jsondata"` object.
- Rename `as.text()`/`is.text()` to `as_text()`/`is_text()`; make `as_text()` retain names, work on S3 objects.
- Rename `read_json()` to `read_ndjson()` to not clash with jsonlite.
- Rename `"dataset"` type to `"jsondata"`.