-
Added alt text to figures in vignettes and README (#233)
-
Update vignette for quanteda::dfm() v4 (#242)
- Fixed bug for FREX stm tidier (#228)
- hunspell is now a suggested dependency, thanks to @MichaelChirico (#221)
- Added
stm()
tidiers for high FREX and lift words (#223) - Removed tweet-specific tokenizers because of changes in upstream dependencies (#227)
- Updated the tidy method for a quanteda
dfm
because of the upcoming release of Matrix (#218)
scale_x/y_reordered()
now uses a functionlabels
as its main input (#200)- Fixed how
to_lower
is passed to underlying tokenization function for character shingles (#208) - Added support for tidying STM models that use
content
, thanks to @jonathanvoelkle (#209)
- Update testing for rlang change + testthat 3e
- Check for installation of stopwords more gracefully
- Update tidiers and casters for new version of quanteda
- Use vdiffr conditionally
- Bug fix/breaking change for
collapse
argument tounnest_functions()
. This argument now takes eitherNULL
(do not collapse text across rows for tokenizing) or a character vector of variables (use said variables to collapse text across rows for tokenizing). This fixes a long-standing bug and provides more consistent behavior, but does change results for many situations (such as n-gram tokenization).
- Move one vignette to pkgdown site, because of dependency removal
- Move all CI from Travis to GH actions
reorder_within()
now handles multiple variables, thanks to @tmastny (#170)- Move stopwords to Suggests so tidytext can be installed on older versions of R
- Pass
to_lower
argument to other tokenizing functions, for more consistent behavior (#175) - Add
glance()
method for stm's estimated regressions, thanks to @vincentarelbundock (#176)
- Update tidying test for new tibble release (inner names for columns)
- Deprecate SE versions of main functions (have long been replaced by tidy eval semantics)
- Improve error handling throughout
- Wrapper tokenization functions for n-grams, characters, sentences, tweets, and more, thanks to @ColinFay (#137).
- Simplify get_sentiments() thanks to @jennybc (#151).
- Fix flaky tests for corpus tidiers.
- Access NRC lexicon via textdata package
- Fix bug in
augment()
function for stm topic model. - Warn when tf-idf is negative, thanks to @EmilHvitfeldt (#112).
- Switch from importing broom to importing generics, for lighter dependencies (#133).
- Add functions for reordering factors (such as for ggplot2 bar plots) thanks to @tmastny (#110).
- Update to
tibble()
where appropriate, thanks to @luisdza (#136). - Clarify documentation about impact of lowercase conversion on URLs (#139).
- Change how sentiment lexicons are accessed from package (remove NRC lexicon entirely, access AFINN and Loughran lexicons via textdata package so they are no longer included in this package).
- Improvements to documentation (#117)
- Fix for NSE thanks to @lepennec (#122).
- Tidier for estimated regressions from stm package thanks to @jefferickson (#115).
- Tidier for correlated topic model from topicmodels package (#123).
- Updates to documentation (#109) thanks to Emil Hvitfeldt.
- Add new tokenizers for tweets, Penn Treebank to
unnest_tokens()
. - Better error message (#111) and code styling.
- Declare dependency for tests.
- Updates to documentation (#102), README, and vignettes.
- Add tokenizing by character shingles thanks to Kanishka Misra (#105).
- Fix tests for skip grams thanks to Lincoln Mullen (#106).
- Updated more docs/tests so package can build on R-oldrel. (Still trying!)
unnest_tokens
can now unnest a data frame with a list column (which formerly threw the errorunnest_tokens expects all columns of input to be atomic vectors (not lists)
). The unnested result repeats the objects within each list. (It's still not possible whencollapse = TRUE
, in which tokens can span multiple lines).- Add
get_tidy_stopwords()
to obtain stopword lexicons in multiple languages in a tidy format. - Add a dataset
nma_words
of negators, modals, and adverbs that affect sentiment analysis (#55). - Updated various vignettes/docs/tests so package can build on R-oldrel.
- Change how
NA
values are handled inunnest_tokens
so they no longer cause other columns to becomeNA
(#82). - Update tidiers and casters to align with quanteda v1.0 (#87).
- Handle input/output object classes (such as
data.table
) consistently (#88).
- Fix tidier for quanteda dictionary for correct class (#71).
- Add a pkgdown site.
- Convert NSE from underscored function to tidyeval (
unnest_tokens
,bind_tf_idf
, all sparse casters) (#67, #74). - Added tidiers for topic models from the
stm
package (#51).
get_sentiments
now works regardless of whethertidytext
has been loaded or not (#50).unnest_tokens
now supports data.table objects (#37).- Fixed
to_lower
parameter inunnest_tokens
to work properly for all tokenizing options. - Updated
tidy.corpus
,glance.corpus
, tests, and vignette for changes to quanteda API - Removed the deprecated
pair_count
function, which is now in the in-development widyr package - Added tidiers for LDA models from the
mallet
package - Added the Loughran and McDonald dictionary of sentiment words specific to financial reports
unnest_tokens
preserves custom attributes of data frames and data.tables
- Updated DESCRIPTION to require purrr >= 0.1.1.
- Fixed
cast_sparse
,cast_dtm
, and other sparse casters to ignore groups in the input (#19) - Changed
unnest_tokens
so that it no longer uses tidyr's unnest, but rather a custom version that removes some overhead. In some experiments, this sped up unnest_tokens on large inputs by about 40%. This also moves tidyr from Imports to Suggests for now. unnest_tokens
now checks that there are no list columns in the input, and raises an error if present (since those cannot be unnested).- Added a
format
argument to unnest_tokens so that it can process html, xml, latex or man pages using the hunspell package, though only whentoken = "words"
. - Added a
get_sentiments
function that takes the name of a lexicon ("nrc", "bing", or "sentiment") and returns just that sentiment data frame (#25)
- Added documentation for n-grams, skip n-grams, and regex
- Added codecov and appveyor
- Added tidiers for LDA objects from topicmodels and a vignette on topic modeling
- Added function to calculate tf-idf of a tidy text dataset and a tf-idf vignette
- Fixed a bug when tidying by line/sentence/paragraph/regex and there are multiple non-text columns
- Fixed a bug when unnesting using n-grams and skip n-grams (entire text was not being collapsed)
- Added ability to pass a (custom tokenizing) function to token. Also added a collapse argument that makes the choice whether to combine lines before tokenizing explicit.
- Changed tidy.dictionary to return a tbl_df rather than a data.frame
- Updated
cast_sparse
to work with dplyr 0.5.0 - Deprecated the
pair_count
function, which has been moved topairwise_count
in the widyr package. This will be removed entirely in a future version.
- Initial release for text mining using tidy tools