Releases: alteryx/woodwork
v0.6.0 Aug 4, 2021
- Fixes
- Fix bug in `_infer_datetime_format` with all `np.nan` input (#1089)
- Changes
- The criteria for categorical type inference have changed (#1065)
- The meaning of both the `categorical_threshold` and `numeric_categorical_threshold` settings have changed (#1065)
- Make sampling for type inference more consistent (#1083)
- Accessor logic checking if Woodwork has been initialized moved to decorator (#1093)
- Documentation Changes
- Testing Changes
Thanks to the following people for contributing to this release:
@davesque, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd, @tuethan1999
Breaking Changes
- #1065: The criteria for categorical type inference have changed. Relatedly, the meaning of both the `categorical_threshold` and `numeric_categorical_threshold` settings have changed. Now, a categorical match is signaled when a series either has the "categorical" pandas dtype or the ratio of unique value count (nan excluded) to total value count (nan also excluded) is at or below some fraction. The value used for this fraction is set by the `categorical_threshold` setting, which now has a default value of `0.2`. If a fraction is set for the `numeric_categorical_threshold` setting, then series with either a float or integer dtype may be inferred as categorical by applying the same logic described above with the `numeric_categorical_threshold` fraction. Otherwise, the `numeric_categorical_threshold` setting defaults to `None`, which indicates that series with a numerical type should not be inferred as categorical. Users who have overridden either the `categorical_threshold` or `numeric_categorical_threshold` settings will need to adjust their settings accordingly.
- #1083: The process of sampling series for logical type inference was updated to be more consistent. Previously, initial sampling for inference differed depending on collection type (pandas, dask, or koalas), and further randomized subsampling was performed in some cases during categorical inference and in every case during email inference, regardless of collection type. Overall, the way sampling was done was inconsistent and unpredictable. Now, the first 100,000 records of a column are sampled for logical type inference regardless of collection type, although only records from the first partition of a dask dataset will be used. Subsampling performed by the inference functions of individual types has been removed. The effect of these changes is that inferred types may now be different, although in many cases they will be more correct.
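The unique-ratio rule from #1065 can be sketched in a few lines of plain Python. This is a simplified, list-based stand-in for illustration, not Woodwork's actual implementation: the real check also treats series with the pandas "categorical" dtype as a match, and the threshold comes from Woodwork's config settings.

```python
def looks_categorical(values, threshold=0.2):
    """Return True when the ratio of unique non-null values to total
    non-null values is at or below `threshold` (default 0.2, matching
    the new `categorical_threshold` default)."""
    non_null = [v for v in values if v is not None]  # nan/None excluded
    if not non_null:
        return False
    ratio = len(set(non_null)) / len(non_null)
    return ratio <= threshold
```

With the default threshold, a column of 100 values drawn from 2 distinct labels matches (ratio 0.02), while a column of 100 all-distinct values does not (ratio 1.0).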
v0.5.1 Jul 22, 2021
- Enhancements
- Store inferred datetime format on Datetime logical type instance (#1025)
- Add support for automatically inferring the `EmailAddress` logical type (#1047)
- Add feature origin attribute to schema (#1056)
- Add ability to calculate outliers and the statistical info required for box and whisker plots to `WoodworkColumnAccessor` (#1048)
- Add ability to change config settings in a with block with `ww.config.with_options` (#1062)
- Raise a warning and remove tags when a user adds a column with index tags to a DataFrame (#1035)
- Changes
- Documentation Changes
- Add supported types to metadata docstring (#1049)
Thanks to the following people for contributing to this release:
@davesque, @frances-h, @jeff-hernandez, @simha104, @tamargrey, @thehomebrewnerd, @tuethan1999
v0.5.0 Jul 7, 2021
- Enhancements
- Fixes
- Add input validation to ww.init_series (#1015)
- Changes
Thanks to the following people for contributing to this release:
@jeff-hernandez, @simha104, @tamargrey, @thehomebrewnerd, @tuethan1999
v0.4.2 Jun 23, 2021
- Enhancements
- Fixes
- Changes
- Documentation Changes
- Testing Changes
Breaking Changes
- Progress callback function parameters have changed, and progress is now reported in the units specified by the unit of measurement parameter instead of as a percentage of the total. Progress callback functions are now expected to accept the following five parameters:
  - progress increment since last call
  - progress units complete so far
  - total units to complete
  - the progress unit of measurement
  - time elapsed since start of calculation
- `DataFrame.ww.init` no longer accepts the `make_index` parameter
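A callback matching the five-parameter shape described above might look like the sketch below. The parameter names here are illustrative assumptions; only their order and meaning come from these notes.

```python
def progress_callback(update, progress, total, unit, time_elapsed):
    """Report progress in the given unit of measurement.

    update       -- progress increment since the last call
    progress     -- units complete so far
    total        -- total units to complete
    unit         -- the progress unit of measurement (e.g. "rows")
    time_elapsed -- seconds since the start of the calculation
    """
    return f"+{update}: {progress}/{total} {unit} ({time_elapsed:.1f}s elapsed)"
```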
Thanks to the following people for contributing to this release:
@frances-h, @gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd, @tuethan1999
v0.4.1 Jun 9, 2021
- Enhancements
- Fixes
- Changes
- Change write_dataframe to only copy dataframe if it contains LatLong (#955)
- Testing Changes
Thanks to the following people for contributing to this release:
@frances-h, @gsheni, @tamargrey, @thehomebrewnerd
v0.4.0 May 26, 2021
- Enhancements
- Fixes
- Changes
- Add logical type method to transform data (#915)
- Testing Changes
Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @thehomebrewnerd, @tuethan1999
v0.3.1 May 12, 2021
- Enhancements
- Changes
- Documentation Changes
- Testing Changes
Breaking Changes
- The command line functions `python -m woodwork list-logical-types` and `python -m woodwork list-semantic-tags` no longer exist. Please call the underlying Python functions `ww.list_logical_types()` and `ww.list_semantic_tags()`.
Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd
v0.3.0 May 3, 2021
- Enhancements
- Add `is_schema_valid` and `get_invalid_schema_message` functions for checking schema validity (#834)
- Add logical types for `Age` and `AgeNullable` (#849)
- Add logical type for `Address` (#858)
- Add generic `to_disk` function to save Woodwork schema and data (#872)
- Add generic `read_file` function to read a file as a Woodwork DataFrame (#878)
- Fixes
- Changes
- Consistently use `ColumnNotPresentError` for mismatches between user input and dataframe/schema columns (#837)
- Raise custom `WoodworkNotInitError` when accessing Woodwork attributes before initialization (#838)
- Remove check requiring `Ordinal` instance for initializing a `ColumnSchema` object (#870)
- Increase koalas min version to 1.8.0 (#885)
- Documentation Changes
- Improve formatting of release notes (#874)
- Testing Changes
Breaking Changes
- Woodwork tables can no longer be saved to disk using `df.ww.to_csv`, `df.ww.to_pickle`, or `df.ww.to_parquet`. Use `df.ww.to_disk` instead.
- The `read_csv` function has been replaced by `read_file`.
Thanks to the following people for contributing to this release:
@frances-h, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd
v0.2.0 April 20, 2021
- Enhancements
- Add validation control to WoodworkTableAccessor (#736)
- Store `make_index` value on WoodworkTableAccessor (#780)
- Add optional `exclude` parameter to WoodworkTableAccessor `select` method (#783)
- Add validation control to `deserialize.read_woodwork_table` and `ww.read_csv` (#788)
- Add `WoodworkColumnAccessor.schema` and handle copying column schema (#799)
- Allow initializing a `WoodworkColumnAccessor` with a `ColumnSchema` (#814)
- Add `__repr__` to `ColumnSchema` (#817)
- Add `BooleanNullable` and `IntegerNullable` logical types (#830)
- Add validation control to `WoodworkColumnAccessor` (#833)
- Changes
- Rename `FullName` logical type to `PersonFullName` (#740)
- Rename `ZIPCode` logical type to `PostalCode` (#741)
- Fix issue with smart-open version 5.0.0 (#750, #758)
- Update minimum scikit-learn version to 0.22 (#763)
- Drop support for Python version 3.6 (#768)
- Remove `ColumnNameMismatchWarning` (#777)
- `get_column_dict` does not use standard tags by default (#782)
- Make `logical_type` and `name` params to `_get_column_dict` optional (#786)
- Rename Schema object and files to match new table-column schema structure (#789)
- Store column typing information in a `ColumnSchema` object instead of a dictionary (#791)
- `TableSchema` does not use standard tags by default (#806)
- Store `use_standard_tags` on the `ColumnSchema` instead of the `TableSchema` (#809)
- Move functions in `column_schema.py` to be methods on `ColumnSchema` (#829)
- Documentation Changes
- Testing Changes
- Add unit tests against minimum dependencies for python 3.6 on PRs and main (#743, #753, #763)
- Update spark config for test fixtures (#787)
- Separate latest unit tests into pandas, dask, koalas (#813)
- Update latest dependency checker to generate separate core, koalas, and dask dependencies (#815, #825)
- Ignore latest dependency branch when checking for updates to the release notes (#827)
- Change from GitHub PAT to auto generated GitHub Token for dependency checker (#831)
- Expand `ColumnSchema` semantic tag testing coverage and null `logical_type` testing coverage (#832)
Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd
Breaking Changes
- The `ZIPCode` logical type has been renamed to `PostalCode`
- The `FullName` logical type has been renamed to `PersonFullName`
- The `Schema` object has been renamed to `TableSchema`
- With the `ColumnSchema` object, typing information for a column can no longer be accessed with `df.ww.columns[col_name]['logical_type']`. Instead use `df.ww.columns[col_name].logical_type`.
- The `Boolean` and `Integer` logical types will no longer work with data that contains null values. The new `BooleanNullable` and `IntegerNullable` logical types should be used if null values are present.
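The nullable logical types above correspond to pandas' nullable extension dtypes, which hold missing values without forcing the column to a float or object dtype. The mapping shown here (integer data to `"Int64"`, boolean data to `"boolean"`) reflects pandas' nullable dtypes; treating them as Woodwork's exact backing dtypes is an assumption of this sketch.

```python
import pandas as pd

# Nullable extension dtypes keep their type in the presence of missing values,
# unlike plain int64/bool columns, which would be coerced to float64/object.
ints = pd.Series([1, 2, None], dtype="Int64")         # stays integer-typed
flags = pd.Series([True, None, False], dtype="boolean")
```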
v0.1.0 March 22, 2021
- Enhancements
- Implement Schema and Accessor API (#497)
- Add Schema class that holds typing info (#499)
- Add WoodworkTableAccessor class that performs type inference and stores Schema (#514)
- Allow initializing Accessor schema with a valid Schema object (#522)
- Add ability to read in a csv and create a DataFrame with an initialized Woodwork Schema (#534)
- Add ability to call pandas methods from Accessor (#538, #589)
- Add helpers for checking if a column is one of Boolean, Datetime, numeric, or categorical (#553)
- Add ability to load demo retail dataset with a Woodwork Accessor (#556)
- Add `select` to WoodworkTableAccessor (#548)
- Add `mutual_information` to WoodworkTableAccessor (#571)
- Add WoodworkColumnAccessor class (#562)
- Add semantic tag update methods to column accessor (#573)
- Add `describe` and `describe_dict` to WoodworkTableAccessor (#579)
- Add `init_series` util function for initializing a series with dtype change (#581)
- Add `set_logical_type` method to WoodworkColumnAccessor (#590)
- Add semantic tag update methods to table schema (#591)
- Add warning if additional parameters are passed along with schema (#593)
- Better warning when accessing column properties before init (#596)
- Update column accessor to work with LatLong columns (#598)
- Add `set_index` to WoodworkTableAccessor (#603)
- Implement `loc` and `iloc` for WoodworkColumnAccessor (#613)
- Add `set_time_index` to WoodworkTableAccessor (#612)
- Implement `loc` and `iloc` for WoodworkTableAccessor (#618)
- Allow updating logical types with `set_types` and make relevant DataFrame changes (#619)
- Allow serialization of WoodworkColumnAccessor to csv, pickle, and parquet (#624)
- Add DaskColumnAccessor (#625)
- Allow deserialization from csv, pickle, and parquet to Woodwork table (#626)
- Add `value_counts` to WoodworkTableAccessor (#632)
- Add KoalasColumnAccessor (#634)
- Add `pop` to WoodworkTableAccessor (#636)
- Add `drop` to WoodworkTableAccessor (#640)
- Add `rename` to WoodworkTableAccessor (#646)
- Add DaskTableAccessor (#648)
- Add Schema properties to WoodworkTableAccessor (#651)
- Add KoalasTableAccessor (#652)
- Add `__getitem__` to WoodworkTableAccessor (#633)
- Update Koalas min version and add support for more new pandas dtypes with Koalas (#678)
- Add `__setitem__` to WoodworkTableAccessor (#669)
- Fixes
- Changes
- Move mutual information logic to statistics utils file (#584)
- Bump min Koalas version to 1.4.0 (#638)
- Preserve pandas underlying index when not creating a Woodwork index (#664)
- Restrict Koalas version to `<1.7.0` due to breaking changes (#674)
- Clean up dtype usage across Woodwork (#682)
- Improve error when calling accessor properties or methods before init (#683)
- Remove dtype from Schema dictionary (#685)
- Add `include_index` param and allow unique columns in Accessor mutual information (#699)
- Include DataFrame equality and `use_standard_tags` in WoodworkTableAccessor equality check (#700)
- Remove `DataTable` and `DataColumn` classes to migrate towards the accessor approach (#713)
- Change `sample_series` dtype to not need conversion and remove `convert_series` util (#720)
- Rename Accessor methods since `DataTable` has been removed (#723)
- Documentation Changes
- Update README.md and Get Started guide to use accessor (#655, #717)
- Update Understanding Types and Tags guide to use accessor (#657)
- Update docstrings and API Reference page (#660)
- Update statistical insights guide to use accessor (#693)
- Update Customizing Type Inference guide to use accessor (#696)
- Update Dask and Koalas guide to use accessor (#701)
- Update index notebook and install guide to use accessor (#715)
- Add section to documentation about schema validity (#729)
- Update README.md and Get Started guide to use `pd.read_csv` (#730)
- Make small fixes to documentation formatting (#731)
- Testing Changes
Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @johnbridstrup, @tamargrey, @thehomebrewnerd