Releases: alteryx/woodwork

v0.6.0 (Aug 4, 2021, commit b58eb3f)

  • Fixes
    • Fix bug in _infer_datetime_format with all np.nan input (#1089)
  • Changes
    • The criteria for categorical type inference have changed (#1065)
    • The meanings of the categorical_threshold and numeric_categorical_threshold settings have changed (#1065)
    • Make sampling for type inference more consistent (#1083)
    • Accessor logic checking if Woodwork has been initialized moved to decorator (#1093)
  • Documentation Changes
    • Fix some release notes that ended up under the wrong release (#1082)
    • Add BooleanNullable and IntegerNullable types to the docs (#1085)
    • Add guide for saving and loading Woodwork DataFrames (#1066)
    • Add in-depth guide on logical types and semantic tags (#1086)
  • Testing Changes
    • Add additional reviewers to minimum and latest dependency checkers (#1070, #1073, #1077)
    • Update the sample_df fixture to have more logical_type coverage (#1058)

Thanks to the following people for contributing to this release:
@davesque, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd, @tuethan1999

Breaking Changes

  • #1065: The criteria for categorical type inference have changed. Relatedly, the meanings of the categorical_threshold and numeric_categorical_threshold settings have changed. Now, a categorical match is signaled when a series either has the "categorical" pandas dtype or when the ratio of its unique value count to its total value count (NaNs excluded from both counts) is at or below a threshold fraction. That fraction is set by the categorical_threshold setting, which now has a default value of 0.2. If a fraction is set for the numeric_categorical_threshold setting, then series with either a float or integer dtype may be inferred as categorical by applying the same logic with the numeric_categorical_threshold fraction. Otherwise, the numeric_categorical_threshold setting defaults to None, which indicates that series with a numeric dtype should never be inferred as categorical. Users who have overridden either the categorical_threshold or numeric_categorical_threshold settings will need to adjust their values accordingly.
  • #1083: The process of sampling series for logical type inference was updated to be more consistent. Previously, initial sampling differed depending on collection type (pandas, dask, or koalas), and further randomized subsampling was performed in some cases during categorical inference and in every case during email inference, regardless of collection type. Overall, sampling was inconsistent and unpredictable. Now, the first 100,000 records of a column are sampled for logical type inference regardless of collection type, although only records from the first partition of a dask dataset will be used. Subsampling performed by the inference functions of individual types has been removed. As a result of these changes, inferred types may now differ, although in many cases they will be more accurate.
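
The unique-ratio criterion from #1065 can be sketched without Woodwork itself. This is a minimal stand-alone illustration; the function and parameter names below are made up for the example and are not Woodwork's internal API:

```python
def infer_categorical(values, threshold=0.2, is_category_dtype=False):
    # Illustrative sketch of the criterion above. A column matches if it
    # already carries the pandas "category" dtype, or if the ratio of its
    # unique value count to its total value count (nulls excluded from
    # both counts) is at or below the threshold fraction.
    if is_category_dtype:
        return True
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    return len(set(non_null)) / len(non_null) <= threshold

print(infer_categorical(["red", "blue", "red", "blue"] * 25))  # True (2/100 unique)
print(infer_categorical(list(range(100))))                     # False (all unique)
```

With the default categorical_threshold of 0.2, the same check would apply to float or integer columns only when numeric_categorical_threshold is set to a fraction rather than left at its None default.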

v0.5.1 (Jul 22, 2021, commit 92d2971)

  • Enhancements
    • Store inferred datetime format on Datetime logical type instance (#1025)
    • Add support for automatically inferring the EmailAddress logical type (#1047)
    • Add feature origin attribute to schema (#1056)
    • Add ability to calculate outliers and the statistical info required for box and whisker plots to WoodworkColumnAccessor (#1048)
    • Add ability to change config settings in a with block with ww.config.with_options (#1062)
    • Raise a warning and remove tags when a user adds a column with index tags to a DataFrame (#1035)
  • Changes
    • Entirely null columns are now inferred as the Unknown logical type (#1043)
    • Add helper functions that check for whether an object is a koalas/dask series or dataframe (#1055)
    • TableAccessor.select method will now maintain dataframe column ordering in TableSchema columns (#1052)
  • Documentation Changes
    • Add supported types to metadata docstring (#1049)

Thanks to the following people for contributing to this release:
@davesque, @frances-h, @jeff-hernandez, @simha104, @tamargrey, @thehomebrewnerd, @tuethan1999

v0.5.0 (Jul 7, 2021, commit 0d5c5c1)

  • Enhancements
    • Add support for numpy array inputs to Woodwork (#1023)
    • Add support for pandas.api.extensions.ExtensionArray inputs to Woodwork (#1026)
  • Fixes
    • Add input validation to ww.init_series (#1015)
  • Changes
    • Remove lines in LogicalType.transform that raise error if dtype conflicts (#1012)
    • Add infer_datetime_format param to speed up to_datetime calls (#1016)
    • The default logical type is now the Unknown type instead of the NaturalLanguage type (#992)
    • Add pandas 1.3.0 compatibility (#987)

Thanks to the following people for contributing to this release:
@jeff-hernandez, @simha104, @tamargrey, @thehomebrewnerd, @tuethan1999

v0.4.2 (Jun 23, 2021, commit 6485525)

  • Enhancements
    • Pass additional progress information in callback functions (#979)
    • Add the ability to generate optional extra stats with DataFrame.ww.describe_dict (#988)
    • Add option to read and write orc files (#997)
    • Retain schema when calling series.ww.to_frame() (#1004)
  • Fixes
    • Raise type conversion error in Datetime logical type (#1001)
    • Try collections.abc to avoid deprecation warning (#1010)
  • Changes
    • Remove make_index parameter from DataFrame.ww.init (#1000)
    • Remove version restriction for dask requirements (#998)
  • Documentation Changes
    • Add instructions for installing the update checker (#993)
    • Disable pdf format with documentation build (#1002)
    • Silence deprecation warnings in documentation build (#1008)
    • Temporarily remove update checker to fix docs warnings (#1011)
  • Testing Changes
    • Add env setting to update checker (#978, #994)

Breaking Changes

  • Progress callback function parameters have changed, and progress is now reported in the units
    specified by the unit of measurement parameter instead of as a percentage of the total. Progress
    callback functions are now expected to accept the following five parameters:

    • progress increment since last call
    • progress units complete so far
    • total units to complete
    • the progress unit of measurement
    • time elapsed since start of calculation
  • DataFrame.ww.init no longer accepts the make_index parameter
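
The five-parameter contract above can be sketched as a plain function. The driver below is hypothetical and only shows the shape of the calls; in Woodwork, the callback is passed to methods that accept one, such as the mutual information functions:

```python
import time

def progress_callback(update, progress, total, unit, time_elapsed):
    # The five parameters, in order: increment since the last call, units
    # complete so far, total units to complete, the unit of measurement,
    # and time elapsed since the start of the calculation.
    print(f"{progress}/{total} {unit} (+{update}, {time_elapsed:.2f}s elapsed)")

# Hypothetical driver illustrating how such a callback might be invoked
def run_with_progress(items, callback, unit="calculations"):
    start = time.time()
    for done, _ in enumerate(items, start=1):
        callback(1, done, len(items), unit, time.time() - start)

run_with_progress(["a", "b", "c"], progress_callback)
```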

Thanks to the following people for contributing to this release:
@frances-h, @gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd, @tuethan1999

v0.4.1 (Jun 9, 2021, commit cbba41a)

  • Enhancements
    • Add concat_columns util function to concatenate multiple Woodwork objects into one, retaining typing information (#932)
    • Add option to pass progress callback function to mutual information functions (#958)
    • Add optional automatic update checker (#959, #970)
  • Fixes
    • Fix issue related to serialization/deserialization of data with whitespace and newline characters (#957)
    • Update to allow initializing a ColumnSchema object with an Ordinal logical type without order values (#972)
  • Changes
    • Change write_dataframe to only copy dataframe if it contains LatLong (#955)
  • Testing Changes
    • Fix bug in test_list_logical_types_default (#954)
    • Update minimum unit tests to run on all pull requests (#952)
    • Pass token to authorize uploading of codecov reports (#969)

Thanks to the following people for contributing to this release:
@frances-h, @gsheni, @tamargrey, @thehomebrewnerd

v0.4.0 (May 26, 2021, commit 6763b0a)

  • Enhancements
    • Add option to return TableSchema instead of DataFrame from table accessor select method (#916)
    • Add option to pass progress callback function to mutual information functions (#943)
  • Fixes
    • Fix bug when setting table name and metadata through accessor (#942)
    • Fix bug in which the dtype of category values were not restored properly on deserialization (#949)
  • Changes
    • Add logical type method to transform data (#915)
  • Testing Changes
    • Update when minimum unit tests will run to include minimum text files (#917)
    • Create separate workflows for each CI job (#919)

Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @thehomebrewnerd, @tuethan1999

v0.3.1 (May 12, 2021, commit 0c21225)

  • Enhancements
    • Add deep parameter to Woodwork Accessor and Schema equality checks (#889)
    • Add support for reading from parquet files to woodwork.read_file (#909)
  • Changes
    • Remove command line functions for list logical and semantic tags (#891)
    • Keep index and time index tags for single column when selecting from a table (#888)
    • Update accessors to store weak reference to data (#894)
  • Documentation Changes
    • Update nbsphinx version to fix docs build issue (#911, #913)
  • Testing Changes
    • Use Minimum Dependency Generator GitHub Action and remove tools folder (#897)
    • Move all latest and minimum dependencies into 1 folder (#912)

Breaking Changes

  • The command line functions python -m woodwork list-logical-types and python -m woodwork list-semantic-tags
    no longer exist. Please call the underlying Python functions ww.list_logical_types() and
    ww.list_semantic_tags().

Thanks to the following people for contributing to this release: @gsheni, @jeff-hernandez, @tamargrey, @thehomebrewnerd

v0.3.0 (May 3, 2021, commit 24be7cc)

  • Enhancements
    • Add is_schema_valid and get_invalid_schema_message functions for checking schema validity (#834)
    • Add logical type for Age and AgeNullable (#849)
    • Add logical type for Address (#858)
    • Add generic to_disk function to save Woodwork schema and data (#872)
    • Add generic read_file function to read file as Woodwork DataFrame (#878)
  • Fixes
    • Raise error when a column is set as the index and time index (#859)
    • Allow NaNs in index for schema validation check (#862)
    • Fix bug where invalid casting to Boolean would not raise error (#863)
  • Changes
    • Consistently use ColumnNotPresentError for mismatches between user input and dataframe/schema columns (#837)
    • Raise custom WoodworkNotInitError when accessing Woodwork attributes before initialization (#838)
    • Remove check requiring Ordinal instance for initializing a ColumnSchema object (#870)
    • Increase koalas min version to 1.8.0 (#885)
  • Documentation Changes
    • Improve formatting of release notes (#874)
  • Testing Changes
    • Remove unnecessary argument in codecov upload job (#853)
    • Change from GitHub Token to regenerated GitHub PAT dependency checkers (#855)
    • Update README.md with non-nullable dtypes in code example (#856)

Breaking Changes

  • Woodwork tables can no longer be saved to disk using df.ww.to_csv, df.ww.to_pickle, or
    df.ww.to_parquet. Use df.ww.to_disk instead.
  • The read_csv function has been replaced by read_file.

Thanks to the following people for contributing to this release:
@frances-h, @gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd

v0.2.0 (April 20, 2021, commit 9c2e3a9)

  • Enhancements
    • Add validation control to WoodworkTableAccessor (#736)
    • Store make_index value on WoodworkTableAccessor (#780)
    • Add optional exclude parameter to WoodworkTableAccessor select method (#783)
    • Add validation control to deserialize.read_woodwork_table and ww.read_csv (#788)
    • Add WoodworkColumnAccessor.schema and handle copying column schema (#799)
    • Allow initializing a WoodworkColumnAccessor with a ColumnSchema (#814)
    • Add __repr__ to ColumnSchema (#817)
    • Add BooleanNullable and IntegerNullable logical types (#830)
    • Add validation control to WoodworkColumnAccessor (#833)
  • Changes
    • Rename FullName logical type to PersonFullName (#740)
    • Rename ZIPCode logical type to PostalCode (#741)
    • Fix issue with smart-open version 5.0.0 (#750, #758)
    • Update minimum scikit-learn version to 0.22 (#763)
    • Drop support for Python version 3.6 (#768)
    • Remove ColumnNameMismatchWarning (#777)
    • get_column_dict does not use standard tags by default (#782)
    • Make logical_type and name params to _get_column_dict optional (#786)
    • Rename Schema object and files to match new table-column schema structure (#789)
    • Store column typing information in a ColumnSchema object instead of a dictionary (#791)
    • TableSchema does not use standard tags by default (#806)
    • Store use_standard_tags on the ColumnSchema instead of the TableSchema (#809)
    • Move functions in column_schema.py to be methods on ColumnSchema (#829)
  • Documentation Changes
    • Update Pygments version requirement (#751)
    • Update spark config for docs build (#787, #801, #810)
  • Testing Changes
    • Add unit tests against minimum dependencies for python 3.6 on PRs and main (#743, #753, #763)
    • Update spark config for test fixtures (#787)
    • Separate latest unit tests into pandas, dask, koalas (#813)
    • Update latest dependency checker to generate separate core, koalas, and dask dependencies (#815, #825)
    • Ignore latest dependency branch when checking for updates to the release notes (#827)
    • Change from GitHub PAT to auto generated GitHub Token for dependency checker (#831)
    • Expand ColumnSchema semantic tag testing coverage and null logical_type testing coverage (#832)

Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd

Breaking Changes

  • The ZIPCode logical type has been renamed to PostalCode
  • The FullName logical type has been renamed to PersonFullName
  • The Schema object has been renamed to TableSchema
  • With the ColumnSchema object, typing information for a column can no longer be accessed
    with df.ww.columns[col_name]['logical_type']. Instead, use df.ww.columns[col_name].logical_type.
  • The Boolean and Integer logical types will no longer work with data that contains null
    values. The new BooleanNullable and IntegerNullable logical types should be used if
    null values are present.

v0.1.0 (March 22, 2021, commit 2a28bfb)

  • Enhancements
    • Implement Schema and Accessor API (#497)
    • Add Schema class that holds typing info (#499)
    • Add WoodworkTableAccessor class that performs type inference and stores Schema (#514)
    • Allow initializing Accessor schema with a valid Schema object (#522)
    • Add ability to read in a csv and create a DataFrame with an initialized Woodwork Schema (#534)
    • Add ability to call pandas methods from Accessor (#538, #589)
    • Add helpers for checking if a column is one of Boolean, Datetime, numeric, or categorical (#553)
    • Add ability to load demo retail dataset with a Woodwork Accessor (#556)
    • Add select to WoodworkTableAccessor (#548)
    • Add mutual_information to WoodworkTableAccessor (#571)
    • Add WoodworkColumnAccessor class (#562)
    • Add semantic tag update methods to column accessor (#573)
    • Add describe and describe_dict to WoodworkTableAccessor (#579)
    • Add init_series util function for initializing a series with dtype change (#581)
    • Add set_logical_type method to WoodworkColumnAccessor (#590)
    • Add semantic tag update methods to table schema (#591)
    • Add warning if additional parameters are passed along with schema (#593)
    • Better warning when accessing column properties before init (#596)
    • Update column accessor to work with LatLong columns (#598)
    • Add set_index to WoodworkTableAccessor (#603)
    • Implement loc and iloc for WoodworkColumnAccessor (#613)
    • Add set_time_index to WoodworkTableAccessor (#612)
    • Implement loc and iloc for WoodworkTableAccessor (#618)
    • Allow updating logical types with set_types and make relevant DataFrame changes (#619)
    • Allow serialization of WoodworkColumnAccessor to csv, pickle, and parquet (#624)
    • Add DaskColumnAccessor (#625)
    • Allow deserialization from csv, pickle, and parquet to Woodwork table (#626)
    • Add value_counts to WoodworkTableAccessor (#632)
    • Add KoalasColumnAccessor (#634)
    • Add pop to WoodworkTableAccessor (#636)
    • Add drop to WoodworkTableAccessor (#640)
    • Add rename to WoodworkTableAccessor (#646)
    • Add DaskTableAccessor (#648)
    • Add Schema properties to WoodworkTableAccessor (#651)
    • Add KoalasTableAccessor (#652)
    • Add __getitem__ to WoodworkTableAccessor (#633)
    • Update Koalas min version and add support for more new pandas dtypes with Koalas (#678)
    • Add __setitem__ to WoodworkTableAccessor (#669)
  • Fixes
    • Create new Schema object when performing pandas operation on Accessors (#595)
    • Fix bug in _reset_semantic_tags causing columns to share same semantic tags set (#666)
    • Maintain column order in DataFrame and Woodwork repr (#677)
  • Changes
    • Move mutual information logic to statistics utils file (#584)
    • Bump min Koalas version to 1.4.0 (#638)
    • Preserve pandas underlying index when not creating a Woodwork index (#664)
    • Restrict Koalas version to <1.7.0 due to breaking changes (#674)
    • Clean up dtype usage across Woodwork (#682)
    • Improve error when calling accessor properties or methods before init (#683)
    • Remove dtype from Schema dictionary (#685)
    • Add include_index param and allow unique columns in Accessor mutual information (#699)
    • Include DataFrame equality and use_standard_tags in WoodworkTableAccessor equality check (#700)
    • Remove DataTable and DataColumn classes to migrate towards the accessor approach (#713)
    • Change sample_series dtype to not need conversion and remove convert_series util (#720)
    • Rename Accessor methods since DataTable has been removed (#723)
  • Documentation Changes
    • Update README.md and Get Started guide to use accessor (#655, #717)
    • Update Understanding Types and Tags guide to use accessor (#657)
    • Update docstrings and API Reference page (#660)
    • Update statistical insights guide to use accessor (#693)
    • Update Customizing Type Inference guide to use accessor (#696)
    • Update Dask and Koalas guide to use accessor (#701)
    • Update index notebook and install guide to use accessor (#715)
    • Add section to documentation about schema validity (#729)
    • Update README.md and Get Started guide to use pd.read_csv (#730)
    • Make small fixes to documentation formatting (#731)
  • Testing Changes
    • Add tests to Accessor/Schema that weren't previously covered (#712, #716)
    • Update release branch name in notes update check (#719)

Thanks to the following people for contributing to this release:
@gsheni, @jeff-hernandez, @johnbridstrup, @tamargrey, @thehomebrewnerd