Commit 59bca43

Add column accessor attribute to check for nullable logical type (#1127)
* Add WoodworkColumnAccessor.nullable method and tests
* Update docs and release notes
* Minor refactors in tests
* Remove testing of multiple collection types

  Testing all three collection types means we're basically just testing _get_valid_dtype again. We already know that works from other tests so we don't need to test it here.
* Update release notes
* Remove unsupported dtypes from _NULLABLE_PHYSICAL_TYPES
* Add WoodworkColumnAccessor.nullable to API docs
* Update nullability docs
* Update docstring

Co-authored-by: David Sanders <[email protected]>
1 parent 6eece87 commit 59bca43

6 files changed: 110 additions, 4 deletions

docs/source/api_reference.rst

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ WoodworkColumnAccessor
     WoodworkColumnAccessor.loc
     WoodworkColumnAccessor.logical_type
     WoodworkColumnAccessor.metadata
+    WoodworkColumnAccessor.nullable
     WoodworkColumnAccessor.remove_semantic_tags
     WoodworkColumnAccessor.reset_semantic_tags
     WoodworkColumnAccessor.semantic_tags

docs/source/guides/logical_types_and_semantic_tags.ipynb

Lines changed: 49 additions & 2 deletions
@@ -132,6 +132,7 @@
 "When Woodwork's type inference does not return any LogicalTypes for a column, Woodwork will set the column's logical type as the default LogicalType, `Unknown`. A logical type being inferred as `Unknown` may be a good indicator that a more specific logical type can be chosen and set by the user.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "Below is an example of a column for which no logical type is inferred, resulting in a Series with `Unknown` logical type. Looking at the contents of the Series, though, we can see that it contains country codes, so we set the logical type to `CountryCode`."
 ]
@@ -178,13 +179,15 @@
 "\n",
 "- **physical type**: `int64`\n",
 "- **standard tags**: `{'numeric'}`\n",
+"- **nullable**: no\n",
 "\n",
 "##### AgeFractional\n",
 "\n",
 "Represents Logical Types that contain non-negative floating point numbers indicating a person’s age. May contain null values.\n",
 "\n",
 "- **physical type**: `float64`\n",
 "- **standard tags**: `{'numeric'}`\n",
+"- **nullable**: yes\n",
 "\n",
 "\n",
 "##### AgeNullable\n",
@@ -193,26 +196,30 @@
 "\n",
 "- **physical type**: `Int64`\n",
 "- **standard tags**: `{'numeric'}`\n",
+"- **nullable**: yes\n",
 "\n",
 "##### Double\n",
 "\n",
 "Represents Logical Types that contain positive and negative numbers, some of which include a fractional component.\n",
 "\n",
 "- **physical type**: `float64`\n",
 "- **standard tags**: `{'numeric'}`\n",
+"- **nullable**: yes\n",
 "\n",
 "##### Integer\n",
 "\n",
 "Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0).\n",
 "\n",
 "- **physical type**: `int64`\n",
 "- **standard tags**: `{'numeric'}`\n",
+"- **nullable**: no\n",
 "\n",
 "##### IntegerNullable \n",
 "Represents Logical Types that contain positive and negative numbers without a fractional component, including zero (0). May contain null values. \n",
 "\n",
 "- **physical type**: `Int64`\n",
 "- **standard tags**: `{'numeric'}`\n",
+"- **nullable**: yes\n",
 "\n",
 "\n",
 "\n",
@@ -252,6 +259,7 @@
 "Represents a Logical Type with few unique values relative to the size of the data.\n",
 "\n",
 "- **physical type**: `category`\n",
+"- **nullable**: yes\n",
 "- **inference**: Woodwork defines a threshold for percentage unique values relative to the size of the series below which a series will be considered categorical. See [setting config options guide](setting_config_options.ipynb#Categorical-Threshold) for more information on how to control this threshold.\n",
 "- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.\n",
 "\n",
@@ -269,6 +277,7 @@
 "Represents Logical Types that use the ISO-3166 standard country code to represent countries. ISO 3166-1 (countries) are supported. These codes should be in the Alpha-2 format.\n",
 "\n",
 "- **physical type**: `category`\n",
+"- **nullable**: yes\n",
 "- **standard tags**: `{'category'}`\n",
 "- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.\n",
 "\n",
@@ -279,6 +288,7 @@
 "A Ordinal variable type can take ordered discrete values. Similar to Categorical, it is usually a limited, and fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings, or integers. \n",
 "\n",
 "- **physical type**: `category`\n",
+"- **nullable**: yes\n",
 "- **standard tags**: `{'category'}`\n",
 "- **parameters**:\n",
 " - `order` - the order of the ordinal values in the column from low to high\n",
@@ -298,6 +308,7 @@
 "Represents Logical Types that contain a series of postal codes for representing a group of addresses.\n",
 "\n",
 "- **physical type**: `category`\n",
+"- **nullable**: yes\n",
 "- **standard tags**: `{'category'}`\n",
 "- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.\n",
 "\n",
@@ -306,6 +317,7 @@
 "Represents Logical Types that use the ISO-3166 standard sub-region code to represent a portion of a larger geographic region. ISO 3166-2 (sub-regions) codes are supported. These codes should be in the Alpha-2 format.\n",
 "\n",
 "- **physical type**: `category`\n",
+"- **nullable**: yes\n",
 "- **standard tags**: `{'category'}`\n",
 "- **koalas note**: Koalas does not support the `category` dtype, so for Koalas DataFrames and Series, the `string` dtype will be used.\n",
 "\n",
@@ -348,16 +360,19 @@
 "Represents Logical Types that contain binary values indicating true/false.\n",
 "\n",
 "- **physical type**: `bool`\n",
+"- **nullable**: no\n",
 "\n",
 "##### BooleanNullable\n",
 "Represents Logical Types that contain binary values indicating true/false. May also contain null values.\n",
 "\n",
 "- **physical type**: `boolean`\n",
+"- **nullable**: yes\n",
 "\n",
 "##### Datetime\n",
 "A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings, or integers.\n",
 "\n",
 "- **physical type**: `datetime64[ns]`\n",
+"- **nullable**: yes\n",
 "- **transformation**: Will convert valid strings or numbers to pandas datetimes, and will parse more datetime formats with the use of the `datetime_format` parameter.\n",
 "- **parameters**:\n",
 " - `datetime_format` - the format of the datetimes in the column, ex: `'%Y-%m-%d'` vs `'%m-%d-%Y'`\n",
@@ -373,12 +388,14 @@
 "Represents Logical Types that contain email address values.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "- **inference**: Uses an email address regex that, if the data matches, means that the column contains email addresses. To learn more about controling the regex used, see the [setting config options guide](setting_config_options.ipynb#Email-Inference-Regex).\n",
 "\n",
 "##### LatLong\n",
 "A LatLong represents an ordered pair (Latitude, Longitude) that tells the location on Earth. The order of the tuple is important. LatLongs can be represented as tuple of floating point numbers. \n",
 "\n",
 "- **physical type**: `object`\n",
+"- **nullable**: yes\n",
 "- **transformation**: Will convert inputs into a tuple of floats. Any null values will be stored as `np.nan`\n",
 "- **koalas note**: Koalas does not support tuples, so latlongs will be stored as a list of floats\n",
 "\n",
@@ -387,6 +404,7 @@
 "Represents Logical Types that contain values specifying a duration of time.\n",
 "\n",
 "- **physical type**: `timedelta64[ns]`\n",
+"- **nullable**: yes\n",
 "\n",
 "\n",
 "Examples could inclue:\n",
@@ -457,6 +475,7 @@
 "Represents Logical Types that contain long-form text or characters representing natural human language\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "Examples of natural language data:\n",
 "\n",
@@ -469,39 +488,45 @@
 "Represents Logical Types that contain address values.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "\n",
 "##### Filepath\n",
 "\n",
 "Represents Logical Types that specify locations of directories and files in a file system.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "\n",
 "##### PersonFullName\n",
 "\n",
 "Represents Logical Types that may contain first, middle and last names, including honorifics and suffixes.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "##### PhoneNumber\n",
 "\n",
 "Represents Logical Types that contain numeric digits and characters representing a phone number.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "\n",
 "##### URL\n",
 "\n",
 "Represents Logical Types that contain URLs, which may include protocol, hostname and file name.\n",
 "\n",
 "- **physical type**: `string`\n",
+"- **nullable**: yes\n",
 "\n",
 "##### IPAddress\n",
 "\n",
 "Represents Logical Types that contain IP addresses, including both IPv4 and IPv6 addresses.\n",
 "\n",
-"- **physical type**: `string`\n"
+"- **physical type**: `string`\n",
+"- **nullable**: yes\n"
 ]
 },
 {
@@ -676,6 +701,28 @@
 "\n",
 "In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall."
 ]
+},
+{
+"cell_type": "markdown",
+"id": "sexual-adoption",
+"metadata": {},
+"source": [
+"## Checking for nullable logical types\n",
+"\n",
+"Some logical types support having null values in the underlying data while others do not. This is entirely based on whether a logical type's underlying primary dtype or backup dtype supports null values. For example, the `EmailAddress` logical type has an underlying primary dtype of `string`. Pandas allows series with the dtype `string` to contain null values marked by the `pandas.NA` sentinel. Therefore, `EmailAddress` supports null values. On the other hand, the `Integer` logical type does not support null values since its underlying primary pandas dtype is `int64`. Pandas does not allow null values in series with the dtype `int64`. However, pandas does allow null values in series with the dtype `Int64`. Therefore, the `IntegerNullable` logical type supports null values since its primary dtype is `Int64`.\n",
+"\n",
+"You can check if a column contains a nullable logical type by using `nullable` on the column accessor. The sections above that describe each type's characteristics include information about whether or not a logical type is nullable."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "surprised-today",
+"metadata": {},
+"outputs": [],
+"source": [
+"df.ww['bools_nullable'].ww.nullable"
+]
 }
 ],
 "metadata": {
@@ -694,7 +741,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.2"
+"version": "3.8.8"
 }
 },
 "nbformat": 4,
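
The guide text added in this commit ties nullability to the underlying pandas dtype (`int64` versus `Int64`, `bool` versus `boolean`). A minimal sketch of that behaviour, assuming pandas 1.0 or later is installed; `can_hold_null` is a hypothetical helper written for illustration and is not part of Woodwork:

```python
import pandas as pd


def can_hold_null(dtype):
    """Return True if a pandas Series of `dtype` can store a missing value.

    Hypothetical helper for illustration only; not part of Woodwork.
    """
    try:
        series = pd.Series([None], dtype=dtype)
    except (TypeError, ValueError):
        # pandas refused to build the Series, e.g. for plain int64.
        return False
    # Some dtypes (e.g. bool) coerce None to a real value instead of raising,
    # so confirm the stored value is still recognised as missing.
    return bool(series.isna().all())


for name in ["int64", "Int64", "bool", "boolean", "string"]:
    print(name, can_hold_null(name))
```

Dtype names are case-sensitive here: `Int64` is the pandas nullable extension type, while `int64` is the NumPy-backed type.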

docs/source/release_notes.rst

Lines changed: 2 additions & 1 deletion
@@ -7,6 +7,7 @@ Future Release
     * Enhancements
         * Add support for automatically inferring the ``URL`` and ``IPAddress`` logical types (:pr:`1122`, :pr:`1124`)
         * Add ``get_valid_mi_columns`` method to list columns that have valid logical types for mutual information calculation (:pr:`1129`)
+        * Add attribute to check if column has a nullable logical type (:pr:`1127`)
     * Fixes
     * Changes
         * Update ``get_invalid_schema_message`` to improve performance (:pr:`1132`)
@@ -15,7 +16,7 @@ Future Release
     * Testing Changes
 
     Thanks to the following people for contributing to this release:
-    :user:`ajaypallekonda`, :user:`thehomebrewnerd`
+    :user:`ajaypallekonda`, :user:`davesque`, :user:`jeff-hernandez`, :user:`thehomebrewnerd`
 
 v0.7.1 Aug 25, 2021
 ===================

woodwork/column_accessor.py

Lines changed: 8 additions & 1 deletion
@@ -16,7 +16,7 @@
     TypingInfoMismatchWarning
 )
 from woodwork.indexers import _iLocIndexer, _locIndexer
-from woodwork.logical_types import LatLong, Ordinal
+from woodwork.logical_types import _NULLABLE_PHYSICAL_TYPES, LatLong, Ordinal
 from woodwork.statistics_utils import _get_box_plot_info_for_column
 from woodwork.table_schema import TableSchema
 from woodwork.utils import (
@@ -106,6 +106,13 @@ def _series(self):
     def schema(self):
         return copy.deepcopy(self._schema)
 
+    @property
+    @_check_column_schema
+    def nullable(self):
+        """Whether the column can contain null values."""
+        dtype = self._schema.logical_type._get_valid_dtype(type(self._series))
+        return dtype in _NULLABLE_PHYSICAL_TYPES
+
     @property
     @_check_column_schema
     def description(self):

woodwork/logical_types.py

Lines changed: 20 additions & 0 deletions
@@ -503,3 +503,23 @@ class PostalCode(LogicalType):
     primary_dtype = 'category'
     backup_dtype = 'string'
     standard_tags = {'category'}
+
+
+_NULLABLE_PHYSICAL_TYPES = {
+    'boolean',
+    'category',
+    'datetime64[ns]',
+    'Int8',
+    'Int16',
+    'Int32',
+    'Int64',
+    'Float32',
+    'Float64',
+    'float16',
+    'float32',
+    'float64',
+    'float128',
+    'object',
+    'string',
+    'timedelta64[ns]',
+}
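
Read together with the `nullable` property in `woodwork/column_accessor.py`, the check amounts to a set-membership test on the physical dtype resolved for the column. A standalone sketch of that idea, assuming a hypothetical `PRIMARY_DTYPES` mapping in place of Woodwork's `_get_valid_dtype` resolution (the dtype values are taken from the guide in this commit; the mapping and `is_nullable` are illustrative names, not Woodwork APIs):

```python
# Physical dtypes treated as nullable, copied from the set added in this commit.
NULLABLE_PHYSICAL_TYPES = {
    'boolean', 'category', 'datetime64[ns]',
    'Int8', 'Int16', 'Int32', 'Int64',
    'Float32', 'Float64',
    'float16', 'float32', 'float64', 'float128',
    'object', 'string', 'timedelta64[ns]',
}

# Hypothetical stand-in for LogicalType._get_valid_dtype: maps a logical type
# name to the primary physical dtype listed for it in the guide.
PRIMARY_DTYPES = {
    'Integer': 'int64',
    'IntegerNullable': 'Int64',
    'Boolean': 'bool',
    'BooleanNullable': 'boolean',
    'Double': 'float64',
    'EmailAddress': 'string',
    'Categorical': 'category',
}


def is_nullable(logical_type_name):
    """Return True if the logical type's primary dtype is in the nullable set."""
    return PRIMARY_DTYPES[logical_type_name] in NULLABLE_PHYSICAL_TYPES


for name in PRIMARY_DTYPES:
    print(f"{name}: {is_nullable(name)}")
```

Because the lookup compares strings, `'Int64'` and `'int64'` are distinct entries, which is what separates `IntegerNullable` from `Integer`.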

woodwork/tests/accessor/test_column_accessor.py

Lines changed: 30 additions & 0 deletions
@@ -833,3 +833,33 @@ def test_series_methods_returning_frame_no_name(sample_series):
     reset_index_frame = sample_series.ww.reset_index(drop=False)
     assert _is_dataframe(reset_index_frame)
     assert reset_index_frame.ww.schema is not None
+
+
+EXPECTED_COLUMN_NULLABILITIES = {
+    'id': False,
+    'full_name': True,
+    'email': True,
+    'phone_number': True,
+    'age': True,
+    'signup_date': True,
+    'is_registered': True,
+    'double': True,
+    'double_with_nan': True,
+    'integer': False,
+    'nullable_integer': True,
+    'boolean': False,
+    'categorical': True,
+    'datetime_with_NaT': True,
+    'url': True,
+    'ip_address': True
+}
+
+
+def test_nullable_attribute(sample_df_pandas):
+    sample_df_pandas.ww.init()
+
+    for key in sample_df_pandas.ww.columns:
+        actual = sample_df_pandas.ww[key].ww.nullable
+        expected = EXPECTED_COLUMN_NULLABILITIES[key]
+
+        assert actual is expected
