-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[BUGFIX] FIPS Compliance #11066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
[BUGFIX] FIPS Compliance #11066
Conversation
|
Name | Link |
---|---|
🔨 Latest commit | d4521e2 |
A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GX in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution: Individual Contributor License Agreement v1.0 Once you have signed the CLA, you can add a comment with the text Please reach out to the #gx-community-support channel, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error. Users missing a CLA: [email protected] |
@cla-bot check |
A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GX in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution: Individual Contributor License Agreement v1.0 Once you have signed the CLA, you can add a comment with the text Please reach out to the #gx-community-support channel, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error. Users missing a CLA: [email protected] |
@cla-bot check |
A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GX in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution: Individual Contributor License Agreement v1.0 Once you have signed the CLA, you can add a comment with the text Please reach out to the #gx-community-support channel, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error. Users missing a CLA: [email protected] |
@ccravens Is there an issue with FIPS compliance software and having both an I haven't looked at the code itself yet and will look now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing! Overall, these changes look fine. We should have the md5 and sha256 expectation live side-by-side since this would allow users to only use sha256 without breaking md5 users. If this is an issue for FIPS compliance, please let me know.
@@ -150,7 +150,7 @@ | |||
) | |||
from .expect_column_values_to_be_valid_mac import ExpectColumnValuesToBeValidMac | |||
from .expect_column_values_to_be_valid_mbti import ExpectColumnValuesToBeValidMbti | |||
from .expect_column_values_to_be_valid_md5 import ExpectColumnValuesToBeValidMd5 | |||
from .expect_column_values_to_be_valid_sha256 import ExpectColumnValuesToBeValidSha256 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like you've renamed this file. Per the discussion comment I just left, we should have both expectations live side-by-side so don't break our current users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expectation should still be available to people who want to use it. It would not impact FIPS users since they wouldn't use this expectation. If you want to author an equivalent sha256 expectation I would do that in a separate PR.
For this PR I would revert this change.
| `sample_using_mod` | `column_name='col', mod=<int>` | Take the mod of named column, and only keep rows that match the given value | | ||
| `sample_using_a_list` | `column_name='col', value_list=<list[val]>` | Match the values in the named column against value_list, and only keep the matches | | ||
| `sample_using_hash` | `column_name='col', hash_digits=<int>, hash_value=<str>` | Hash the values in the named column (using "md5" hash function), and only keep rows that match the given hash_value | | ||
| `sampling_method` | `sampling_kwargs` | Returned Batch Data | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is documentation for the 0.18 branch so should not be updated.
| `split_on_multi_column_values` | `column_names='<list[col]>'` | Rows where values of column_names are same | | ||
| `split_on_converted_datetime` | `column_name='col', date_format_string=<'%Y-%m-%d'>` | Rows where value of column_name converted to datetime using the given date_format_string are same | | ||
| `split_on_hashed_column` | `column_name='col', hash_digits=<int>` | Rows where value of column_name hashed (using "md5" hash function) and retaining the stated number of hash_digits are same (experimental) | | ||
| `splitter_method` | `splitter_kwargs` | Returned Batch Data | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also 0.18 documentation.
@@ -49,7 +49,7 @@ An Expectation can also expose Metrics, such as the observed value of a useful s | |||
|
|||
Validation Results can expose Metrics that are defined by specific Expectations that have been validated, called "Expectation Defined Metrics." To access those values, you address the Metric as a dot-delimited string that identifies the value, such as `expect_column_values_to_be_unique.success`or `expect_column_values_to_be_between.result.unexpected_percent`. These Metrics may be stored in a Metrics Store. | |||
|
|||
A `metric_kwargs_id` is a string representation of the Metric Kwargs that can be used as a database key. For simple cases, it could be easily readable, such as `column=Age`, but when there are multiple keys and values or complex values, it will most likely be a md5 hash of key/value pairs. It can also be `None` in the case that there are no kwargs required to identify the Metric. | |||
A `metric_kwargs_id` is a string representation of the Metric Kwargs that can be used as a database key. For simple cases, it could be easily readable, such as `column=Age`, but when there are multiple keys and values or complex values, it will most likely be a sha256 hash of key/value pairs. It can also be `None` in the case that there are no kwargs required to identify the Metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also 0.18 documentation.
Hello @billdirks thank you for your review on my PR. I think it should be OK for python if maybe we did some sort of "gate". Maybe we should wrap the hashlib.md5/sha256() functions with a custom that then does something like: def safeHash(string: str) -> str:
try:
return hashlib.md5(string.encode("utf-8")).hexdigest()
except ValueError as e:
logger.warning("MD5 is not allowed in FIPS mode, falling back to SHA-256")
return hashlib.sha256(string.encode("utf-8")).hexdigest() Python will throw an exception when you try and call md5() at runtime. Our best bet may be to introduce a wrapper function and then replace the (relatively few) occurrences of calling the hash function with the safe wrapper? Looking forward to your thoughts on this approach, thanks! |
@ccravens Why would |
Hello @billdirks great question! In FIPS-enabled systems only FIPS-approved algorithms are available. So when Python tries to invoke the MD5 algorithm it is not available on the system. So when
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ccravens, We've talked internally and getting fips compliance isn't currently high on our priority list so I'll have limited time to help you update PR. I have provided some feedback to point out things that will need to be updated as well as asked some questions inline.
@@ -290,7 +290,7 @@ The format for an unstructured profile is below: | |||
* CREDIT_CARD | |||
* EMAIL_ADDRESS | |||
* UUID | |||
* HASH_OR_KEY (md5, sha1, sha256, random hash, etc.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave this as is and revert this change.
Data Profilers are a legacy part of our codebase. Users can elect to use this or not and can elect which features they want. A FIPS user would not use the md5 but other users might find it useful for their datasets.
@@ -150,7 +150,7 @@ | |||
) | |||
from .expect_column_values_to_be_valid_mac import ExpectColumnValuesToBeValidMac | |||
from .expect_column_values_to_be_valid_mbti import ExpectColumnValuesToBeValidMbti | |||
from .expect_column_values_to_be_valid_md5 import ExpectColumnValuesToBeValidMd5 | |||
from .expect_column_values_to_be_valid_sha256 import ExpectColumnValuesToBeValidSha256 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expectation should still be available to people who want to use it. It would not impact FIPS users since they wouldn't use this expectation. If you want to author an equivalent sha256 expectation I would do that in a separate PR.
For this PR I would revert this change.
@@ -7,22 +7,22 @@ | |||
column_condition_partial, | |||
) | |||
|
|||
MD5_REGEX = r"^([a-fA-F\d]{32})$" | |||
SHA256_REGEX = r"^([a-fA-F\d]{64})$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would revert changes to this file. It would not impact FIPS users.
|
||
|
||
def anonymize(string: str) -> str: | ||
return md5(string.encode("utf-8")).hexdigest() | ||
return sha256(string.encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used for sending analytic events and we would care if the output of this method changed. I would either:
- add a setting so this could be parameterized and default to md5.
- try md5 first and if it fails fallback to sha256.
@@ -29,7 +29,7 @@ def to_id(self, id_keys=None, id_ignore_keys=None) -> IDDictID: | |||
return f"{key}={self[key]!s}" | |||
|
|||
_id_dict = convert_to_json_serializable(data={k: self[k] for k in id_keys}) | |||
return hashlib.md5(json.dumps(_id_dict, sort_keys=True).encode("utf-8")).hexdigest() | |||
return hashlib.sha256(json.dumps(_id_dict, sort_keys=True).encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely fine though I'd want to review how this is used more. We could parameterize this by making a setting. We could share the setting with anonymize above.
@@ -300,7 +300,7 @@ def partition_on_hashed_column( | |||
"""Partition on the hashed value of the named column""" | |||
if self._dialect == GXSqlDialect.SQLITE: | |||
return ( | |||
sa.func.md5(sa.cast(sa.column(column_name), sa.VARCHAR), hash_digits) | |||
sa.func.sha256(sa.cast(sa.column(column_name), sa.VARCHAR), hash_digits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the relationship of this hashing and the batch_identifiers[column_name]
? How is the hashed batch identifier specified?
@@ -215,11 +215,11 @@ def sample_using_a_list( | |||
value_list: list = self.get_sampling_kwargs_value_or_default(batch_spec, "value_list") | |||
return sa.column(column_name).in_(value_list) # type: ignore[return-value] # FIXME CoP | |||
|
|||
def sample_using_md5( | |||
def sample_using_sha256( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this being called anywhere. Can we just delete it?
@@ -348,9 +348,9 @@ def _add_sqlite_functions(connection): | |||
logger.info(f"Adding custom sqlite functions to connection {connection}") | |||
connection.create_function("sqrt", 1, lambda x: math.sqrt(x)) | |||
connection.create_function( | |||
"md5", | |||
"sha256", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this used? In the partitioner above?
""" # noqa: E501 # FIXME CoP | ||
if suffix: | ||
suffix = hashlib.md5(suffix.encode("utf-8")).hexdigest() | ||
suffix = hashlib.sha256(suffix.encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this function being called anywhere. Can we delete?
|
||
|
||
def tuple_to_hash(tuple_): | ||
return md5(str(tuple_).encode("utf-8")).hexdigest() | ||
return sha256(str(tuple_).encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this function being called anywhere, can we delete?
This is the first commit to enable GX for FIPS compliance. The use of md5 does not work on FIPS environments (typical in Government / Healthcare / Finance industries). I've gone through and updated the use of md5 to sha256.
I ran
pytest --no-sqlalchemy
and the migration to FIPS introduced 4 additional failing tests out of around 1,000. I have resolved 2 of those by updating expected md5 results to their respective sha256, but the following 2 are still failing in this initial FIPS commit that did not fail for me in the develop branch:#11065