Skip to content

[BUGFIX] FIPS Compliance #11066

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

ccravens
Copy link

@ccravens ccravens commented Apr 7, 2025

This is the first commit to enable GX for FIPS compliance. The use of md5 does not work on FIPS environments (typical in Government / Healthcare / Finance industries). I've gone through and updated the use of md5 to sha256.

I ran pytest --no-sqlalchemy and the migration to FIPS introduced 4 additional failing tests out of around 1,000. I have resolved 2 of those by updating expected md5 results to their respective sha256, but the following 2 are still failing in this initial FIPS commit that did not fail for me in the develop branch:

FAILED tests/execution_engine/partition_and_sample/test_pandas_execution_engine_sampling.py::test_sample_using_sha256
FAILED tests/validator/test_validator.py::test_validator_with_exception_info_in_result

#11065

Copy link

netlify bot commented Apr 7, 2025

‼️ Deploy request for niobium-lead-7998 rejected.

Name Link
🔨 Latest commit d4521e2

@gx-cla-bot
Copy link

gx-cla-bot bot commented Apr 7, 2025

A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GX in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution:

Individual Contributor License Agreement v1.0
Software Grant and Corporate Contributor License Agreement v1.0

Once you have signed the CLA, you can add a comment with the text @cla-bot check and the bot will update the PR status!

Please reach out to the #gx-community-support channel, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error.

Users missing a CLA: [email protected]

@ccravens
Copy link
Author

ccravens commented Apr 7, 2025

@cla-bot check

@gx-cla-bot
Copy link

gx-cla-bot bot commented Apr 7, 2025

A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GX in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution:

Individual Contributor License Agreement v1.0
Software Grant and Corporate Contributor License Agreement v1.0

Once you have signed the CLA, you can add a comment with the text @cla-bot check and the bot will update the PR status!

Please reach out to the #gx-community-support channel, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error.

Users missing a CLA: [email protected]

@ccravens
Copy link
Author

ccravens commented Apr 7, 2025

@cla-bot check

@gx-cla-bot
Copy link

gx-cla-bot bot commented Apr 7, 2025

A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GX in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution:

Individual Contributor License Agreement v1.0
Software Grant and Corporate Contributor License Agreement v1.0

Once you have signed the CLA, you can add a comment with the text @cla-bot check and the bot will update the PR status!

Please reach out to the #gx-community-support channel, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error.

Users missing a CLA: [email protected]

@billdirks
Copy link
Contributor

@ccravens Is there an issue with FIPS compliance software and having both an ExpectColumnValuesToBeValidSha256 expectation and an ExpectColumnValuesToBeValidMd5 expectation? Eliminate the md5 version would be a breaking change and removing it in a minor or patch bump wouldn't be allowed in semantic versioning.
It would be easier to add a new expectation sha256 expectation by using the md5 one as a template and having both live in the codebase.

I haven't looked at the code itself yet and will look now.

Copy link
Contributor

@billdirks billdirks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing! Overall, these changes look fine. We should have the md5 and sha256 expectation live side-by-side since this would allow users to only use sha256 without breaking md5 users. If this is an issue for FIPS compliance, please let me know.

@@ -150,7 +150,7 @@
)
from .expect_column_values_to_be_valid_mac import ExpectColumnValuesToBeValidMac
from .expect_column_values_to_be_valid_mbti import ExpectColumnValuesToBeValidMbti
from .expect_column_values_to_be_valid_md5 import ExpectColumnValuesToBeValidMd5
from .expect_column_values_to_be_valid_sha256 import ExpectColumnValuesToBeValidSha256
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like you've renamed this file. Per the discussion comment I just left, we should have both expectations live side-by-side so don't break our current users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expectation should still be available to people who want to use it. It would not impact FIPS users since they wouldn't use this expectation. If you want to author an equivalent sha256 expectation I would do that in a separate PR.

For this PR I would revert this change.

| `sample_using_mod` | `column_name='col', mod=<int>` | Take the mod of named column, and only keep rows that match the given value |
| `sample_using_a_list` | `column_name='col', value_list=<list[val]>` | Match the values in the named column against value_list, and only keep the matches |
| `sample_using_hash` | `column_name='col', hash_digits=<int>, hash_value=<str>` | Hash the values in the named column (using "md5" hash function), and only keep rows that match the given hash_value |
| `sampling_method` | `sampling_kwargs` | Returned Batch Data |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is documentation for the 0.18 branch so should not be updated.

| `split_on_multi_column_values` | `column_names='<list[col]>'` | Rows where values of column_names are same |
| `split_on_converted_datetime` | `column_name='col', date_format_string=<'%Y-%m-%d'>` | Rows where value of column_name converted to datetime using the given date_format_string are same |
| `split_on_hashed_column` | `column_name='col', hash_digits=<int>` | Rows where value of column_name hashed (using "md5" hash function) and retaining the stated number of hash_digits are same (experimental) |
| `splitter_method` | `splitter_kwargs` | Returned Batch Data |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also 0.18 documentation.

@@ -49,7 +49,7 @@ An Expectation can also expose Metrics, such as the observed value of a useful s

Validation Results can expose Metrics that are defined by specific Expectations that have been validated, called "Expectation Defined Metrics." To access those values, you address the Metric as a dot-delimited string that identifies the value, such as `expect_column_values_to_be_unique.success`or `expect_column_values_to_be_between.result.unexpected_percent`. These Metrics may be stored in a Metrics Store.

A `metric_kwargs_id` is a string representation of the Metric Kwargs that can be used as a database key. For simple cases, it could be easily readable, such as `column=Age`, but when there are multiple keys and values or complex values, it will most likely be a md5 hash of key/value pairs. It can also be `None` in the case that there are no kwargs required to identify the Metric.
A `metric_kwargs_id` is a string representation of the Metric Kwargs that can be used as a database key. For simple cases, it could be easily readable, such as `column=Age`, but when there are multiple keys and values or complex values, it will most likely be a sha256 hash of key/value pairs. It can also be `None` in the case that there are no kwargs required to identify the Metric.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also 0.18 documentation.

@ccravens
Copy link
Author

@ccravens Is there an issue with FIPS compliance software and having both an ExpectColumnValuesToBeValidSha256 expectation and an ExpectColumnValuesToBeValidMd5 expectation? Eliminate the md5 version would be a breaking change and removing it in a minor or patch bump wouldn't be allowed in semantic versioning. It would be easier to add a new expectation sha256 expectation by using the md5 one as a template and having both live in the codebase.

I haven't looked at the code itself yet and will look now.

Hello @billdirks thank you for your review on my PR.

I think it should be OK for python if maybe we did some sort of "gate". Maybe we should wrap the hashlib.md5/sha256() functions with a custom that then does something like:

def safeHash(string: str) -> str:
    try:
        return hashlib.md5(string.encode("utf-8")).hexdigest()
    except ValueError as e:
        logger.warning("MD5 is not allowed in FIPS mode, falling back to SHA-256")
        return hashlib.sha256(string.encode("utf-8")).hexdigest()

Python will throw an exception when you try and call md5() at runtime. Our best bet may be to introduce a wrapper function and then replace the (relatively few) occurrences of calling the hash function with the safe wrapper? Looking forward to your thoughts on this approach, thanks!

@billdirks
Copy link
Contributor

@ccravens Why would hashlib.md5 raise in your safeHash function? I was thinking this was an audit process but rereading your PR description it seems like there are special FIPS compliant versions of python that enforces no md5. Is that correct? Is there a public FIP python ersion (if so, could you link one?) or are there bespoke versions that different organizations create?

@ccravens
Copy link
Author

@ccravens Why would hashlib.md5 raise in your safeHash function? I was thinking this was an audit process but rereading your PR description it seems like there are special FIPS compliant versions of python that enforces no md5. Is that correct? Is there a public FIP python ersion (if so, could you link one?) or are there bespoke versions that different organizations create?

Hello @billdirks great question! In FIPS-enabled systems only FIPS-approved algorithms are available. So when Python tries to invoke the MD5 algorithm it is not available on the system. So when hashlib.md5() is called, Python throws an exception like:

ValueError: [digital envelope routines] disabled for FIPS

Copy link
Contributor

@billdirks billdirks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ccravens, We've talked internally and getting fips compliance isn't currently high on our priority list so I'll have limited time to help you update PR. I have provided some feedback to point out things that will need to be updated as well as asked some questions inline.

@@ -290,7 +290,7 @@ The format for an unstructured profile is below:
* CREDIT_CARD
* EMAIL_ADDRESS
* UUID
* HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave this as is and revert this change.
Data Profilers are a legacy part of our codebase. Users can elect to use this or not and can elect which features they want. A FIPS user would not use the md5 but other users might find it useful for their datasets.

@@ -150,7 +150,7 @@
)
from .expect_column_values_to_be_valid_mac import ExpectColumnValuesToBeValidMac
from .expect_column_values_to_be_valid_mbti import ExpectColumnValuesToBeValidMbti
from .expect_column_values_to_be_valid_md5 import ExpectColumnValuesToBeValidMd5
from .expect_column_values_to_be_valid_sha256 import ExpectColumnValuesToBeValidSha256
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expectation should still be available to people who want to use it. It would not impact FIPS users since they wouldn't use this expectation. If you want to author an equivalent sha256 expectation I would do that in a separate PR.

For this PR I would revert this change.

@@ -7,22 +7,22 @@
column_condition_partial,
)

MD5_REGEX = r"^([a-fA-F\d]{32})$"
SHA256_REGEX = r"^([a-fA-F\d]{64})$"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would revert changes to this file. It would not impact FIPS users.



def anonymize(string: str) -> str:
return md5(string.encode("utf-8")).hexdigest()
return sha256(string.encode("utf-8")).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used for sending analytic events and we would care if the output of this method changed. I would either:

  1. add a setting so this could be parameterized and default to md5.
  2. try md5 first and if it fails fallback to sha256.

@@ -29,7 +29,7 @@ def to_id(self, id_keys=None, id_ignore_keys=None) -> IDDictID:
return f"{key}={self[key]!s}"

_id_dict = convert_to_json_serializable(data={k: self[k] for k in id_keys})
return hashlib.md5(json.dumps(_id_dict, sort_keys=True).encode("utf-8")).hexdigest()
return hashlib.sha256(json.dumps(_id_dict, sort_keys=True).encode("utf-8")).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely fine though I'd want to review how this is used more. We could parameterize this by making a setting. We could share the setting with anonymize above.

@@ -300,7 +300,7 @@ def partition_on_hashed_column(
"""Partition on the hashed value of the named column"""
if self._dialect == GXSqlDialect.SQLITE:
return (
sa.func.md5(sa.cast(sa.column(column_name), sa.VARCHAR), hash_digits)
sa.func.sha256(sa.cast(sa.column(column_name), sa.VARCHAR), hash_digits)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the relationship of this hashing and the batch_identifiers[column_name]? How is the hashed batch identifier specified?

@@ -215,11 +215,11 @@ def sample_using_a_list(
value_list: list = self.get_sampling_kwargs_value_or_default(batch_spec, "value_list")
return sa.column(column_name).in_(value_list) # type: ignore[return-value] # FIXME CoP

def sample_using_md5(
def sample_using_sha256(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this being called anywhere. Can we just delete it?

@@ -348,9 +348,9 @@ def _add_sqlite_functions(connection):
logger.info(f"Adding custom sqlite functions to connection {connection}")
connection.create_function("sqrt", 1, lambda x: math.sqrt(x))
connection.create_function(
"md5",
"sha256",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this used? In the partitioner above?

""" # noqa: E501 # FIXME CoP
if suffix:
suffix = hashlib.md5(suffix.encode("utf-8")).hexdigest()
suffix = hashlib.sha256(suffix.encode("utf-8")).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this function being called anywhere. Can we delete?



def tuple_to_hash(tuple_):
return md5(str(tuple_).encode("utf-8")).hexdigest()
return sha256(str(tuple_).encode("utf-8")).hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this function being called anywhere, can we delete?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants