Skip to content

Add CFD validation #544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 16, 2025
Merged

Add CFD validation #544

merged 14 commits into from
May 16, 2025

Conversation

Godrik0
Copy link
Contributor

@Godrik0 Godrik0 commented Mar 12, 2025

This pull request introduces the Conditional Functional Dependencies (CFD) verification functionality. It includes new algorithms, and examples to demonstrate CFD validation. The most important changes include adding the CFD verifier algorithm and providing an example script for CFD verification.

New CFD Verification Functionality:

Python Bindings:

Example:

  • examples/basic/verifying_cfd.py: Added a script to demonstrate how to verify CFDs using the Desbordante library, including loading data, defining CFD rules, executing verification, and printing results.

Comment on lines 54 to 58
most_frequent_rhs_[lhs_values] = std::max_element(rhs_count.begin(), rhs_count.end(),
[](auto const& a, auto const& b) {
return a.second < b.second;
})
->first;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO using ranges here is cleaner, but current approach is good enough too.

Suggested change
most_frequent_rhs_[lhs_values] = std::max_element(rhs_count.begin(), rhs_count.end(),
[](auto const& a, auto const& b) {
return a.second < b.second;
})
->first;
auto max_it = std::ranges::max_element(rhs_count, std::less{}, [](auto const& pair) { return pair.second; });
most_frequent_rhs_[lhs_values] = max_it->first;

Copy link
Collaborator

@ol-imorozko ol-imorozko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall

Comment on lines +43 to +45

CFDStatsCalculator() = default;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone calls CalculateStatistics() on that default-initialized object, relation_ is null.
You sure we need that default constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default constructor is necessary because the CFDVerifier class needs to initialize its stats_calculator_ member during construction. The current code flow ensures safety because CalculateStatistics() is only called after initialization through VerifyCFD().

Comment on lines +72 to +73
bool satisfies = (rule_.second < 0 && row[rhs_attr_index_] == most_frequent_rhs) ||
(rule_.second > 0 && row[rhs_attr_index_] == rule_.second);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rule_.second can't be zero?

Copy link
Contributor Author

@Godrik0 Godrik0 May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if the second part of the rule is a wildcard "__".
In the Itemset, "__" is represented as a negative number of the form (-1 - attr_id) .

Comment on lines +72 to +73
bool satisfies = (rule_.second < 0 && row[rhs_attr_index_] == most_frequent_rhs) ||
(rule_.second > 0 && row[rhs_attr_index_] == rule_.second);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rule_.second can't be zero?

Comment on lines +30 to +32
for (int attr_idx : lhs_attrs_) {
lhs_values.push_back(row_values[attr_idx]);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think insert will be a bit more readable:

Suggested change
for (int attr_idx : lhs_attrs_) {
lhs_values.push_back(row_values[attr_idx]);
}
lhs_values.insert(lhs_values.end(), row_values.begin(), row_values.end());

This also makes reserve unneeded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! However, this change copies all values from row_values, while the original code selects only those at indices in lhs_attrs_.

Copy link
Collaborator

@ol-imorozko ol-imorozko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, could you please mark resolved conversations on GitHub as 'Resolved'? It helps reviewers quickly identify which change requests were addressed and which may still need attention. Thanks!

Comment on lines +18 to +19
CFDVerifierParams(std::vector<std::pair<std::string, std::string>> left,
std::pair<std::string, std::string> right,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you include #include "algorithms/cfd/cfd_verifier/cfd_verifier.h", which has

using CFDAttributeValuePair = std::pair<std::string, std::string>;

consider using it here

@chernishev chernishev merged commit 2f2190e into Desbordante:main May 16, 2025
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants