Add CFD validation #544
Conversation
most_frequent_rhs_[lhs_values] = std::max_element(rhs_count.begin(), rhs_count.end(),
                                                  [](auto const& a, auto const& b) {
                                                      return a.second < b.second;
                                                  })
                                         ->first;
IMHO using ranges here is cleaner, but the current approach is good enough too.
Suggested change:
- most_frequent_rhs_[lhs_values] = std::max_element(rhs_count.begin(), rhs_count.end(),
-                                                   [](auto const& a, auto const& b) {
-                                                       return a.second < b.second;
-                                                   })
-                                          ->first;
+ auto max_it = std::ranges::max_element(rhs_count, std::less{}, [](auto const& pair) { return pair.second; });
+ most_frequent_rhs_[lhs_values] = max_it->first;
LGTM overall
CFDStatsCalculator() = default;
If someone calls CalculateStatistics() on that default-initialized object, relation_ is null.
Are you sure we need that default constructor?
The default constructor is necessary because the CFDVerifier class needs to initialize its stats_calculator_ member during construction. The current code flow ensures safety because CalculateStatistics() is only called after initialization through VerifyCFD().
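If the default constructor stays, one possible way to address the null relation_ concern is an explicit guard. The sketch below is not the PR's code: Relation, StatsCalculatorSketch, IsInitialized() and the int return value are hypothetical names chosen so the example compiles on its own.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the relation type; the real class is not shown in this thread.
struct Relation {
    int num_rows = 0;
};

class StatsCalculatorSketch {
    std::shared_ptr<Relation> relation_;  // null after default construction

public:
    StatsCalculatorSketch() = default;
    explicit StatsCalculatorSketch(std::shared_ptr<Relation> relation)
        : relation_(std::move(relation)) {}

    bool IsInitialized() const {
        return relation_ != nullptr;
    }

    // Throws instead of dereferencing a null pointer when called too early.
    int CalculateStatistics() const {
        if (!relation_) {
            throw std::logic_error("CalculateStatistics() called before initialization");
        }
        return relation_->num_rows;
    }
};
```

With a guard like this, the safety of the call no longer depends on every caller going through VerifyCFD() first.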
bool satisfies = (rule_.second < 0 && row[rhs_attr_index_] == most_frequent_rhs) ||
                 (rule_.second > 0 && row[rhs_attr_index_] == rule_.second);
rule_.second can't be zero?
Yes, it can't. If the second part of the rule is a wildcard "__", it is stored in the Itemset as a negative number of the form (-1 - attr_id).
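The encoding described above can be sketched as follows. The helper names (EncodeWildcard, IsWildcard, WildcardAttr, Satisfies) are hypothetical and the values are plain ints for simplicity; only the (-1 - attr_id) formula and the shape of the check come from this thread.

```cpp
#include <cassert>

// Assumed encoding, following the comment above: a wildcard on attribute
// attr_id is stored as (-1 - attr_id), which is negative for attr_id >= 0.
inline int EncodeWildcard(int attr_id) {
    assert(attr_id >= 0);
    return -1 - attr_id;
}

inline bool IsWildcard(int item) {
    return item < 0;
}

// Inverse of EncodeWildcard: recovers the attribute id from a wildcard item.
inline int WildcardAttr(int item) {
    assert(IsWildcard(item));
    return -1 - item;
}

// Same shape as the check under review: a wildcard RHS is satisfied by the
// most frequent RHS value, a constant RHS by an exact match.
inline bool Satisfies(int rule_rhs, int row_rhs_value, int most_frequent_rhs) {
    return (rule_rhs < 0 && row_rhs_value == most_frequent_rhs) ||
           (rule_rhs > 0 && row_rhs_value == rule_rhs);
}
```

Note that if a zero ever did reach this check, neither branch would match and the row would be reported as violating.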
for (int attr_idx : lhs_attrs_) {
    lhs_values.push_back(row_values[attr_idx]);
}
I think insert will be a bit more readable:
Suggested change:
- for (int attr_idx : lhs_attrs_) {
-     lhs_values.push_back(row_values[attr_idx]);
- }
+ lhs_values.insert(lhs_values.end(), row_values.begin(), row_values.end());
This also makes reserve unneeded.
Thanks for the suggestion! However, this change copies all values from row_values, while the original code selects only those at indices in lhs_attrs_.
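The difference between the two versions can be shown with a small sketch. The element type int and both function names are assumptions. std::transform is used here as one loop-free way to keep the index-based selection, not as the form the PR adopted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Copies only the values at the indices listed in lhs_attrs (the original behaviour).
std::vector<int> SelectLhsValues(std::vector<int> const& row_values,
                                 std::vector<int> const& lhs_attrs) {
    std::vector<int> lhs_values;
    lhs_values.reserve(lhs_attrs.size());
    std::transform(lhs_attrs.begin(), lhs_attrs.end(), std::back_inserter(lhs_values),
                   [&row_values](int attr_idx) {
                       assert(attr_idx >= 0 &&
                              static_cast<std::size_t>(attr_idx) < row_values.size());
                       return row_values[attr_idx];
                   });
    return lhs_values;
}

// Copies the whole row, which is what the suggested insert call does.
std::vector<int> CopyWholeRow(std::vector<int> const& row_values) {
    std::vector<int> lhs_values;
    lhs_values.insert(lhs_values.end(), row_values.begin(), row_values.end());
    return lhs_values;
}
```

For a row {10, 20, 30, 40} and lhs_attrs {0, 2}, the first function yields {10, 30}, while the second returns all four values.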
For future reference, could you please mark resolved conversations on GitHub as 'Resolved'? It helps reviewers quickly identify which change requests were addressed and which may still need attention. Thanks!
CFDVerifierParams(std::vector<std::pair<std::string, std::string>> left,
                  std::pair<std::string, std::string> right,
You include "algorithms/cfd/cfd_verifier/cfd_verifier.h", which has
using CFDAttributeValuePair = std::pair<std::string, std::string>;
Consider using it here.
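A sketch of what the signature might look like with the alias. The rest of the real constructor is cut off in the diff, so CFDVerifierParamsSketch below is a hypothetical, simplified type that covers only the two visible parameters. The alias is redeclared locally so the example compiles without the project header.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Alias quoted in the comment above, redeclared here so the sketch is self-contained.
using CFDAttributeValuePair = std::pair<std::string, std::string>;

// Hypothetical, simplified parameter holder: only the two parameters visible in the diff.
struct CFDVerifierParamsSketch {
    std::vector<CFDAttributeValuePair> left;
    CFDAttributeValuePair right;

    CFDVerifierParamsSketch(std::vector<CFDAttributeValuePair> left_in,
                            CFDAttributeValuePair right_in)
        : left(std::move(left_in)), right(std::move(right_in)) {}
};
```

The alias keeps the signature shorter and ties the parameter types to the verifier's own definition.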
This pull request introduces Conditional Functional Dependency (CFD) verification. It adds the CFD verifier algorithm, its Python bindings, and an example script that demonstrates CFD verification.
New CFD Verification Functionality:
- src/core/algorithms/cfd/cfd_verifier/cfd_verifier.h: Introduced CFDVerifier class to handle CFD verification, including methods for loading data, executing verification, and calculating statistics.
- src/core/algorithms/cfd/cfd_verifier/cfd_stats_calculator.h: Added CFDStatsCalculator class to compute support and confidence for CFD rules.
- src/core/algorithms/cfd/cfd_verifier/highlight.h: Defined Highlight class to store clusters and violating rows for CFD rules.
Python Bindings:
- src/python_bindings/bindings.cpp: Added bindings for CFD verification to the Python module.
Example:
- examples/basic/verifying_cfd.py: Added a script to demonstrate how to verify CFDs using the Desbordante library, including loading data, defining CFD rules, executing verification, and printing results.
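As a rough illustration of how support and confidence could be computed for a rule with a wildcard right-hand side, here is a self-contained sketch. It is not the PR's CFDStatsCalculator. The definitions used are assumptions: support is the number of rows that agree with the most frequent RHS value of their LHS group, and confidence is that count divided by the total number of rows. The names CfdStats and ComputeStats and the integer-encoded rows are also hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct CfdStats {
    std::size_t support = 0;
    double confidence = 0.0;
};

// rows: integer-encoded table; lhs_attrs / rhs_attr: column indices of the rule.
CfdStats ComputeStats(std::vector<std::vector<int>> const& rows,
                      std::vector<std::size_t> const& lhs_attrs, std::size_t rhs_attr) {
    // LHS value combination -> (RHS value -> number of rows)
    std::map<std::vector<int>, std::map<int, std::size_t>> groups;
    for (auto const& row : rows) {
        std::vector<int> lhs_values;
        lhs_values.reserve(lhs_attrs.size());
        for (std::size_t attr_idx : lhs_attrs) {
            lhs_values.push_back(row.at(attr_idx));
        }
        ++groups[lhs_values][row.at(rhs_attr)];
    }

    CfdStats stats;
    for (auto const& group : groups) {
        // Rows carrying the most frequent RHS of their group count as satisfying.
        std::size_t best = 0;
        for (auto const& rhs_and_count : group.second) {
            best = std::max(best, rhs_and_count.second);
        }
        stats.support += best;
    }
    if (!rows.empty()) {
        stats.confidence = static_cast<double>(stats.support) / rows.size();
    }
    return stats;
}
```

The actual definitions in cfd_stats_calculator.h may differ, for example in how constant patterns restrict the rows that are counted.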