Add jupyter notebooks to the repository #546
{
"cell_type": "markdown",
"source": [
"The algorithm for DC verification can check whether a given DC holds in the table."
given exact DC
"\n",
"¬{ t.State == s.State } → 0.25\n",
"\n",
"Note: A smaller $g_1$ value means fewer violations, making the DC more exact."
more close to exact
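To make the $g_1$ note concrete for readers of the notebook, here is a minimal sketch in plain Python. It assumes that $g_1$ is the fraction of ordered pairs of distinct tuples that violate the DC; the exact normalization used by Desbordante may differ. The table `rows` and the helpers `violates` and `g1` are hypothetical and do not come from the notebook or from the Desbordante API.

```python
from itertools import permutations

# Hypothetical toy table; not the dataset used in the notebook.
rows = [
    {"State": "NY"},
    {"State": "NY"},
    {"State": "TX"},
    {"State": "CA"},
]


def violates(t, s):
    """Predicate inside the DC ¬{ t.State == s.State }.

    The DC forbids this predicate from being true, so a pair (t, s)
    for which it is true counts as a violation.
    """
    return t["State"] == s["State"]


def g1(table, violates):
    """Assumed definition: violating ordered pairs / all ordered pairs
    of distinct tuples. Returns 0.0 for tables with fewer than two rows."""
    pairs = list(permutations(table, 2))
    if not pairs:
        return 0.0
    return sum(violates(t, s) for t, s in pairs) / len(pairs)


g1_value = g1(rows, violates)
dc_holds = g1_value == 0.0  # an exact DC has no violating pairs
print(g1_value, dc_holds)
```

Under this assumed definition, a DC is exact when `g1_value` is 0, and the value grows as more tuple pairs violate it.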
{
"cell_type": "markdown",
"source": [
"**Differential dependencies** seem to be complicated, but in fact, they are easy to understand. Let's try it out with [Desbordante](https://github.com/Desbordante/desbordante-core)!"
try them
"Now let's move to the second DD: \"Distance [0, 50] -> Duration [0, 15]\". This DD means the following: for any\n",
"pair of tuples if the distance between them on the column \"Distance\" is between 0 and 50, then the distance on\n",
"the column \"Duration\" is between 0 and 15. In other words, if two flights have similar distances, then they\n",
"last for a similar time. As can be seen from the table, almost all flights have similar distances which differ\n",
discussed this issue in voice chat (we are showing all suitable records for the first record); fix in example too?
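For reference, the semantics of "Distance [0, 50] -> Duration [0, 15]" quoted in the hunk above can be sketched as a brute-force pairwise check. This is only an illustration in plain Python, not the Desbordante algorithm or its API; the `flights` rows and the `dd_violations` helper are hypothetical and are not taken from the notebook's dataset.

```python
from itertools import combinations

# Hypothetical toy rows; not the table shown in the notebook.
flights = [
    {"Distance": 500, "Duration": 60},
    {"Distance": 530, "Duration": 70},
    {"Distance": 540, "Duration": 100},
    {"Distance": 900, "Duration": 120},
]


def dd_violations(table, lhs, lhs_range, rhs, rhs_range):
    """Return index pairs (i, j) that violate the DD `lhs -> rhs`.

    A pair violates the DD when its absolute difference on `lhs` falls
    inside `lhs_range` but its absolute difference on `rhs` falls
    outside `rhs_range`. Both ranges are inclusive.
    """
    lhs_lo, lhs_hi = lhs_range
    rhs_lo, rhs_hi = rhs_range
    bad = []
    for (i, a), (j, b) in combinations(enumerate(table), 2):
        if lhs_lo <= abs(a[lhs] - b[lhs]) <= lhs_hi:
            if not rhs_lo <= abs(a[rhs] - b[rhs]) <= rhs_hi:
                bad.append((i, j))
    return bad


violations = dd_violations(flights, "Distance", (0, 50), "Duration", (0, 15))
print(violations)  # the DD holds only if this list is empty
```

On these toy rows, flights 0 and 1 satisfy the DD, while flight 2 is close to both of them in distance but not in duration, so the pairs (0, 2) and (1, 2) are reported. Listing every violating pair, rather than only the records matching the first row, is one way the example output could be presented.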
"source": [
"If you are reading this, then you have learnt about differential dependencies. Not so difficult, after all, right?\n",
"\n",
"We have explored data and found insteresting patterns there:\n",
insteresting -> interesting
"cell_type": "markdown",
"source": [
"The number of constraints for each column can be different. The difference table can be accepted by the algorithm\n",
"only in the format stated above. Note that different difference tables processed by the algorithm result in\n",
I think this one will be better:
Please note that different difference tables fed into the algorithm result in different sections of the search space being explored and, thus, yield different results.
{
"cell_type": "code",
"source": [
"# Value cluster filtering parameters.\n",
I believe somewhere at the start we need to put a text cell that says:
This collection of scenarios demonstrates how to solve various data quality problems by exploiting patterns found (or validated) by Desbordante.
In this scenario, we showcase a simple application that performs typo detection in a table.
The idea of this scenario is described in the paper "Solving Data Quality Problems with Desbordante: a Demo" by G. Chernishev et al., available at https://arxiv.org/abs/2307.14935. There is also an interactive demo at https://desbordante.streamlit.app/.
"id": "zRAc6mNW5T6_"
},
"source": [
"## Setting up various algorithm parameters."
I believe somewhere at the start we need to put a text cell that says:
This collection of scenarios demonstrates how to solve various data quality problems by exploiting patterns found (or validated) by Desbordante.
In this scenario, we showcase a simple application that performs data deduplication in a table.
The idea of this scenario is described in the paper "Solving Data Quality Problems with Desbordante: a Demo" by G. Chernishev et al., available at https://arxiv.org/abs/2307.14935. There is also an interactive demo at https://desbordante.streamlit.app/.
{
"cell_type": "markdown",
"source": [
"## Setting up various algorithm parameters."
I believe somewhere at the start we need to put a text cell that says:
This collection of scenarios demonstrates how to solve various data quality problems by exploiting patterns found (or validated) by Desbordante.
In this scenario, we showcase a simple application that performs anomaly detection in a table.
The idea of this scenario is described in the paper "Solving Data Quality Problems with Desbordante: a Demo" by G. Chernishev et al., available at https://arxiv.org/abs/2307.14935. There is also an interactive demo at https://desbordante.streamlit.app/.
Everything looks good. Left several minor comments, mostly clarifications.
LGTM
This PR adds the following Jupyter notebooks: