
Add jupyter notebooks to the repository #546

Merged 2 commits on Apr 6, 2025

Conversation

MichaelS239
Collaborator

This PR adds the following Jupyter notebooks:

  • Newly created notebooks about the following primitives:
    • Differential dependencies
    • Matching dependencies
    • Denial constraints
    • Association rules
    • Numerical association rules
  • Updated notebooks with demo scenarios:
    • Typo detection
    • Data deduplication
    • Anomaly detection

{
"cell_type": "markdown",
"source": [
"The algorithm for DC verification can check whether a given DC holds in the table."
Collaborator

@chernishev chernishev Mar 24, 2025

given exact DC

"\n",
"¬{ t.State == s.State } → 0.25\n",
"\n",
"Note: A smaller $g_1$ value means fewer violations, making the DC more exact."
Collaborator

more close to exact
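
To make the $g_1$ note above concrete, here is a minimal sketch in plain Python. It does not use the Desbordante API. It assumes $g_1$ is the fraction of ordered pairs of distinct rows that violate the DC; the normalization used in the notebook may differ. The function name `g1_not_equal` and the sample `states` column are hypothetical and are not taken from the notebook, so the result is not meant to reproduce the 0.25 shown in the hunk.

```python
from itertools import permutations


def g1_not_equal(values):
    """Fraction of ordered pairs of distinct rows violating the DC "not (t.X == s.X)".

    Assumption: g1 = violating ordered pairs / all ordered pairs of distinct rows.
    """
    pairs = list(permutations(range(len(values)), 2))
    if not pairs:
        return 0.0
    # A pair violates the DC when both rows carry the same value.
    violating = sum(1 for i, j in pairs if values[i] == values[j])
    return violating / len(pairs)


# Hypothetical column: two rows share a state, so 2 of the 12 ordered pairs violate the DC.
states = ["Texas", "Texas", "New York", "Ohio"]
print(g1_not_equal(states))
```

Under this assumption, a column with no repeated values gives 0.0 (the DC holds exactly), and the value grows as more pairs collide.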

{
"cell_type": "markdown",
"source": [
"**Differential dependencies** seem to be complicated, but in fact, they are easy to understand. Let's try it out with [Desbordante](https://github.com/Desbordante/desbordante-core)!"
Collaborator

try them

"Now let's move to the second DD: \"Distance [0, 50] -> Duration [0, 15]\". This DD means the following: for any\n",
"pair of tuples if the distance between them on the column \"Distance\" is between 0 and 50, then the distance on\n",
"the column \"Duration\" is between 0 and 15. In other words, if two flights have similar distances, then they\n",
"last for a similar time. As can be seen from the table, almost all flights have similar distances which differ\n",
Collaborator

discussed this issue in voice chat (we are showing all suitable records for the first record); fix in example too?
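
The DD semantics quoted in the hunk above can be sketched in plain Python, without the Desbordante API. This is an illustration only: the `flights` rows are made-up values rather than the notebook's table, `dd_violations` is a hypothetical helper, and the distance between two numeric values is assumed to be their absolute difference.

```python
from itertools import combinations

# Hypothetical flight records, not the table from the notebook.
flights = [
    {"Distance": 300, "Duration": 60},
    {"Distance": 320, "Duration": 70},
    {"Distance": 340, "Duration": 95},
    {"Distance": 900, "Duration": 150},
]


def dd_violations(rows, lhs, rhs):
    """Return index pairs that satisfy the left-hand constraint but break the right-hand one.

    lhs and rhs are (column, low, high) triples; distance is the absolute difference.
    """
    lhs_col, lhs_lo, lhs_hi = lhs
    rhs_col, rhs_lo, rhs_hi = rhs
    violations = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        if lhs_lo <= abs(a[lhs_col] - b[lhs_col]) <= lhs_hi:
            if not rhs_lo <= abs(a[rhs_col] - b[rhs_col]) <= rhs_hi:
                violations.append((i, j))
    return violations


# "Distance [0, 50] -> Duration [0, 15]"
print(dd_violations(flights, ("Distance", 0, 50), ("Duration", 0, 15)))
```

An empty result means the DD holds on the table; each returned pair is a counterexample that could be shown to the reader.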

"source": [
"If you are reading this, then you have learnt about differential dependencies. Not so difficult, after all, right?\n",
"\n",
"We have explored data and found insteresting patterns there:\n",
Collaborator

insteresting -> interesting

"cell_type": "markdown",
"source": [
"The number of constraints for each column can be different. The difference table can be accepted by the algorithm\n",
"only in the format stated above. Note that different difference tables processed by the algorithm result in\n",
Collaborator

I think this one will be better:

Please note that different difference tables fed into the algorithm result in different sections of the search space being explored and, thus, yield different results.

{
"cell_type": "code",
"source": [
"# Value cluster filtering parameters.\n",
Collaborator

I believe we need to put a text cell somewhere at the start that says:

This collection of scenarios demonstrates how to solve various data quality problems by exploiting patterns found (or validated) by Desbordante.

In this scenario, we showcase a simple application that performs typo detection in a table.

The idea of this scenario is described in the paper "Solving Data Quality Problems with Desbordante: a Demo" by G. Chernishev et al., available at https://arxiv.org/abs/2307.14935. There is also an interactive demo at https://desbordante.streamlit.app/.

"id": "zRAc6mNW5T6_"
},
"source": [
"## Setting up various algorithm parameters."
Collaborator

I believe we need to put a text cell somewhere at the start that says:

This collection of scenarios demonstrates how to solve various data quality problems by exploiting patterns found (or validated) by Desbordante.

In this scenario, we showcase a simple application that performs data deduplication in a table.

The idea of this scenario is described in the paper "Solving Data Quality Problems with Desbordante: a Demo" by G. Chernishev et al., available at https://arxiv.org/abs/2307.14935. There is also an interactive demo at https://desbordante.streamlit.app/.

{
"cell_type": "markdown",
"source": [
"## Setting up various algorithm parameters."
Collaborator

I believe we need to put a text cell somewhere at the start that says:

This collection of scenarios demonstrates how to solve various data quality problems by exploiting patterns found (or validated) by Desbordante.

In this scenario, we showcase a simple application that performs anomaly detection in a table.

The idea of this scenario is described in the paper "Solving Data Quality Problems with Desbordante: a Demo" by G. Chernishev et al., available at https://arxiv.org/abs/2307.14935. There is also an interactive demo at https://desbordante.streamlit.app/.

Collaborator

@chernishev chernishev left a comment

Everything looks good. Left several minor comments, mostly clarifications.

Collaborator

@chernishev chernishev left a comment

LGTM

@chernishev chernishev merged commit 2f83cdd into Desbordante:main Apr 6, 2025
16 of 24 checks passed
2 participants