Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update of primaryKey and foreignKeys when field is renamed #1560

Closed
fjuniorr opened this issue Jul 10, 2023 · 4 comments · Fixed by #1564 or #1577 · May be fixed by #1598
Closed

Update of primaryKey and foreignKeys when field is renamed #1560

fjuniorr opened this issue Jul 10, 2023 · 4 comments · Fixed by #1564 or #1577 · May be fixed by #1598
Assignees
Labels
bug Something isn't working

Comments

@fjuniorr
Copy link
Contributor

Currently, when a field is renamed with steps.field_update shouldn't other properties that rely on field.name be updated as well? The examples that came to mind are primaryKey and foreignKeys (if inside a package).

Otherwise frictionless transform could take a valid resource and generate an invalid resource as output. Here's one example:

from frictionless import Resource, Pipeline, steps, Schema

source = Resource("https://raw.githubusercontent.com/frictionlessdata/framework/main/data/transform.csv")
source.schema = Schema.from_descriptor({
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "population", "type": "integer"},
    ],
    "primaryKey": ["id"]
})

source.validate()

pipeline = Pipeline(
    steps=[
        steps.field_update(name="id", descriptor={"name": "pkey"}),
    ],
)
target = source.transform(pipeline)
target.validate()
{'valid': False,
 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.001},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'transform',
            'type': 'table',
            'valid': False,
            'place': '<memory>',
            'labels': [],
            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.001},
            'warnings': [],
            'errors': [{'type': 'schema-error',
                        'title': 'Schema Error',
                        'description': 'Provided schema is not valid.',
                        'message': 'Schema is not valid: primary key '
                                   '"[\'id\']" does not match the fields '
                                   '"[\'pkey\', \'name\', \'population\']"',
                        'tags': [],
                        'note': 'primary key "[\'id\']" does not match the '
                                'fields "[\'pkey\', \'name\', '
                                '\'population\']"'}]}]}
@fjuniorr
Copy link
Contributor Author

Otherwise frictionless transform could take a valid resource and generate an invalid resource as output.

Which in a way goes against the transform principle that metadata matters:

There are plenty of great ETL-frameworks written in Python and other languages. We use one of them (PETL) under the hood (described in more detail later). The core difference between Frictionless and others is that we treat metadata as a first-class citizen. This means that you don't lose type and other important information during the pipeline evaluation.

@shashigharti
Copy link
Contributor

Thanks! @fjuniorr for reporting.

@roll roll added the bug Something isn't working label Jul 11, 2023
@roll roll moved this to Current in Open Knowledge Jul 11, 2023
@github-project-automation github-project-automation bot moved this from Current to Done in Open Knowledge Jul 13, 2023
@fjuniorr
Copy link
Contributor Author

Thanks @shashigharti and @roll!

Here a couple more examples that I would hope would work regarding foreignKeys.

Update a resource.name that is referenced in a foreign key

from frictionless import transform, steps, Package
from pprint import pprint

dp = Package('https://raw.githubusercontent.com/splor-mg/datapackage-reprex/main/reprex/20230725T075521/datapackage.json')

print(f"{dp.get_resource('fact').validate().valid=}")

transform(dp, steps=[
    steps.resource_transform(name='dim', steps=[
        steps.resource_update(name='dim', descriptor={'name': 'dm'})
    ])
])

pprint(dp.get_resource('fact').validate().flatten(['title', 'message']))

which yields

dp.get_resource('fact').validate().valid=True
[['Resource Error',
  'The data resource has an error: failed to handle a foreign key for resource '
  '"fact" as resource "dim" does not exist']]

Update a field.name that is referenced in a foreign key

from frictionless import transform, steps, Package
from pprint import pprint

dp = Package('https://raw.githubusercontent.com/splor-mg/datapackage-reprex/main/reprex/20230725T075521/datapackage.json')

print(f"{dp.get_resource('fact').validate().valid=}")

transform(dp, steps=[
    steps.resource_transform(name='dim', steps=[
        steps.field_update(name='uo', descriptor={'name': 'uo_cod'})
    ])
])

pprint(dp.get_resource('fact').validate().flatten(['title', 'message']))

which yields

dp.get_resource('fact').validate().valid=True
[['ForeignKey Error',
  'Row at position "2" violates the foreign key: for "uo_cod": values "1501" '
  'not found in the lookup table "dim" as "uo"'],
 ['ForeignKey Error',
  'Row at position "3" violates the foreign key: for "uo_cod": values "1251" '
  'not found in the lookup table "dim" as "uo"'],
 ['ForeignKey Error',
  'Row at position "4" violates the foreign key: for "uo_cod": values "1251" '
  'not found in the lookup table "dim" as "uo"']]

In this case I also can't export a valid descriptor with dp.to_yaml('datapackage.yaml') because I get

$ frictionless validate datapackage.yaml
─────────────────────────────────────────────────────────────── Dataset ────────────────────────────────────────────────────────────────
                 dataset                  
┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ name ┃ type  ┃ path          ┃ status  ┃
┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ fact │ table │ data/fact.txt │ INVALID │
│ dim  │ table │ <memory>      │ INVALID │
└──────┴───────┴───────────────┴─────────┘
──────────────────────────────────────────────────────────────── Tables ────────────────────────────────────────────────────────────────
                                               fact                                               
┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Row  ┃ Field ┃ Type         ┃ Message                                                          ┃
┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ None │ None  │ source-error │ The data source has not supported or has inconsistent contents:  │
└──────┴───────┴──────────────┴──────────────────────────────────────────────────────────────────┘
                                               dim                                                
┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Row  ┃ Field ┃ Type         ┃ Message                                                          ┃
┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ None │ None  │ source-error │ The data source has not supported or has inconsistent contents:  │
└──────┴───────┴──────────────┴──────────────────────────────────────────────────────────────────┘

Do these make sense?

@roll
Copy link
Member

roll commented Jul 31, 2023

Thanks @fjuniorr!

@shashigharti can you pleas take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment