Move datasets between 4CAT servers #375
Merged: 40 commits into master, Oct 25, 2023

Conversation

@stijn-uva (Member) commented Jul 11, 2023

Fixes #352. Work in progress. Basic architecture:

  • API endpoint that returns one of four components of a dataset:

    • Metadata (basically the database row, plus the 4CAT version)
    • The dataset log file
    • The data file
    • A list of keys of child datasets
  • Worker that takes a list of dataset URLs and a 4CAT API key, uses these endpoints to 'reconstruct' each dataset locally, and queues additional jobs for any child datasets

Datasets can only be moved between 4CAT servers of the same version. This precludes one use case - moving from an older to a newer 4CAT - but the alternative is a recipe for complications, because the database structure can change between versions.
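
For illustration, the component endpoint could look roughly like this. This is a minimal sketch assuming a Flask route (4CAT's web tool is Flask-based); the route path, the `DATASETS` lookup and the version constant are illustrative stand-ins, not the PR's actual code:

```python
# Sketch only: one endpoint serving the four dataset components.
# DATASETS stands in for the real database; paths and route are illustrative.
from flask import Flask, jsonify, send_file, abort

app = Flask(__name__)
FOURCAT_VERSION = "0de7baf"  # e.g. the running git commit

DATASETS = {
    "abc123": {
        "row": {"key": "abc123", "type": "twitter-search"},
        "log": "data/abc123.log",
        "data": "data/abc123.ndjson",
        "children": ["def456"],
    }
}

@app.route("/api/export-packed-dataset/<string:key>/<string:component>")
def export_component(key, component):
    dataset = DATASETS.get(key)
    if not dataset:
        abort(404)

    if component == "metadata":
        # the database row, plus the exporting server's 4CAT version, so the
        # importing server can refuse datasets from a different version
        return jsonify({**dataset["row"], "current_4cat_version": FOURCAT_VERSION})
    elif component == "log":
        return send_file(dataset["log"])
    elif component == "data":
        return send_file(dataset["data"])
    elif component == "children":
        return jsonify(dataset["children"])
    else:
        abort(400)
```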

stijn-uva and others added 7 commits, July 10, 2023 20:34:

  • …CAT version for comparison, run some tests
  • I had no idea where you left off, so I just wanted to test it and see that way. Made some fixes and now see that you worked up until the log and actual data.
  • This uses the commit, which makes sense at the moment, but perhaps not in the long term.
@dale-wahl (Member) commented Aug 16, 2023

Tested between two instances of 4CAT successfully!

A few notes to consider before a merge:

  • I did not queue new workers for the children and instead handled them in the same worker. This seems fine to me, but there is no check to ensure a child is completed: I managed to import an unfinished processor!
  • I am using Flask's send_from_directory to serve the files. The function you wrote worked for CSV/NDJSON, but not for archives. We may want to rewrite your original function to chunk files, and possibly thread the stream, as this will likely hold up our frontend for a while (see the sketch after this list). That said, we already use that function for browser downloads without issue.
  • I was not sure what the final imported dataset should look like. To make it work, I just built a CSV. It would probably be cleaner to link directly to the imported dataset, but I hadn't really thought about how to do that yet... and I liked the output while developing, as it let me track which parts were failing! I did put links to the final datasets in the CSV, but apparently our CSV preview does not render links as clickable objects (could be a cool addition).
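
The chunked serving mentioned above could look roughly like this (a minimal sketch of a Flask streaming response; `stream_file` and the chunk size are illustrative assumptions, not the function from this PR):

```python
# Sketch only: stream a file in fixed-size chunks instead of loading it whole.
from flask import Response

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk

def stream_file(path, mimetype="application/octet-stream"):
    def generate():
        with open(path, "rb") as infile:
            while True:
                chunk = infile.read(CHUNK_SIZE)
                if not chunk:
                    break
                yield chunk

    # streaming keeps memory usage flat for large archives, though the
    # response still occupies a worker until the transfer finishes
    return Response(generate(), mimetype=mimetype)
```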

# Conflicts:
#	datasources/fourcat_import/import_4cat.py
@dale-wahl (Member) commented:
When we merge this, it is going to make testing on problem datasets so much easier!

@stijn-uva marked this pull request as ready for review October 19, 2023 17:01
@stijn-uva (Member, Author) commented Oct 19, 2023

This now seems to work, with some limitations and caveats:

  • 4CAT tries to give an imported dataset the same key it had before, but if that key already exists locally, it assigns a new locally unique key to the dataset (see the sketch after this list)
  • Datasets without a data file are not imported. This particularly includes the results of filters that create a new standalone dataset.
  • Only one dataset can be imported at a time. The log of the 'main' dataset will also include logs and status updates related to the import process, even when they concern co-imported child datasets
  • This seems to be quite slow locally, for some reason, but that may be a me issue rather than a 4CAT issue
  • The anonymisation options, etc. don't do anything. They should probably be hidden, but hiding them based on the selected datasource is currently not possible
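
For the first point, the key-collision fallback could work roughly like this (a sketch; `key_exists` and the rehashing scheme are assumptions for illustration, not necessarily what the PR implements):

```python
import hashlib
import secrets

def resolve_import_key(original_key, key_exists):
    """
    Keep the imported dataset's original key if it is free locally; otherwise
    derive a new, locally unique key. `key_exists` is a callable that checks
    the local database (hypothetical here).
    """
    if not key_exists(original_key):
        return original_key

    # original key is taken: derive fresh candidates until one is unique
    while True:
        salt = secrets.token_hex(8)
        candidate = hashlib.md5((original_key + salt).encode("utf-8")).hexdigest()
        if not key_exists(candidate):
            return candidate
```

Usage would be e.g. `resolve_import_key(original, lambda k: k in existing_keys)`, with `existing_keys` standing in for a database check.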

It would be quite easy to allow people to import multiple datasets by providing a list of URLs instead of a single URL; the back-end is already set up for this. However, it may not be intuitive that the front-end acts as if you're creating a single dataset rather than all the datasets you're trying to import: the 'Create dataset' page is currently set up to create one and only one dataset. So there are some UI problems to solve before larger imports are possible. Perhaps this could be a 'power user' option that would need to be enabled by admins (though we arguably already have too many such options).
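
On the back-end side, accepting several URLs would amount to something like this (a sketch; the job type name and the `queue.add_job` call are assumptions, not 4CAT's actual queue API):

```python
def queue_imports(urls, api_key, queue):
    """Queue one import job per dataset URL (illustrative, not the PR's code)."""
    for url in urls.split(","):
        url = url.strip()
        if url:
            queue.add_job("import-4cat-dataset", details={"url": url, "api-key": api_key})
```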

@stijn-uva merged commit 0de7baf into master Oct 25, 2023
1 check passed
Development

Successfully merging this pull request may close these issues:

  • Facilitate moving datasets between instances (#352)