Move datasets between 4CAT servers #375
Merged: 40 commits into master, Oct 25, 2023

Conversation

@stijn-uva (Member) commented Jul 11, 2023

Fixes #352. Work in progress. Basic architecture:

  • API endpoint that returns one of four components of a dataset:

    • Metadata (basically the database row, plus the 4CAT version)
    • The dataset log file
    • The data file
    • A list of keys of child datasets
  • Worker that takes a list of dataset URLs and a 4CAT API key, uses these endpoints to 'reconstruct' each dataset locally, and queues additional jobs for any child datasets

Datasets can only be moved between 4CAT servers of the same version. This precludes one use case - moving from an older to a newer 4CAT - but the alternative is a recipe for complications, because the database structure can change between versions.
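
For illustration, the component endpoint could look roughly like this. This is a minimal sketch assuming a Flask route (4CAT's web tool is Flask-based); the route path, the `DATASETS` lookup and the version constant are illustrative stand-ins, not the PR's actual code:

```python
# Sketch only: one endpoint serving the four dataset components.
# DATASETS stands in for the real database; paths and route are illustrative.
from flask import Flask, jsonify, send_file, abort

app = Flask(__name__)
FOURCAT_VERSION = "0de7baf"  # e.g. the running git commit

DATASETS = {
    "abc123": {
        "row": {"key": "abc123", "type": "twitter-search"},
        "log": "data/abc123.log",
        "data": "data/abc123.ndjson",
        "children": ["def456"],
    }
}

@app.route("/api/export-packed-dataset/<string:key>/<string:component>")
def export_component(key, component):
    dataset = DATASETS.get(key)
    if not dataset:
        abort(404)

    if component == "metadata":
        # the database row, plus the exporting server's 4CAT version, so the
        # importing server can refuse datasets from a different version
        return jsonify({**dataset["row"], "current_4cat_version": FOURCAT_VERSION})
    elif component == "log":
        return send_file(dataset["log"])
    elif component == "data":
        return send_file(dataset["data"])
    elif component == "children":
        return jsonify(dataset["children"])
    else:
        abort(400)
```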

stijn-uva and others added 7 commits, July 10, 2023 20:34:

  • …CAT version for comparison, run some tests
  • I had no idea where you left off, so I just wanted to test it and see that way. Made some fixes and now see that you worked up until the log and actual data.
  • This uses the commit, which makes sense at the moment, but perhaps not in the long term.
@dale-wahl (Member) commented Aug 16, 2023

Tested between two instances of 4CAT successfully!

A few notes to consider before a merge:

  • I did not queue new workers for the children and instead handled them in the same worker. This seems fine to me, but there is no check to ensure a child is completed: I managed to import an unfinished processor!
  • I am using Flask's send_from_directory to serve the files. The function you wrote worked for CSV/NDJSON, but not for archives. We may want to rewrite your original function to chunk files, and possibly thread the stream, as this will likely hold up our frontend for a while (see the sketch after this list). That said, we already use that function for browser downloads without issue.
  • I was not sure what the final imported dataset should look like. To make it work, I just built a CSV. It would probably be cleaner to link directly to the imported dataset, but I hadn't really thought about how to do that yet... and I liked the output while developing, as it let me track which parts were failing! I did put links to the final datasets in the CSV, but apparently our CSV preview does not render links as clickable objects (could be a cool addition).
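
The chunked serving mentioned above could look roughly like this (a minimal sketch of a Flask streaming response; `stream_file` and the chunk size are illustrative assumptions, not the function from this PR):

```python
# Sketch only: stream a file in fixed-size chunks instead of loading it whole.
from flask import Response

CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk

def stream_file(path, mimetype="application/octet-stream"):
    def generate():
        with open(path, "rb") as infile:
            while True:
                chunk = infile.read(CHUNK_SIZE)
                if not chunk:
                    break
                yield chunk

    # streaming keeps memory usage flat for large archives, though the
    # response still occupies a worker until the transfer finishes
    return Response(generate(), mimetype=mimetype)
```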

# Conflicts:
#	datasources/fourcat_import/import_4cat.py
@dale-wahl (Member) commented:
When we merge this, it is going to make testing on problem datasets so much easier!

@stijn-uva marked this pull request as ready for review October 19, 2023 17:01
@stijn-uva (Member, Author) commented Oct 19, 2023

This now seems to work, with some limitations and caveats:

  • 4CAT tries to give an imported dataset the same key it had before, but if that key already exists locally, it assigns a new locally unique key to the dataset (see the sketch after this list)
  • Datasets without a data file are not imported. This particularly includes the results of filters that create a new standalone dataset.
  • Only one dataset can be imported at a time. The log of the 'main' dataset will also include logs and status updates related to the import process, even when they concern co-imported child datasets
  • This seems to be quite slow locally, for some reason, but that may be a me issue rather than a 4CAT issue
  • The anonymisation options, etc. don't do anything. They should probably be hidden, but hiding them based on the selected datasource is currently not possible
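
For the first point, the key-collision fallback could work roughly like this (a sketch; `key_exists` and the rehashing scheme are assumptions for illustration, not necessarily what the PR implements):

```python
import hashlib
import secrets

def resolve_import_key(original_key, key_exists):
    """
    Keep the imported dataset's original key if it is free locally; otherwise
    derive a new, locally unique key. `key_exists` is a callable that checks
    the local database (hypothetical here).
    """
    if not key_exists(original_key):
        return original_key

    # original key is taken: derive fresh candidates until one is unique
    while True:
        salt = secrets.token_hex(8)
        candidate = hashlib.md5((original_key + salt).encode("utf-8")).hexdigest()
        if not key_exists(candidate):
            return candidate
```

Usage would be e.g. `resolve_import_key(original, lambda k: k in existing_keys)`, with `existing_keys` standing in for a database check.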

It would be quite easy to allow people to import multiple datasets by providing a list of URLs instead of a single URL; the back-end is already set up for this. However, it may not be intuitive that the front-end acts as if you're creating a single dataset rather than all the datasets you're trying to import: the 'Create dataset' page is currently set up to create one and only one dataset. So there are some UI problems to solve before larger imports are possible. Perhaps this could be a 'power user' option that would need to be enabled by admins (though we arguably already have too many such options).
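
On the back-end side, accepting several URLs would amount to something like this (a sketch; the job type name and the `queue.add_job` call are assumptions, not 4CAT's actual queue API):

```python
def queue_imports(urls, api_key, queue):
    """Queue one import job per dataset URL (illustrative, not the PR's code)."""
    for url in urls.split(","):
        url = url.strip()
        if url:
            queue.add_job("import-4cat-dataset", details={"url": url, "api-key": api_key})
```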

@stijn-uva merged commit 0de7baf into master Oct 25, 2023
1 check passed
Development

Successfully merging this pull request may close these issues:

  • Facilitate moving datasets between instances (#352)