Add data retention policy #188
Changes from 4 commits
66d3de7
74fba57
cd4aee0
badd7ea
9332dce
2a2c4ce
f4993a9
134c667
ae277a2
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Data Retention Policy | ||
|
||
Dandihub data storage on AWS EFS is expensive, and we suspect that a significant portion of the data | ||
currently stored is no longer used. Data migration is where the cost becomes extreme.
|
||
|
||
## Persistent Data locations | ||
|
||
|
||
Each user has access to two persistent locations: `/home/{user}` and `/shared/`. | ||
|
||
Within JupyterHub, the user always sees their home directory as `/home/jovyan`, but it is stored in EFS under their GitHub | ||
username (`/home/{user}`).
|
||
|
||
## Known cache file cleanup | ||
|
||
|
||
We should be able to safely remove the following: | ||
|
||
- `/home/{user}/.cache` | ||
- `nwb_cache` | ||
- Yarn Cache | ||
- `__pycache__` | ||
- pip cache | ||
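A minimal sketch of how the cache locations listed above could be located and sized for one home directory. The set of directory names is an assumption taken from that list; the pip and Yarn caches are assumed to sit under `.cache` (their usual default on Linux), so they are not matched separately. The names `CACHE_DIR_NAMES`, `dir_size`, and `find_cache_dirs` are hypothetical.

```python
from __future__ import annotations

import os
from pathlib import Path

# Directory names treated as caches (assumption, based on the list above).
CACHE_DIR_NAMES = {".cache", "nwb_cache", "__pycache__"}


def dir_size(path: Path) -> int:
    """Sum the lstat sizes of all non-directory entries under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # entry vanished or is unreadable; skip it
    return total


def find_cache_dirs(home: Path) -> dict[Path, int]:
    """Map each cache directory found under home to its size in bytes."""
    found = {}
    for root, dirs, _files in os.walk(home):
        for name in list(dirs):
            if name in CACHE_DIR_NAMES:
                cache = Path(root) / name
                found[cache] = dir_size(cache)
                dirs.remove(name)  # do not descend into a cache already counted
    return found
```

The result only reports candidates; whether anything is deleted, and above which size, is left to the policy.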
Review comment: In case user is still active -- I think it would be useful to report to the long running users, after reaching some threshold on any of those folders (e.g. 50MB) asking to clean them up.

Review comment: Hi @asmacdo, should we add a separate point here about monitoring and reporting the quotas of cache directories for active users?
|
||
|
||
|
||
## Determining Last Access | ||
|
||
EFS does not store last-access metadata for the data (although AWS must track access somehow in order to move files to `Infrequent Access`). | ||
|
||
Alternatives: | ||
- Use the [JupyterHub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users) to check when the user last used or logged in to the hub. | ||
- Use dandiarchive login information. | ||
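The first alternative could look roughly like the sketch below. It assumes an API token with permission to list users, relies on the `last_activity` field of the user model returned by `GET /hub/api/users`, and treats users with no recorded activity as inactive. The helper names and the 30-day default are hypothetical; recent JupyterHub versions paginate this endpoint, which the sketch does not handle.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def fetch_users(hub_url, token):
    """GET {hub_url}/hub/api/users and return the decoded list of user models."""
    req = urllib.request.Request(
        f"{hub_url.rstrip('/')}/hub/api/users",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def parse_timestamp(value):
    """Parse an ISO 8601 UTC timestamp that ends in 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def inactive_users(users, now, max_idle_days=30):
    """Return names of users whose last_activity is missing or older than the cutoff."""
    cutoff = now - timedelta(days=max_idle_days)
    idle = []
    for user in users:
        last = user.get("last_activity")
        if last is None or parse_timestamp(last) < cutoff:
            idle.append(user["name"])
    return idle
```

A call such as `inactive_users(fetch_users(hub_url, token), datetime.now(timezone.utc))` would then yield the candidates for an audit.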
|
||
|
||
## Automated Data Audit | ||
|
||
|
||
At some interval (30 days with no login?): | ||
|
||
- find files larger than 100 (?) GB with mtime older than 10 (?) days -- get total size and count | ||
- find files larger than 1 (?) GB with mtime older than 30 (?) days -- get total size and count | ||
|
||
- find `__pycache__` and `nwb_cache` folders and the pip cache with mtime older than 30 (?) days -- get total sizes and a list of them | ||
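The file-size checks above can be sketched as a single directory walk. This assumes "mtime > N days" means the file was last modified more than N days ago; the function name and parameters are hypothetical, and the thresholds are passed in because the policy has not fixed them yet.

```python
import os
import time

GB = 1024 ** 3
DAY = 86400  # seconds


def find_large_old_files(root, min_size_bytes, min_age_days, now=None):
    """Return (paths, total_bytes) for files larger than min_size_bytes
    whose mtime is more than min_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * DAY
    matches, total = [], 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_size > min_size_bytes and st.st_mtime < cutoff:
                matches.append(path)
                total += st.st_size
    return matches, total
```

For example, `find_large_old_files(home, 1 * GB, 30)` would cover the second rule; `len(paths)` gives the count and `total_bytes` the total size.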
|
||
|
||
Notify user if: | ||
- any of the above listed thresholds were reached | ||
- total disk usage (`du`) exceeds some threshold (e.g. 100G) | ||
|
||
- total outdated caches size exceeds some threshold (e.g. 1G) | ||
|
||
- prior notification was sent more than a week ago | ||
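One possible reading of these conditions is sketched below. It assumes that any size threshold can trigger a notice, and that the "more than a week ago" item acts as a rate limit rather than as a trigger of its own; a user who was never notified is treated as eligible. `AuditResult`, `should_notify`, and the default limits are hypothetical and only mirror the example values above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GB = 1024 ** 3


@dataclass
class AuditResult:
    large_file_hits: int    # files matching any size/mtime rule of the audit
    total_usage_bytes: int  # total disk usage of the user's data
    stale_cache_bytes: int  # combined size of outdated caches


def should_notify(
    audit: AuditResult,
    last_notice: Optional[datetime],
    now: datetime,
    usage_limit: int = 100 * GB,
    cache_limit: int = 1 * GB,
    min_gap: timedelta = timedelta(days=7),
) -> bool:
    """True if a threshold was reached and no notice was sent within min_gap."""
    threshold_hit = (
        audit.large_file_hits > 0
        or audit.total_usage_bytes > usage_limit
        or audit.stale_cache_bytes > cache_limit
    )
    recently_notified = last_notice is not None and now - last_notice <= min_gap
    return threshold_hit and not recently_notified
```

If the last item is instead meant as an independent trigger, the final line would change to an `or`; the policy text leaves this open.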
|
||
|
||
Notification information: | ||
- large file list | ||
|
||
- summarized data retention policy | ||
- notice number | ||
- request to clean up
Review comment: meanwhile it might be worth creating a simple data record schema to store those records as well so they could be reused by the tools to assemble higher level stats etc.
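A record schema along the lines suggested in the review comment might look like the sketch below. Every field name is an assumption derived from the notification contents listed above; nothing here is an agreed format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LargeFile:
    path: str
    size_bytes: int
    mtime: str  # ISO 8601 timestamp of the last modification


@dataclass
class NotificationRecord:
    user: str                # GitHub username of the notified user
    notice_number: int       # 1 for the first notice, 2 for the next, ...
    sent_at: str             # ISO 8601 timestamp of when the notice was sent
    total_usage_bytes: int   # total disk usage at audit time
    large_files: List[LargeFile] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record, including nested files, as a JSON object."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing one JSON object per notice (for example, one line per record in a log file) would let later tooling aggregate usage over time without re-scanning the file system.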
|
||
|
||
### Non-response cleanup | ||
|
||
If a user has not logged in for 60 days (30 days initial + 30 days following audit), send a warning: | ||
`In 10 days the following files will be cleaned up` | ||
|
||
If the user has not logged in for 70 days (30 initial + 30 after audit + 10 warning): | ||
`The following files were removed` | ||
|
||
Then reset the inactivity timer. | ||
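The timeline above (30 days to audit, 30 more to warning, 10 more to removal) can be expressed as a small function. The stage names and the inclusive boundaries (`>=`) are assumptions; the day counts are the ones proposed in this section.

```python
from enum import Enum


class Stage(Enum):
    ACTIVE = "active"    # no action needed
    AUDIT = "audit"      # audit runs, notice may be sent
    WARNING = "warning"  # "In 10 days the following files will be cleaned up"
    REMOVAL = "removal"  # "The following files were removed"


def cleanup_stage(days_since_login, audit_after=30, response_window=30, warning_window=10):
    """Map days without login to the policy stage that applies."""
    warn_at = audit_after + response_window  # 60 with the defaults
    remove_at = warn_at + warning_window     # 70 with the defaults
    if days_since_login >= remove_at:
        return Stage.REMOVAL
    if days_since_login >= warn_at:
        return Stage.WARNING
    if days_since_login >= audit_after:
        return Stage.AUDIT
    return Stage.ACTIVE
```

Keeping the three windows as parameters makes it easy to adjust them once the question-marked values in the policy are settled.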
|