
Switch exports to Zstd, from bzip2 #3148

Open · wants to merge 1 commit into dev
Conversation

@linkmauve (Contributor):
The main benefit of this format for our users is that it decompresses much faster than bzip2, even at high compression levels.

At level 19 it even compresses better than bzip2 for our files. Hopefully the compression time is still acceptable; if not, we can lower the level so as not to overwork the server, at the price of slightly bigger files.

On my i7-8700K, unarchiving sentences.tar.bz2 takes 15.5 s, compared to 994 ms for sentences.csv.zst compressed at level 19. The file is 183 MiB, compared to 197 MiB with bzip2. We could go down to 167 MiB with level 22 (which decompresses in 941 ms), but compression time starts to get much higher; I'm not sure that is worth it.

The only downside I see to this change is that users' automation will have to be updated, so perhaps we should announce it somehow before deploying.
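The comparison above can be sketched with plain zstd commands. This is a minimal, hypothetical roundtrip on a throwaway file (the real numbers were measured on the full exports); it skips itself if zstd is not installed:

```shell
# Minimal roundtrip sketch: compress at the proposed level 19, then decompress.
tmp=$(mktemp -d)
printf 'id\tlang\ttext\n1\teng\tHello world.\n' > "$tmp/sentences.csv"
if command -v zstd >/dev/null 2>&1; then
    zstd -q -19 -f -o "$tmp/sentences.csv.zst" "$tmp/sentences.csv"
    # Decompression is what users care about; it is fast at any level.
    zstd -q -d -f -o "$tmp/roundtrip.csv" "$tmp/sentences.csv.zst"
    cmp -s "$tmp/sentences.csv" "$tmp/roundtrip.csv" && echo "roundtrip OK"
else
    echo "zstd not installed, skipping"
fi
```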

@jiru (Member) commented Dec 10, 2024

Thank you, this is a welcome improvement, but I wonder if the performance benefits are really worth breaking data consumers' workflows all of a sudden.

I think it would be nicer for data consumers to have a transition period during which we produce both the old bzip2 and the new zstd files, while only advertising the zstd files on the https://tatoeba.org/downloads page. We can then remove the bzip2 file generation code whenever we think it's a good time to do so; we could rely on HTTP logs to check how often the bzip2 files are still being downloaded. You could also add a comment in the bzip2 file generation code to remind developers that it should eventually be removed.
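For illustration, the dual-format step during the transition could look something like this. The function and file naming are hypothetical, not the actual export.sh code:

```shell
# Hypothetical dual-format compression step for the transition period.
compress_csv() {
    # zstd is the advertised format going forward.
    command -v zstd  >/dev/null 2>&1 && zstd -q -19 -f -o "$1.zst" "$1"
    # TODO: remove bzip2 generation once HTTP logs show .bz2 downloads died off.
    command -v bzip2 >/dev/null 2>&1 && bzip2 -kf "$1"
}
```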

Also, I can see your PR removes the tar-ing step; can you elaborate on this change?

Updated commit message; the new final paragraph reads:

I've also removed the tar step, which only added overhead since we only
ever created a single archive per file.
@linkmauve (Contributor, Author):

I had mentioned such a transition period in the chat, but nobody reacted. This is now added with a TODO comment.

I’ve also edited the commit message to mention why the tar archive was useless: it only ever contained a single file.

jiru added this to the next milestone (Feb 1, 2025)
@jiru (Member) commented Feb 6, 2025

Thanks for the update. I found a few problems with the export script.

Problem 1

The zstd dependency needs to be installed. Of course I can just apt-get install zstd, but this should also be included in the TatoVM provisioning script. The scripts are located in ansible/roles/. You can create a new role or, alternatively, add it to setup_mysql (not directly related to MySQL, but it's the closest role I can think of). Have a look at the Ansible documentation about installing packages.
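For reference, such a task could be a plain apt task; this is only a sketch, and the role name and file layout are up to you:

```yaml
# ansible/roles/setup_zstd/tasks/main.yml (hypothetical role name)
- name: Install zstd for export compression
  apt:
    name: zstd
    state: present
```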

Problem 2

The script asks for user input. Here is a transcript:

vagrant@tatovm:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
Starting export at 2025-02-06T08:17:44+00:00
Starting SQL scripts at 2025-02-06T08:17:44+00:00
Starting compressing at 2025-02-06T08:17:44+00:00
sentences_base.csv   : 19.45%   (  6844 =>   1331 bytes, sentences_base.csv.zst) 
sentences_detailed.csv :  5.26%   ( 89744 =>   4719 bytes, sentences_detailed.csv.zst) 
links.csv            : 71.82%   (   110 =>     79 bytes, links.csv.zst)        
sentences.csv        : 12.52%   ( 30812 =>   3859 bytes, sentences.csv.zst)    
contributions.csv    :  4.01%   (171113 =>   6858 bytes, contributions.csv.zst) 
sentence_comments.csv :1300.00%   (     0 =>     13 bytes, sentence_comments.csv.zst) 
wall_posts.csv       :1300.00%   (     0 =>     13 bytes, wall_posts.csv.zst)  
tags.csv             : 57.45%   (   141 =>     81 bytes, tags.csv.zst)         
user_lists.csv       : 61.27%   (   204 =>    125 bytes, user_lists.csv.zst)   
sentences_in_lists.csv :168.42%   (    19 =>     32 bytes, sentences_in_lists.csv.zst) 
jpn_indices.csv      :1300.00%   (     0 =>     13 bytes, jpn_indices.csv.zst) 
sentences_with_audio.csv : 26.87%   (   856 =>    230 bytes, sentences_with_audio.csv.zst) 
user_languages.csv   : 52.78%   (   252 =>    133 bytes, user_languages.csv.zst) 
tags_detailed.csv    : 21.55%   (   710 =>    153 bytes, tags_detailed.csv.zst) 
sentences_CC0.csv    :1300.00%   (     0 =>     13 bytes, sentences_CC0.csv.zst) 
transcriptions.csv   : 81.12%   (   535 =>    434 bytes, transcriptions.csv.zst) 
zstd: sentences_base.csv.zst already exists; overwrite (y/N) ? 

Since it's a cron job, we can't have the export script ask for any input, whether it's an overwrite confirmation or anything else. By the way, re-running the script makes it ask for confirmation for every single archive.
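For what it's worth, zstd prompts before overwriting an existing file; passing -f (or --force) makes it overwrite unconditionally, which is what a cron job needs. A sketch on a throwaway file, not the real script:

```shell
# Re-running with -f overwrites existing .zst files without prompting.
tmp=$(mktemp -d)
echo 'some,data' > "$tmp/f.csv"
if command -v zstd >/dev/null 2>&1; then
    zstd -q -19 -f -o "$tmp/f.csv.zst" "$tmp/f.csv"   # first run
    zstd -q -19 -f -o "$tmp/f.csv.zst" "$tmp/f.csv"   # re-run: no prompt
    echo "no prompt needed"
fi
```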

Problem 3

After confirming overwrite, the export script fails with the error:

find: ‘compress_tsv’: No such file or directory
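Guessing from the error, compress_tsv is a shell function defined in the script; find -exec can only run real executables, so a function is invisible to it and find reports "No such file or directory". One fix is to turn it into a standalone script (another is bash's export -f with a bash -c wrapper). A sketch with a stand-in body:

```shell
tmp=$(mktemp -d)
touch "$tmp/a.csv"
# Turn the (hypothetical) compress_tsv function into a real executable
# so that find -exec can run it; the body here is only a stand-in.
cat > "$tmp/compress_tsv" <<'EOF'
#!/bin/sh
printf 'would compress %s\n' "$1"
EOF
chmod +x "$tmp/compress_tsv"
find "$tmp" -name '*.csv' -exec "$tmp/compress_tsv" {} \;
```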

@jiru commented Feb 6, 2025

I advise you to test the export script using TatoVM so that you are running the same environment as the production server.

You’ll have to execute these commands from inside the VM:

# Just run this command once
sudo ln -s /home/vagrant/Tatoeba /var/www-prod
# This creates the files in /var/www-downloads/exports/
cd Tatoeba; sudo ./docs/cron/runner.sh ./docs/cron/export.sh

jiru removed this from the next milestone (Feb 12, 2025)