
Switch exports to Zstd, from bzip2 #3148

Open · wants to merge 1 commit into dev
Conversation

@linkmauve (Contributor):
The main benefit of this format for our users is that it decompresses much faster than bzip2, even at high compression levels.

At level 19 it even compresses better than bzip2 for our files. Hopefully the compression time is still acceptable; if not, we can lower the level so as not to overwork the server, at the price of slightly bigger files.

On my i7-8700K, unarchiving sentences.tar.bz2 takes 15.5 s, compared to 994 ms for sentences.csv.zst compressed at level 19. The file is 183 MiB, compared to 197 MiB with bzip2. We could go down to 167 MiB with level 22 (which decompresses in 941 ms), but compression time starts to get much higher; I'm not sure that is worth it.

The only downside I see to this change is that users' automation will have to be updated, so perhaps we should announce it somehow before deploying.
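The comparison above can be sketched with plain zstd commands. This is a minimal, hypothetical roundtrip on a throwaway file (the real numbers were measured on the full exports); it skips itself if zstd is not installed:

```shell
# Minimal roundtrip sketch: compress at the proposed level 19, then decompress.
tmp=$(mktemp -d)
printf 'id\tlang\ttext\n1\teng\tHello world.\n' > "$tmp/sentences.csv"
if command -v zstd >/dev/null 2>&1; then
    zstd -q -19 -f -o "$tmp/sentences.csv.zst" "$tmp/sentences.csv"
    # Decompression is what users care about; it is fast at any level.
    zstd -q -d -f -o "$tmp/roundtrip.csv" "$tmp/sentences.csv.zst"
    cmp -s "$tmp/sentences.csv" "$tmp/roundtrip.csv" && echo "roundtrip OK"
else
    echo "zstd not installed, skipping"
fi
```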

@jiru (Member) commented Dec 10, 2024

Thank you, this is a welcome improvement, but I wonder if the performance benefits are really worth breaking data consumers' workflows all of a sudden.

I think it would be nicer for data consumers to have a transition period during which we produce both the old bzip2 and the new zstd files, while only advertising the zstd files on the https://tatoeba.org/downloads page. We can then remove the bzip2 file generation code whenever we think it's a good time to do so; we could rely on HTTP logs to check how often the bzip2 files are still being downloaded. You could also add a comment in the bzip2 file generation code to remind developers that it should eventually be removed.
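For illustration, the dual-format step during the transition could look something like this. The function and file naming are hypothetical, not the actual export.sh code:

```shell
# Hypothetical dual-format compression step for the transition period.
compress_csv() {
    # zstd is the advertised format going forward.
    command -v zstd  >/dev/null 2>&1 && zstd -q -19 -f -o "$1.zst" "$1"
    # TODO: remove bzip2 generation once HTTP logs show .bz2 downloads died off.
    command -v bzip2 >/dev/null 2>&1 && bzip2 -kf "$1"
}
```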

Also, I can see your PR removes the tar-ing step; can you elaborate on this change?

Updated commit message; the new final paragraph reads:

I've also removed the tar step, which only added overhead since we only
ever created a single archive per file.
@linkmauve (Contributor, Author):

I had mentioned such a transition period in the chat, but nobody reacted. This is now added with a TODO comment.

I’ve also edited the commit message to mention why the tar archive was useless: it only ever contained a single file.

jiru added this to the next milestone (Feb 1, 2025)
@jiru (Member) commented Feb 6, 2025

Thanks for the update. I found a few problems with the export script.

Problem 1

The zstd dependency needs to be installed. Of course I can just apt-get install zstd, but this should also be included in the TatoVM provisioning script. The scripts are located in ansible/roles/. You can create a new role or, alternatively, add it to setup_mysql (not directly related to MySQL, but it's the closest role I can think of). Have a look at the Ansible documentation about installing packages.
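For reference, such a task could be a plain apt task; this is only a sketch, and the role name and file layout are up to you:

```yaml
# ansible/roles/setup_zstd/tasks/main.yml (hypothetical role name)
- name: Install zstd for export compression
  apt:
    name: zstd
    state: present
```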

Problem 2

The script asks for user input. Here is a transcript:

vagrant@tatovm:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
Starting export at 2025-02-06T08:17:44+00:00
Starting SQL scripts at 2025-02-06T08:17:44+00:00
Starting compressing at 2025-02-06T08:17:44+00:00
sentences_base.csv   : 19.45%   (  6844 =>   1331 bytes, sentences_base.csv.zst) 
sentences_detailed.csv :  5.26%   ( 89744 =>   4719 bytes, sentences_detailed.csv.zst) 
links.csv            : 71.82%   (   110 =>     79 bytes, links.csv.zst)        
sentences.csv        : 12.52%   ( 30812 =>   3859 bytes, sentences.csv.zst)    
contributions.csv    :  4.01%   (171113 =>   6858 bytes, contributions.csv.zst) 
sentence_comments.csv :1300.00%   (     0 =>     13 bytes, sentence_comments.csv.zst) 
wall_posts.csv       :1300.00%   (     0 =>     13 bytes, wall_posts.csv.zst)  
tags.csv             : 57.45%   (   141 =>     81 bytes, tags.csv.zst)         
user_lists.csv       : 61.27%   (   204 =>    125 bytes, user_lists.csv.zst)   
sentences_in_lists.csv :168.42%   (    19 =>     32 bytes, sentences_in_lists.csv.zst) 
jpn_indices.csv      :1300.00%   (     0 =>     13 bytes, jpn_indices.csv.zst) 
sentences_with_audio.csv : 26.87%   (   856 =>    230 bytes, sentences_with_audio.csv.zst) 
user_languages.csv   : 52.78%   (   252 =>    133 bytes, user_languages.csv.zst) 
tags_detailed.csv    : 21.55%   (   710 =>    153 bytes, tags_detailed.csv.zst) 
sentences_CC0.csv    :1300.00%   (     0 =>     13 bytes, sentences_CC0.csv.zst) 
transcriptions.csv   : 81.12%   (   535 =>    434 bytes, transcriptions.csv.zst) 
zstd: sentences_base.csv.zst already exists; overwrite (y/N) ? 

Since it's a cron job, we can't have the export script ask for any input, whether it's an overwrite confirmation or anything else. By the way, re-running the script makes it ask for confirmation for every single archive.
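For what it's worth, zstd prompts before overwriting an existing file; passing -f (or --force) makes it overwrite unconditionally, which is what a cron job needs. A sketch on a throwaway file, not the real script:

```shell
# Re-running with -f overwrites existing .zst files without prompting.
tmp=$(mktemp -d)
echo 'some,data' > "$tmp/f.csv"
if command -v zstd >/dev/null 2>&1; then
    zstd -q -19 -f -o "$tmp/f.csv.zst" "$tmp/f.csv"   # first run
    zstd -q -19 -f -o "$tmp/f.csv.zst" "$tmp/f.csv"   # re-run: no prompt
    echo "no prompt needed"
fi
```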

Problem 3

After confirming overwrite, the export script fails with the error:

find: ‘compress_tsv’: No such file or directory
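Guessing from the error, compress_tsv is a shell function defined in the script; find -exec can only run real executables, so a function is invisible to it and find reports "No such file or directory". One fix is to turn it into a standalone script (another is bash's export -f with a bash -c wrapper). A sketch with a stand-in body:

```shell
tmp=$(mktemp -d)
touch "$tmp/a.csv"
# Turn the (hypothetical) compress_tsv function into a real executable
# so that find -exec can run it; the body here is only a stand-in.
cat > "$tmp/compress_tsv" <<'EOF'
#!/bin/sh
printf 'would compress %s\n' "$1"
EOF
chmod +x "$tmp/compress_tsv"
find "$tmp" -name '*.csv' -exec "$tmp/compress_tsv" {} \;
```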

@jiru commented Feb 6, 2025

I advise you to test the export script using TatoVM so that you are running the same environment as the production server.

You’ll have to execute these commands from inside the VM:

# Just run this command once
sudo ln -s /home/vagrant/Tatoeba /var/www-prod
# This creates the files in /var/www-downloads/exports/
cd Tatoeba; sudo ./docs/cron/runner.sh ./docs/cron/export.sh

jiru removed this from the next milestone (Feb 12, 2025)