`recordsize` considerations for home directories #17326

LunarLambda · 2025-05-12T08:44:35Z

To preface: Yes, I know that ashift=12 and recordsize=4K is a terrible idea, especially on raidz.

I have my Linux home directories on a raidz1 pool, and I've been wondering about optimizing the recordsize for it, to reduce read/write amplification.

I already have a dataset for my steam library with recordsize=1M, which has yielded a nice improvement to compression and speed.

I created a histogram of my home directory by file size:

         0 1361
         1 126
         2  26
         4 960
         8 1879
        16 1468
        32 3173
        64 6317
       128 8273
       256 34164
       512 16407
     1.0Ki 20193
     2.0Ki 29899
     4.0Ki 65995
     8.0Ki 29565
      16Ki 20209
      32Ki 13506
      64Ki 8648
     128Ki 4621
     256Ki 2555
     512Ki 1793
     1.0Mi 1312
     2.0Mi 606
     4.0Mi 342
     8.0Mi 191
      16Mi 104
      32Mi  98
      64Mi  14
     128Mi  20
     256Mi   5
     512Mi   2
     1.0Gi   3
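
The post doesn't say how the histogram was generated. As a rough sketch, something like the following Python would produce a comparable table, assuming each bucket is labelled by the largest power of two not exceeding the file size (that bucketing rule, and the names `bucket` and `size_histogram`, are my assumptions, not the original tool):

```python
import os
import stat
from collections import Counter


def bucket(size: int) -> int:
    """Return the largest power of two <= size (0 for empty files)."""
    return 0 if size == 0 else 1 << (size.bit_length() - 1)


def size_histogram(root: str) -> Counter:
    """Count regular files under `root` per power-of-two size bucket."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                counts[bucket(st.st_size)] += 1
    return counts
```

Calling `size_histogram(os.path.expanduser("~"))` and printing the sorted items gives one line per bucket. Note this counts apparent file sizes, not on-disk allocation.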

Most files fall in the 512 B to 64K range. Much of it is git repositories, with the rest being various media files (images, music, videos).

I know that record size is more about I/O size than file size, so this is a flawed heuristic to start with. However, most of the programs that write to /home should write the entire file in one go, so I see the two as roughly comparable.

From what I understand, the default of recordsize=128K is a "not awful for anything, not amazing for anything" compromise. Does setting something like recordsize=16K or recordsize=32K make sense here? (And maybe something like recordsize=1M for the media directories?)

Since it's a home directory I use daily, I would like to avoid too much experimentation, as meaningfully changing the recordsize of existing files to see if it improves things would require essentially sending/receiving the dataset(s) back and forth.

IvanVolosyuk · 2025-05-12T09:07:40Z

You are using compression, right? So, why do you want to set recordsize smaller if you don't care about I/O size? The files which are smaller than recordsize will use less storage on disk after compression. So, there is no reason to decrease recordsize if you have a lot of small files and most of them are read/written fully every time.

2 replies

LunarLambda May 12, 2025
Author

Because of read/write amplification, no? If a file is small, possibly even smaller after compression, why do I make ZFS read/write/cache entire 128K blocks for it?

EDIT: I'm not sure I fully understand how recordsize concretely affects ZFS operations. Because recordsize is a maximum block size, blocks can be smaller, but there are also considerations around stripe width in raidz, where a block is split into sector-sized (2^ashift) pieces.

LunarLambda May 12, 2025
Author

It seems like this is only a concern if the I/O size is smaller than the block size (which is anywhere from 512 bytes up to recordsize). So, e.g., updating a 4K region of a 128K file would be bad, but overwriting a 1K file with new contents would just read/write the single 1K block?
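
To make the two cases above concrete, here is a minimal sketch of the read-modify-write cost of a single write. It is a simplified model with loud assumptions: the affected blocks are not cached, compression, parity and metadata are ignored, and `rmw_cost` is a hypothetical helper, not anything ZFS exposes.

```python
def rmw_cost(offset: int, length: int, block_size: int) -> tuple[int, int]:
    """Return (bytes_read, bytes_written) for one write of `length` bytes.

    Simplified model: every touched block is rewritten in full, and a block
    has to be read first only if the write covers it partially.
    """
    first = offset // block_size
    last = (offset + length - 1) // block_size
    written = (last - first + 1) * block_size

    partial = set()
    if offset % block_size != 0:
        partial.add(first)  # write starts in the middle of a block
    if (offset + length) % block_size != 0:
        partial.add(last)   # write ends in the middle of a block
    read = len(partial) * block_size
    return read, written
```

Under this model, `rmw_cost(0, 4096, 131072)` reads and writes a full 128K block for a 4K update, while `rmw_cost(0, 1024, 1024)` reads nothing and writes just the 1K block.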

gmelikov · 2025-05-12T10:35:02Z
Collaborator

128K is a sweet spot for HDDs latency-wise (because an HDD can't do more than ~300 IOPS), and IMHO for SSDs in home usage too.
Files smaller than recordsize use an appropriately sized block until they grow to the actual recordsize (so with recordsize=128K, a 3K file will have an actual record size of 4K), so you already get "tuned" behavior by default for the majority of your small files.
If you use compression, only the actual compressed size is stored.

So the 128K default is still good. You may fine-tune further, but IMHO it will be a waste of your time. A 16-32K recordsize is mainly useful for databases.
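
The sizing rule of thumb above can be sketched as a small function. This is only an illustration of the statement in this reply: it assumes a file smaller than recordsize gets a single block rounded up to the 2^ashift sector size, and that larger files use full recordsize blocks. The rounding granularity and the name `effective_block_size` are assumptions on my part.

```python
def effective_block_size(file_size: int,
                         recordsize: int = 128 * 1024,
                         ashift: int = 12) -> int:
    """Approximate block size used for a file (simplified model)."""
    if file_size == 0:
        return 0  # an empty file needs no data block
    if file_size >= recordsize:
        return recordsize  # large files are split into recordsize blocks
    sector = 1 << ashift
    rounded = -(-file_size // sector) * sector  # round up to a whole sector
    return min(rounded, recordsize)
```

With the defaults, `effective_block_size(3 * 1024)` gives 4096, matching the 3K example above, which is why lowering recordsize does little for files that are already much smaller than it.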

3 replies

LunarLambda May 12, 2025
Author

I see.

Does the use of a raidz change the consideration at all?
Trying to make sure I understand correctly:

With raidz1, 3 disks, ashift=12 and recordsize=128K:

blocks read/written are always minimum 2^ashift size, so 4K.

stripe width (i.e., how many disks are accessed to write a stripe) can be 2..3, with either 1 or 2 data blocks and 1 parity block

For small files, this means a minimum of 8K is written, and a minimum of 4K is read. However, this is fixed by ashift, so not something preventable by adjusting recordsize.

If recordsize is <= 2^ashift, then stripe width is always forced to be 2, meaning space efficiency is 50% instead of the maximum 66% ((n_disks - n_parity) / n_disks, or 2/3). With more disks, better efficiency is achieved, but at higher risk, since the likelihood of multiple disk failures increases without additional redundancy.

In all other cases, recordsize affects things as normal with regards to read/write amplification, with more stripes having to be read/written for larger blocks, which is bad if I/O size is small.
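
A rough way to check these numbers is to compute how many sectors a block occupies on raidz. The sketch below follows the allocation arithmetic as it is commonly described (data sectors, plus `nparity` parity sectors per row, with the total padded to a multiple of `nparity + 1`). Treat it as a model under those assumptions rather than the actual OpenZFS code. The function names are made up for illustration.

```python
def raidz_allocated_bytes(psize: int, ndisks: int, nparity: int, ashift: int) -> int:
    """Estimate bytes allocated on raidz for a block of `psize` bytes."""
    sector = 1 << ashift
    data = -(-psize // sector)            # data sectors, rounded up
    data_cols = ndisks - nparity          # data sectors per full row
    rows = -(-data // data_cols)          # rows needed, rounded up
    total = data + nparity * rows         # each row carries nparity parity sectors
    pad_unit = nparity + 1
    total = -(-total // pad_unit) * pad_unit  # pad to a multiple of nparity + 1
    return total * sector


def raidz_efficiency(psize: int, ndisks: int, nparity: int, ashift: int) -> float:
    """Fraction of the allocated space that holds actual data."""
    return psize / raidz_allocated_bytes(psize, ndisks, nparity, ashift)
```

For raidz1 with 3 disks and ashift=12, this model gives 8K allocated for a 4K block (50%), 16K for an 8K block (also 50%, because of the padding), and 192K for a 128K block (about 67%).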

gmelikov May 12, 2025
Collaborator

I've tried to describe raidz allocation here https://openzfs.github.io/openzfs-docs/Basic%20Concepts/RAIDZ.html

I discovered that the table may not be available there, so here's my temporary copy: https://docs.google.com/spreadsheets/d/1_CO8x03VICdiIMulDjQi9NDBd53qFpUreMQVrF1uS28/edit?usp=sharing

I recommend checking allocation efficiency and running some benchmarks. It would be great if you shared the results here too.

Several years ago I tested lmdb on an HDD mirror: 128K vs 4K (lmdb's native block is 4K) made a real difference. It was not huge, but really noticeable. I didn't have much RAM for ARC, though.

But I personally prefer space efficiency over the RMW price. Here is my laptop's NVMe after 1.5 years with root-on-zfs:

root@minime:~# smartctl -A /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.22-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    0%
Data Units Read:                    66,634,522 [34.1 TB]
Data Units Written:                 20,898,474 [10.7 TB]
Host Read Commands:                 526,104,961
Host Write Commands:                162,249,636
Controller Busy Time:               556
Power Cycles:                       2,969,186
Power On Hours:                     4,193
Unsafe Shutdowns:                   85
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    28
Critical Comp. Temperature Time:    8
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               48 Celsius
Thermal Temp. 1 Transition Count:   69

dataset stats

# zfs get recordsize,compressratio,compression,used,logicalused | grep -v docker | grep -v '@' | grep -v '#'

rpool                    recordsize     128K            default
rpool                    compressratio  1.26x           -
rpool                    compression    lz4             local
rpool                    used           1.22T           -
rpool                    logicalused    1.52T           -
rpool/ROOT               recordsize     128K            default
rpool/ROOT               compressratio  2.02x           -
rpool/ROOT               compression    lz4             inherited from rpool
rpool/ROOT               used           34.4G           -
rpool/ROOT               logicalused    66.5G           -
rpool/ROOT/debian        recordsize     128K            default
rpool/ROOT/debian        compressratio  2.02x           -
rpool/ROOT/debian        compression    lz4             inherited from rpool
rpool/ROOT/debian        used           34.4G           -
rpool/ROOT/debian        logicalused    66.5G           -
rpool/home               recordsize     128K            default
rpool/home               compressratio  1.30x           -
rpool/home               compression    lz4             inherited from rpool
rpool/home               used           250G            -
rpool/home               logicalused    311G            -
rpool/home/gmelikov      recordsize     128K            default
rpool/home/gmelikov      compressratio  1.30x           -
rpool/home/gmelikov      compression    lz4             inherited from rpool
rpool/home/gmelikov      used           250G            -
rpool/home/gmelikov      logicalused    310G            -
rpool/home/root          recordsize     128K            default
rpool/home/root          compressratio  1.42x           -
rpool/home/root          compression    lz4             inherited from rpool
rpool/home/root          used           222M            -
rpool/home/root          logicalused    306M            -
rpool/var                recordsize     128K            default
rpool/var                compressratio  1.22x           -
rpool/var                compression    lz4             inherited from rpool
rpool/var                used           963G            -
rpool/var                logicalused    1.15T           -
rpool/var/cache          recordsize     128K            default
rpool/var/cache          compressratio  1.04x           -
rpool/var/cache          compression    lz4             inherited from rpool
rpool/var/cache          used           5.29G           -
rpool/var/cache          logicalused    5.49G           -
rpool/var/games          recordsize     128K            default
rpool/var/games          compressratio  1.16x           -
rpool/var/games          compression    lz4             inherited from rpool
rpool/var/games          used           916G            -
rpool/var/games          logicalused    1.04T           -
rpool/var/games/llm      recordsize     128K            default
rpool/var/games/llm      compressratio  1.00x           -
rpool/var/games/llm      compression    lz4             inherited from rpool
rpool/var/games/llm      used           143G            -
rpool/var/games/llm      logicalused    144G            -
rpool/var/games/movies   recordsize     1M              local
rpool/var/games/movies   compressratio  1.00x           -
rpool/var/games/movies   compression    lz4             local
rpool/var/games/movies   used           95.6G           -
rpool/var/games/movies   logicalused    96.0G           -
rpool/var/lib            recordsize     128K            default
rpool/var/lib            compressratio  2.66x           -
rpool/var/lib            compression    lz4             inherited from rpool
rpool/var/lib            used           38.0G           -
rpool/var/lib            logicalused    99.6G           -
rpool/var/log            recordsize     128K            default
rpool/var/log            compressratio  2.41x           -
rpool/var/log            compression    lz4             inherited from rpool
rpool/var/log            used           3.01G           -
rpool/var/log            logicalused    7.26G           -
rpool/var/spool          recordsize     128K            default
rpool/var/spool          compressratio  1.25x           -
rpool/var/spool          compression    lz4             inherited from rpool
rpool/var/spool          used           193M            -
rpool/var/spool          logicalused    242M            -
rpool/var/tmp            recordsize     128K            default
rpool/var/tmp            compressratio  2.26x           -
rpool/var/tmp            compression    lz4             inherited from rpool
rpool/var/tmp            used           333M            -
rpool/var/tmp            logicalused    753M            -

IvanVolosyuk May 12, 2025

> For small files, this means a minimum of 8K is written, and a minimum of 4K is read. However, this is fixed by ashift, so not something preventable by adjusting recordsize.

Yes, that's correct for ashift=12, except for embedded records: really small files are stored in the block pointer itself. Overall, the space efficiency of small files depends on ashift, and you will not improve it by playing with recordsize.
