Add cell variants for generic load/store #4

Merged: 4 commits into okaneco:master, Jul 22, 2025

Conversation

@197g
Contributor

197g commented Jul 21, 2025

Add Is128CellUnaligned, Is256CellUnaligned traits for generic load/store of &Cell<[T; N]> and &[Cell<T>; N] behind shared references that may overlap.

Based on experience from image-canvas, and extrapolating from moxcms, we probably do not want to restrict these intrinsics to interactions with the Rust alias model. In particular, loads and stores can also work on cells. This is a minimal PR to start that discussion.

In the Rust type system, Cells are fundamentally different from normal references in that they may be aliased by other pointers or other cells. However, the load and store instructions do not care: they operate on memory, and these intrinsics do not enforce aliasing.

This even applies to storeu2, which wraps a pair of u128-sized stores.

For instance, this would allow composing stream operations that happen partially in-place (e.g. a[i] = a[i] + a[i+1]) by using Cell::from_mut and then iterating over two overlapping slices of cells that define the inputs and outputs. A minimal scalar sketch follows.
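A minimal scalar sketch of that composition (illustrative naming, no SIMD involved yet):

use std::cell::Cell;

// a[i] = a[i] + a[i+1], in place, over two overlapping slices of cells.
fn pairwise_in_place(x: &mut [i32]) {
    let cells = Cell::from_mut(x).as_slice_of_cells();
    let outputs = &cells[..cells.len() - 1]; // assumes non-empty input
    let inputs = &cells[1..];
    for (out, inp) in outputs.iter().zip(inputs) {
        // Each output uses the not-yet-updated next element.
        out.set(out.get() + inp.get());
    }
}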

197g marked this pull request as draft July 21, 2025 10:43
@okaneco
Owner

okaneco commented Jul 21, 2025

I read some of the discussion and the high level idea of doing things in-place is compelling, so if it solves your use cases I'd be happy to add it.

However, I don't use Cell often, so it'd be helpful if you could demonstrate that overlapping slice idea with an example.

The load functions seem fine, but I'm wondering about the usefulness of the store variants:
Wouldn't the already existing &mut T functions cover the use case? If you already have a &Cell<[T; N]> whose inner type implements the SIMD traits, then you could call get_mut to access the inner type.

From memory, Cell<[T]> isn't that useful on its own; you need to call as_slice_of_cells to get sized element types for slicing operations and iteration over ranges. Then you'd want an impl Is128BitsUnaligned for [Cell<T>; N] {} to be able to use the intrinsics.

Quick sketch of what I mean:
use std::arch::x86_64::{self as arch, __m128i};
use std::cell::Cell;
use std::ptr;

// Boilerplate, approximating this crate's traits
mod private {
    pub trait Sealed {}
}
impl private::Sealed for [Cell<i32>; 4] {}
pub trait Is128BitsUnaligned: private::Sealed {}
impl Is128BitsUnaligned for [Cell<i32>; 4] {}

fn main() {
    let mut x = [1i32, 2, 3, 4, 5, 6, 7, 8];
    test(&mut x);
    dbg!(x); // x = [ 3, 5, 7, 9, 5, 6, 7, 8, ]
}

// Add [0..4] with [1..5], store in [0..4]
fn test(x: &mut [i32]) {
    let cell = Cell::from_mut(x);
    let cells = cell.as_slice_of_cells();
    let cell1: &[Cell<i32>; 4] = cells[..][..4].try_into().unwrap();
    let cell2: &[Cell<i32>; 4] = cells[1..][..4].try_into().unwrap();
    
    let a = unsafe { _mm_loadu_si128_cell(cell1) };
    let b = unsafe { _mm_loadu_si128_cell(cell2) };
    let c = unsafe { arch::_mm_add_epi32(a, b) };
    
    unsafe { _mm_storeu_si128_cell(cell1, c) };
}


#[target_feature(enable = "sse2")]
pub fn _mm_loadu_si128_cell<T: Is128BitsUnaligned>(mem_addr: &T) -> __m128i {
    unsafe { arch::_mm_loadu_si128(ptr::from_ref(mem_addr).cast()) }
}

// Note this implementation is different.
// I think `cast_mut` is sound here because the pointer was originally
// derived from the mutable input slice; this passes Miri on the playground.
#[target_feature(enable = "sse2")]
pub fn _mm_storeu_si128_cell<T: Is128BitsUnaligned>(mem_addr: &T, a: __m128i) {
    unsafe { arch::_mm_storeu_si128(ptr::from_ref(mem_addr).cast_mut().cast(), a) }
}

Aside/bikeshed thoughts for after the above question is resolved, noting them for future me:

  1. If added, I think methods that take Cell arguments should be in a pub mod within the architecture module like src/x86/cell.rs. Then, the functions don't need the _cell suffix.
  2. Would need some tests so it can be run in Miri, but I understand this is a minimal PR for discussion now.

@197g
Contributor Author

197g commented Jul 21, 2025

As written in the Cell::swap documentation:

Using just standard library methods, it is impossible to create such partially overlapping Cells. However, unsafe code is allowed to e.g. create two &Cell<[i32; 2]> that partially overlap.

The standard library will probably grow such methods eventually. For now, it is sound, and verifiable under Miri, to convert the slice from as_slice_of_cells back into a &Cell<[T]>, even though the standard library offers no safe method for that direction; a sketch follows below. The same is true for arrays of cells. Which interface to add will probably depend on the preferred target here. But note that get_mut only works on a mutable cell, whereas the real point of Cell is that we can load and store through a shared cell too, as the intrinsics never create references into the internals.
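For illustration, a sketch of that conversion under my naming; it relies on the documented guarantee that Cell<T> has the same memory layout as T:

use std::cell::Cell;

// Inverse of as_slice_of_cells: sound because `Cell<T>` is
// layout-compatible with `T`, but std has no safe method for it yet.
fn from_slice_of_cells<T>(cells: &[Cell<T>]) -> &Cell<[T]> {
    unsafe { &*(cells as *const [Cell<T>] as *const Cell<[T]>) }
}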

This allows us to work on shared references to different types of cell-wrapped memory; importantly, `Cell<[T; N]>` and `[Cell<T>; N]` will both work with the cell interfaces.
@197g
Contributor Author

197g commented Jul 21, 2025

I've basically switched to your implementation sketch, with some additions to make sure both kinds of arrays ([Cell<T>; N] and Cell<[T; N]>) can be passed to the cell methods; a sketch of that shape is below. Note that these implementations pass Miri. cast_mut passes Miri since it refers to memory within an UnsafeCell; this memory behaves as shared-read-write under a shared reference. As a corollary, a Cell must always be backed by a mutable allocation, either because it was created as a cell type or via from_mut on an underlying mutable slice.
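A hedged sketch of that shape (trait name from the PR title; the merged code may differ in detail):

use std::arch::x86_64::{self as arch, __m128i};
use std::cell::Cell;
use std::ptr;

pub trait Is128CellUnaligned {}
// Both ways of cell-wrapping a 16-byte array have the same layout,
// so one generic load/store can accept either form.
impl Is128CellUnaligned for [Cell<i32>; 4] {}
impl Is128CellUnaligned for Cell<[i32; 4]> {}

#[target_feature(enable = "sse2")]
pub fn _mm_loadu_si128_cell<T: Is128CellUnaligned>(mem_addr: &T) -> __m128i {
    unsafe { arch::_mm_loadu_si128(ptr::from_ref(mem_addr).cast()) }
}

#[target_feature(enable = "sse2")]
pub fn _mm_storeu_si128_cell<T: Is128CellUnaligned>(mem_addr: &T, a: __m128i) {
    unsafe { arch::_mm_storeu_si128(ptr::from_ref(mem_addr).cast_mut().cast(), a) }
}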

@okaneco
Owner

okaneco commented Jul 21, 2025

The standard library will probably create such methods eventually.

Ah, I didn't know that. Hadn't scrolled up to see that in the tracking issue.

But note that get_mut will only work on a mutable cell whereas the real point of Cell is that we can load and store with a shared cell, too, as the intrinsics will never create references into the internals.

Right, despite reading it over several times, I forgot that self needs to be mutable here, which isn't likely in practice.


Really happy with the new changes you've pushed.
I think another trait for handling both array types works out nicely.

I have some review comments to make but otherwise it looks almost good to go.

If you want, you can bump the version to 0.1.2 and update the readme to reference this PR and its title, following the 0.1.1 format.
I can publish a release after the merge passes CI unless you had other/more changes in mind.

@197g
Contributor Author

197g commented Jul 21, 2025

I want to make another pass to make sure that all the relevant interfaces are covered and have tests, etc. There's no need to rush it, and I have a few more things on my plate. It should be done within this week, though.

@okaneco
Owner

okaneco commented Jul 22, 2025

Sounds good to me.

The current tests only cover the as_slice_of_cells cases.
It'd be good to have coverage for loadu/storeu of &Cell<T: Is[128|256]CellUnaligned>.
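A hedged sketch of such a test, assuming the Cell<[i32; 4]> impl and the _cell wrappers from the sketches above:

use std::cell::Cell;

#[test]
fn loadu_storeu_whole_cell() {
    let mut x = [1i32, 2, 3, 4];
    // The whole-array form, &Cell<[i32; 4]>, without as_slice_of_cells.
    let cell = Cell::from_mut(&mut x);
    let v = unsafe { _mm_loadu_si128_cell(cell) };
    unsafe { _mm_storeu_si128_cell(cell, v) };
    assert_eq!(x, [1, 2, 3, 4]);
}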

Also, could you update the PR message to add a line about what this PR ended up being? For example:

Add Is128CellUnaligned, Is256CellUnaligned traits for generic load/store of &Cell<[T; N]> and [Cell<T>; N] arrays

Experience/Feedback

I was playing with the new interface and it's surprisingly ergonomic.
Notably, it lets you avoid writing the slicing mutable reference for the store by reusing the slice of cells you loaded from in the first place.
With regular slices, creating the storing reference with (&mut x[..][..N]).try_into().unwrap() is error-prone: it needs to be wrapped in parentheses; otherwise it creates a temporary that doesn't modify the original slice.
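A small illustration of that pitfall (assumed example, not from the crate):

fn double_first_four(x: &mut [i32]) {
    // Parenthesized: `&mut x[..][..4]` is formed first, so `try_into`
    // yields a `&mut [i32; 4]` that borrows the original slice.
    let dst: &mut [i32; 4] = (&mut x[..][..4]).try_into().unwrap();
    for v in dst {
        *v *= 2; // visible through `x` after the call
    }
    // Without the parentheses, `x[..][..4].try_into().unwrap()` resolves
    // through TryFrom<&[T]> for [T; N] to an owned copy, and `&mut` of
    // that temporary would not write back into `x`.
}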


Two other thoughts I had:

impl Is128/256BitsUnaligned for f32/f64

I think I'm going to add these impls in another PR.

A lot of intrinsics are just typed ways of doing the same thing, so I don't think it makes sense to duplicate everything for Cell variants.
Float impls had little use before your PR, but they would now be able to piggyback on the generics work here for slices of f32/f64. Users would just need the free type cast between __m128/__m128d and __m128i for load/store.
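For instance, a hedged sketch of that free cast, using only the stdarch cast intrinsics, which compile to no instructions:

use std::arch::x86_64::{self as arch, __m128, __m128i};

// Data loaded through the integer path can be reinterpreted as floats
// for free, used here for a packed f32 addition.
#[target_feature(enable = "sse2")]
pub fn add_f32x4(a: __m128i, b: __m128i) -> __m128i {
    let sum: __m128 = unsafe {
        arch::_mm_add_ps(arch::_mm_castsi128_ps(a), arch::_mm_castsi128_ps(b))
    };
    unsafe { arch::_mm_castps_si128(sum) }
}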

movd/_mm_loadu_si32/_mm_storeu_si32

These sse2 functions were the other opinionated design decision I had to make for this crate.

pub fn _mm_loadu_si16(mem_addr: &[u8; 2]) -> __m128i;
pub fn _mm_loadu_si32(mem_addr: &[u8; 4]) -> __m128i;
pub fn _mm_loadu_si64(mem_addr: &[u8; 8]) -> __m128i;
pub fn _mm_storeu_si16(mem_addr: &mut [u8; 2], a: __m128i);
pub fn _mm_storeu_si32(mem_addr: &mut [u8; 4], a: __m128i);
pub fn _mm_storeu_si64(mem_addr: &mut [u8; 8], a: __m128i);

The only image-relevant API hole I can think of is the _si32 variants, which are helpful for 4-byte pixel processing.

There are i32 <=> __m128i and i64 <=> __m128i intrinsics in stdarch that you could write a wrapper around, hoping they optimize without stack copying.

It's not ideal, but I'm not sure it's worth adding 3 more traits like Is{16,32,64}CellUnaligned for these 6 functions at this point in time (or _cell variants that take [Cell<u8>; N] arguments). A sketch of the wrapper idea is below.
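A sketch of that stdarch wrapper idea for the _si32 case (my naming, not an existing API):

use std::arch::x86_64::{self as arch, __m128i};

// Route the 4-byte load/store through the scalar conversion intrinsics
// and hope the byte copies fold into a single movd.
#[target_feature(enable = "sse2")]
pub fn loadu_si32(mem_addr: &[u8; 4]) -> __m128i {
    unsafe { arch::_mm_cvtsi32_si128(i32::from_ne_bytes(*mem_addr)) }
}

#[target_feature(enable = "sse2")]
pub fn storeu_si32(mem_addr: &mut [u8; 4], a: __m128i) {
    *mem_addr = unsafe { arch::_mm_cvtsi128_si32(a) }.to_ne_bytes();
}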

@197g
Contributor Author

197g commented Jul 22, 2025

Float impls had little use before your PR, but they would now be able to piggyback on the generics work here for slices of f32/f64. Users would just need the free type cast between __m128/__m128d and __m128i for load/store.

In the context of image this is not the most pressing issue. Since we have bytemuck, it does not matter too much which of the concrete Pod types are available. bytemuck does not verify size equivalence at compile time, so casting between [u32; 8] and [f32; 8] will involve some unwraps, but this is no worse than try_into().unwrap() for slice-to-array conversion, and it is easily abstracted away (and optimized). In index-ext I've already implemented an ergonomic way to denote the array indexing, so an analogous Pod cast might be neat (also see TexelRange). Then again, we have slice::as_chunks in 1.88, which addresses this for most of SIMD.
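For reference, the slice::as_chunks route looks roughly like this (stable since Rust 1.88):

fn main() {
    // as_chunks splits a slice into array chunks plus a remainder; the
    // `[f32; 4]` chunks can then feed array-based SIMD traits directly.
    let data = [0.5f32; 10];
    let (chunks, remainder): (&[[f32; 4]], &[f32]) = data.as_chunks::<4>();
    assert_eq!((chunks.len(), remainder.len()), (2, 2));
}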

TL;DR: you don't need to solve every ergonomic issue within this crate; focus on the core issue of architecture support.

197g marked this pull request as ready for review July 22, 2025 20:27
@okaneco
Owner

okaneco commented Jul 22, 2025

Thanks for this.

I see you've updated the PR message too, so everything is covered.

okaneco merged commit 5b51129 into okaneco:master Jul 22, 2025
8 checks passed
197g deleted the cells branch July 22, 2025 21:23
okaneco added the enhancement (New feature or request), A-x86/x86_64 (x86/x86_64 architecture), and relnotes (PR or issue that should be mentioned in next release notes/changelog update) labels Jul 24, 2025