Add cell variants for generic load/store #4

Merged: 4 commits into okaneco:master, Jul 22, 2025

Conversation

@197g
Contributor

197g commented Jul 21, 2025

Add Is128CellUnaligned, Is256CellUnaligned traits for generic load/store of &Cell<[T; N]> and &[Cell<T>; N] behind shared references that may overlap.

Based on experience from image-canvas, and extrapolating from moxcms, we probably do not want to restrict these intrinsics to interactions with the Rust alias model. In particular, loads and stores can also work on cells. This is a minimal PR to start that discussion.

In the Rust type system, Cells are fundamentally different from normal references in that they may be aliased by other pointers or other cells. However, the load and store instructions do not care: they operate on memory, and these intrinsics do not enforce aliasing.

This even applies to storeu2, which wraps a pair of u128-sized stores.

For instance, this would allow composing stream operations that happen partially in-place (e.g. a[i] = a[i] + a[i+1]) by using Cell::from_mut and then iterating over two overlapping slices of cells that define the inputs and outputs. A minimal scalar sketch follows.
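A minimal scalar sketch of that composition (illustrative naming, no SIMD involved yet):

use std::cell::Cell;

// a[i] = a[i] + a[i+1], in place, over two overlapping slices of cells.
fn pairwise_in_place(x: &mut [i32]) {
    let cells = Cell::from_mut(x).as_slice_of_cells();
    let outputs = &cells[..cells.len() - 1]; // assumes non-empty input
    let inputs = &cells[1..];
    for (out, inp) in outputs.iter().zip(inputs) {
        // Each output uses the not-yet-updated next element.
        out.set(out.get() + inp.get());
    }
}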

197g marked this pull request as draft July 21, 2025 10:43
@okaneco
Owner

okaneco commented Jul 21, 2025

I read some of the discussion and the high level idea of doing things in-place is compelling, so if it solves your use cases I'd be happy to add it.

However, I don't use Cell often, so it'd be helpful if you could demonstrate that overlapping slice idea with an example.

The load functions seem fine, but I'm wondering about the usefulness of the store variants:
Wouldn't the already existing &mut T functions cover the use case? If you already have a &Cell<[T; N]> whose inner type implements the SIMD traits, then you could call get_mut to access the inner type.

From memory, Cell<[T]> isn't that useful on its own; you need to call as_slice_of_cells to get sized element types for slicing operations and iteration over ranges. Then you'd want an impl Is128BitsUnaligned for [Cell<T>; N] {} to be able to use the intrinsics.

Quick sketch of what I mean:
use std::arch::x86_64::{self as arch, __m128i};
use std::cell::Cell;
use std::ptr;

// Boilerplate, approximating this crate's traits
mod private {
    pub trait Sealed {}
}
impl private::Sealed for [Cell<i32>; 4] {}
pub trait Is128BitsUnaligned: private::Sealed {}
impl Is128BitsUnaligned for [Cell<i32>; 4] {}

fn main() {
    let mut x = [1i32, 2, 3, 4, 5, 6, 7, 8];
    test(&mut x);
    dbg!(x); // x = [ 3, 5, 7, 9, 5, 6, 7, 8, ]
}

// Add [0..4] with [1..5], store in [0..4]
fn test(x: &mut [i32]) {
    let cell = Cell::from_mut(x);
    let cells = cell.as_slice_of_cells();
    let cell1: &[Cell<i32>; 4] = cells[..][..4].try_into().unwrap();
    let cell2: &[Cell<i32>; 4] = cells[1..][..4].try_into().unwrap();
    
    let a = unsafe { _mm_loadu_si128_cell(cell1) };
    let b = unsafe { _mm_loadu_si128_cell(cell2) };
    let c = unsafe { arch::_mm_add_epi32(a, b) };
    
    unsafe { _mm_storeu_si128_cell(cell1, c) };
}


#[target_feature(enable = "sse2")]
pub fn _mm_loadu_si128_cell<T: Is128BitsUnaligned>(mem_addr: &T) -> __m128i {
    unsafe { arch::_mm_loadu_si128(ptr::from_ref(mem_addr).cast()) }
}

// Note this implementation is different.
// I think `cast_mut` is sound here because the pointer was originally
// derived from the mutable input slice; this passes Miri on the playground.
#[target_feature(enable = "sse2")]
pub fn _mm_storeu_si128_cell<T: Is128BitsUnaligned>(mem_addr: &T, a: __m128i) {
    unsafe { arch::_mm_storeu_si128(ptr::from_ref(mem_addr).cast_mut().cast(), a) }
}

Aside/bikeshed thoughts for after the above question is resolved, noting them for future me:

  1. If added, I think methods that take Cell arguments should be in a pub mod within the architecture module like src/x86/cell.rs. Then, the functions don't need the _cell suffix.
  2. Would need some tests so it can be run in Miri, but I understand this is a minimal PR for discussion now.

@197g
Contributor Author

197g commented Jul 21, 2025

As written in the Cell::swap documentation:

Using just standard library methods, it is impossible to create such partially overlapping Cells. However, unsafe code is allowed to e.g. create two &Cell<[i32; 2]> that partially overlap.

The standard library will probably grow such methods eventually. For now, it is sound, and verifiable under Miri, to convert the slice from as_slice_of_cells back into a &Cell<[T]>, even though the standard library offers no safe method for that direction; a sketch follows below. The same is true for arrays of cells. Which interface to add will probably depend on the preferred target here. But note that get_mut only works on a mutable cell, whereas the real point of Cell is that we can load and store through a shared cell too, as the intrinsics never create references into the internals.
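For illustration, a sketch of that conversion under my naming; it relies on the documented guarantee that Cell<T> has the same memory layout as T:

use std::cell::Cell;

// Inverse of as_slice_of_cells: sound because `Cell<T>` is
// layout-compatible with `T`, but std has no safe method for it yet.
fn from_slice_of_cells<T>(cells: &[Cell<T>]) -> &Cell<[T]> {
    unsafe { &*(cells as *const [Cell<T>] as *const Cell<[T]>) }
}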

This allows us to work on shared references to different types of cell-wrapped memory; importantly, `Cell<[T; N]>` and `[Cell<T>; N]` will both work with the cell interfaces.
@197g
Contributor Author

197g commented Jul 21, 2025

I've basically switched to your implementation sketch, with some additions to make sure both kinds of arrays ([Cell<T>; N] and Cell<[T; N]>) can be passed to the cell methods; a sketch of that shape is below. Note that these implementations pass Miri. cast_mut passes Miri since it refers to memory within an UnsafeCell; this memory behaves as shared-read-write under a shared reference. As a corollary, a Cell must always be backed by a mutable allocation, either because it was created as a cell type or via from_mut on an underlying mutable slice.
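A hedged sketch of that shape (trait name from the PR title; the merged code may differ in detail):

use std::arch::x86_64::{self as arch, __m128i};
use std::cell::Cell;
use std::ptr;

pub trait Is128CellUnaligned {}
// Both ways of cell-wrapping a 16-byte array have the same layout,
// so one generic load/store can accept either form.
impl Is128CellUnaligned for [Cell<i32>; 4] {}
impl Is128CellUnaligned for Cell<[i32; 4]> {}

#[target_feature(enable = "sse2")]
pub fn _mm_loadu_si128_cell<T: Is128CellUnaligned>(mem_addr: &T) -> __m128i {
    unsafe { arch::_mm_loadu_si128(ptr::from_ref(mem_addr).cast()) }
}

#[target_feature(enable = "sse2")]
pub fn _mm_storeu_si128_cell<T: Is128CellUnaligned>(mem_addr: &T, a: __m128i) {
    unsafe { arch::_mm_storeu_si128(ptr::from_ref(mem_addr).cast_mut().cast(), a) }
}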

@okaneco
Owner

okaneco commented Jul 21, 2025

The standard library will probably create such methods eventually.

Ah, I didn't know that. Hadn't scrolled up to see that in the tracking issue.

But note that get_mut will only work on a mutable cell whereas the real point of Cell is that we can load and store with a shared cell, too, as the intrinsics will never create references into the internals.

Right, despite reading it over several times, I forgot that self needs to be mutable here, which isn't likely in practice.


Really happy with the new changes you've pushed.
I think another trait for handling both array types works out nicely.

I have some review comments to make but otherwise it looks almost good to go.

If you want, you can bump the version to 0.1.2 and update the readme to reference this PR and its title, following the 0.1.1 format.
I can publish a release after the merge passes CI unless you had other/more changes in mind.

@197g
Contributor Author

197g commented Jul 21, 2025

I want to make another pass to make sure that all the relevant interfaces are covered and have tests, etc. There's no need to rush it, and I have a few more things on my plate. It should be done within this week, though.

@okaneco
Owner

okaneco commented Jul 22, 2025

Sounds good to me.

The current tests only cover the as_slice_of_cells cases.
It'd be good to have coverage for loadu/storeu of &Cell<T: Is[128|256]CellUnaligned>.
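A hedged sketch of such a test, assuming the Cell<[i32; 4]> impl and the _cell wrappers from the sketches above:

use std::cell::Cell;

#[test]
fn loadu_storeu_whole_cell() {
    let mut x = [1i32, 2, 3, 4];
    // The whole-array form, &Cell<[i32; 4]>, without as_slice_of_cells.
    let cell = Cell::from_mut(&mut x);
    let v = unsafe { _mm_loadu_si128_cell(cell) };
    unsafe { _mm_storeu_si128_cell(cell, v) };
    assert_eq!(x, [1, 2, 3, 4]);
}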

Also, could you update the PR message to add a line about what this PR ended up being? For example:

Add Is128CellUnaligned, Is256CellUnaligned traits for generic load/store of &Cell<[T; N]> and [Cell<T>; N] arrays

Experience/Feedback

I was playing with the new interface and it's surprisingly ergonomic.
Notably, it lets you avoid writing the slicing mutable reference for the store by reusing the slice of cells you loaded from in the first place.
With regular slices, creating the storing reference with (&mut x[..][..N]).try_into().unwrap() is error-prone: it needs to be wrapped in parentheses; otherwise it creates a temporary that doesn't modify the original slice.
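A small illustration of that pitfall (assumed example, not from the crate):

fn double_first_four(x: &mut [i32]) {
    // Parenthesized: `&mut x[..][..4]` is formed first, so `try_into`
    // yields a `&mut [i32; 4]` that borrows the original slice.
    let dst: &mut [i32; 4] = (&mut x[..][..4]).try_into().unwrap();
    for v in dst {
        *v *= 2; // visible through `x` after the call
    }
    // Without the parentheses, `x[..][..4].try_into().unwrap()` resolves
    // through TryFrom<&[T]> for [T; N] to an owned copy, and `&mut` of
    // that temporary would not write back into `x`.
}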


Two other thoughts I had:

impl Is128/256BitsUnaligned for f32/f64

I think I'm going to add these impls in another PR.

A lot of intrinsics are just typed ways of doing the same thing, so I don't think it makes sense to duplicate everything for Cell variants.
Float impls had little use before your PR, but they would now be able to piggyback on the generics work here for slices of f32/f64. Users would just need the free type cast between __m128/__m128d and __m128i for load/store.
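For instance, a hedged sketch of that free cast, using only the stdarch cast intrinsics, which compile to no instructions:

use std::arch::x86_64::{self as arch, __m128, __m128i};

// Data loaded through the integer path can be reinterpreted as floats
// for free, used here for a packed f32 addition.
#[target_feature(enable = "sse2")]
pub fn add_f32x4(a: __m128i, b: __m128i) -> __m128i {
    let sum: __m128 = unsafe {
        arch::_mm_add_ps(arch::_mm_castsi128_ps(a), arch::_mm_castsi128_ps(b))
    };
    unsafe { arch::_mm_castps_si128(sum) }
}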

movd/_mm_loadu_si32/_mm_storeu_si32

These sse2 functions were the other opinionated design decision I had to make for this crate.

pub fn _mm_loadu_si16(mem_addr: &[u8; 2]) -> __m128i;
pub fn _mm_loadu_si32(mem_addr: &[u8; 4]) -> __m128i;
pub fn _mm_loadu_si64(mem_addr: &[u8; 8]) -> __m128i;
pub fn _mm_storeu_si16(mem_addr: &mut [u8; 2], a: __m128i);
pub fn _mm_storeu_si32(mem_addr: &mut [u8; 4], a: __m128i);
pub fn _mm_storeu_si64(mem_addr: &mut [u8; 8], a: __m128i);

The only image-relevant API hole I can think of is the _si32 variants, which are helpful for 4-byte pixel processing.

There are i32 <=> __m128i and i64 <=> __m128i intrinsics in stdarch that you could write a wrapper around, hoping they optimize without stack copying.

It's not ideal, but I'm not sure it's worth adding 3 more traits like Is{16,32,64}CellUnaligned for these 6 functions at this point in time (or _cell variants that take [Cell<u8>; N] arguments). A sketch of the wrapper idea is below.
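A sketch of that stdarch wrapper idea for the _si32 case (my naming, not an existing API):

use std::arch::x86_64::{self as arch, __m128i};

// Route the 4-byte load/store through the scalar conversion intrinsics
// and hope the byte copies fold into a single movd.
#[target_feature(enable = "sse2")]
pub fn loadu_si32(mem_addr: &[u8; 4]) -> __m128i {
    unsafe { arch::_mm_cvtsi32_si128(i32::from_ne_bytes(*mem_addr)) }
}

#[target_feature(enable = "sse2")]
pub fn storeu_si32(mem_addr: &mut [u8; 4], a: __m128i) {
    *mem_addr = unsafe { arch::_mm_cvtsi128_si32(a) }.to_ne_bytes();
}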

@197g
Contributor Author

197g commented Jul 22, 2025

Float impls had little use before your PR, but they would now be able to piggyback on the generics work here for slices of f32/f64. Users would just need the free type cast between __m128/__m128d and __m128i for load/store.

In the context of image this is not the most pressing issue. Since we have bytemuck, it does not matter too much which of the concrete Pod types are available. bytemuck does not verify size equivalence at compile time, so casting between [u32; 8] and [f32; 8] will involve some unwraps, but this is no worse than try_into().unwrap() for slice-to-array conversion, and it is easily abstracted away (and optimized). In index-ext I've already implemented an ergonomic way to denote the array indexing, so an analogous Pod cast might be neat (also see TexelRange). Then again, we have slice::as_chunks in 1.88, which addresses this for most of SIMD.
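For reference, the slice::as_chunks route looks roughly like this (stable since Rust 1.88):

fn main() {
    // as_chunks splits a slice into array chunks plus a remainder; the
    // `[f32; 4]` chunks can then feed array-based SIMD traits directly.
    let data = [0.5f32; 10];
    let (chunks, remainder): (&[[f32; 4]], &[f32]) = data.as_chunks::<4>();
    assert_eq!((chunks.len(), remainder.len()), (2, 2));
}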

TL;DR: you don't need to solve every ergonomic issue within this crate; focus on the core issue of architecture support.

197g marked this pull request as ready for review July 22, 2025 20:27
@okaneco
Owner

okaneco commented Jul 22, 2025

Thanks for this.

I see you've updated the PR message too, so everything is covered.

okaneco merged commit 5b51129 into okaneco:master Jul 22, 2025
8 checks passed
197g deleted the cells branch July 22, 2025 21:23
okaneco added the enhancement (New feature or request), A-x86/x86_64 (x86/x86_64 architecture), and relnotes (PR or issue that should be mentioned in next release notes/changelog update) labels Jul 24, 2025