
Posted by John Dev

08 Jan, 2025

Updated at 20 Jan, 2025

How do I parallelise this loop with Rayon? (iteration by non-contiguous blocks)

I'm trying to iterate over the pixels of an image (a flat Vec<u16>) in blocks of 2x2 pixels. I have a function block_to_indices() that returns the indices of the pixels in the i-th block:

fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
	let row_floored = 2 * i / width;
	let row_i = row_floored * width;
	[
		row_i         + 2 * i, row_i         + 2 * i + 1,
		row_i + width + 2 * i, row_i + width + 2 * i + 1,
	]
}

For example, for an image 6 pixels wide, for the 0th block this gives [0, 1, 6, 7].
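That mapping can be sanity-checked on its own (a quick standalone check, assuming an even width):

```rust
// block_to_indices from the post, with a standalone sanity check for a
// 6-pixel-wide image.
fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
    let row_floored = 2 * i / width;
    let row_i = row_floored * width;
    [
        row_i + 2 * i,         row_i + 2 * i + 1,
        row_i + width + 2 * i, row_i + width + 2 * i + 1,
    ]
}

fn main() {
    assert_eq!(block_to_indices(6, 0), [0, 1, 6, 7]);
    assert_eq!(block_to_indices(6, 2), [4, 5, 10, 11]);
    // Block 3 starts on the next pair of rows.
    assert_eq!(block_to_indices(6, 3), [12, 13, 18, 19]);
}
```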

I'm iterating over the blocks like this, appending corrected pixel values to another array, result, depending on whether any of the pixels in the 2x2 block are saturated:

let mut result: Vec<Vec<f32>> = vec![Vec::with_capacity(n_images); a[0].len()];
(0..vec.len() / 4).into_iter().for_each(|block_i| {
	let block = block_to_indices(width, block_i);
	if !is_saturated(vec, block, wl) {
		block
			.iter()
			.for_each(|&i| {
				let corrected_val = vec[i].saturating_sub(bl) as f32
					/ 2f32.powi(evs[image_i]) + bl as f32;
				result[i].push(corrected_val);
			})
	}
})

Now, I'm trying to parallelise this loop using Rayon. I tried just replacing into_iter() with into_par_iter(), but of course the issue is that I then cannot write to result from multiple threads.

I tried to reshape this loop into something I can write using a map(), but that is a bit difficult. I could, instead of looping over block_i, loop over the individual pixels and cook up a function that calculates block_i from the pixel index (i.e. the inverse of block_to_indices), but that means calling is_saturated four times as often as required.
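For reference, that inverse is simple index arithmetic (a sketch, assuming an even width; pixel_to_block is a made-up name, not from the original code):

```rust
// Hypothetical inverse of block_to_indices: map a flat pixel index back
// to the index of the 2x2 block containing it. Assumes an even width.
fn pixel_to_block(width: usize, p: usize) -> usize {
    let row = p / width;            // pixel row
    let col = p % width;            // pixel column
    let blocks_per_row = width / 2;
    (row / 2) * blocks_per_row + col / 2
}

fn main() {
    // All four pixels of block 0 in a 6-wide image map back to block 0 ...
    for p in [0, 1, 6, 7] {
        assert_eq!(pixel_to_block(6, p), 0);
    }
    // ... and block 3's pixels sit on the next pair of rows.
    for p in [12, 13, 18, 19] {
        assert_eq!(pixel_to_block(6, p), 3);
    }
}
```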

Rayon does have par_chunks(), which allows you to iterate by chunks, but the problem is that in my case the chunks are non-contiguous, so I cannot use that. My blocks/chunks are non-overlapping though, so in principle writing to result[i] should be sound (i.e. no two threads are ever writing to the same element of result), even though I don't know how to prove that to the compiler.
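One way to express that disjointness to the compiler (a sketch with made-up names and a placeholder value instead of the real correction): every 2x2 block lies entirely inside one pair of image rows, so splitting result into slices of 2 * width elements gives each row pair exclusive access to its cells. The serial chunks_mut below has the same shape as Rayon's par_chunks_mut, which would turn each row pair into a parallel work item; this assumes the image width and height are both even.

```rust
// Sketch: hand each worker a disjoint row-pair slice of `result`, so no
// two blocks can ever touch the same output cell. Serial `chunks_mut`
// is used here; Rayon's `par_chunks_mut(2 * width)` has the same shape.
fn process(width: usize, result: &mut [Vec<f32>]) {
    let blocks_per_row = width / 2;
    result
        .chunks_mut(2 * width) // one pair of image rows per chunk
        .enumerate()
        .for_each(|(pair_i, chunk)| {
            for b in 0..blocks_per_row {
                let block_i = pair_i * blocks_per_row + b;
                // Indices of the 2x2 block, local to this chunk.
                for local in [2 * b, 2 * b + 1, width + 2 * b, width + 2 * b + 1] {
                    chunk[local].push(block_i as f32); // placeholder value
                }
            }
        });
}

fn main() {
    let width = 6;
    let mut result: Vec<Vec<f32>> = vec![Vec::new(); width * 4]; // 4 rows
    process(width, &mut result);
    assert_eq!(result[7], vec![0.0]);  // pixel 7 belongs to block 0
    assert_eq!(result[19], vec![3.0]); // pixel 19 belongs to block 3
}
```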

Another method could be to map() and collect() the results per block, and then re-order them afterwards, but I feel like the cost of the re-ordering (which needs to be single-threaded) would negate the benefits of parallelising the above loop.
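For what it's worth, the re-ordering in that variant is just an indexed scatter over (pixel index, value) pairs, i.e. O(number of pixels) of single-threaded work. A serial sketch with a placeholder value in place of the real correction and saturation test (correct_blocks is a made-up name):

```rust
// Sketch of the map()/collect() variant, serial for clarity. With Rayon,
// the block range would become (0..n).into_par_iter() and the outer
// collect() would run in parallel; the scatter stays single-threaded.
fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
    let row_floored = 2 * i / width;
    let row_i = row_floored * width;
    [
        row_i + 2 * i,         row_i + 2 * i + 1,
        row_i + width + 2 * i, row_i + width + 2 * i + 1,
    ]
}

fn correct_blocks(width: usize, pixels: &[u16]) -> Vec<Vec<f32>> {
    // Map phase: per-block list of (pixel index, corrected value) pairs.
    let per_block: Vec<Vec<(usize, f32)>> = (0..pixels.len() / 4)
        .map(|block_i| {
            block_to_indices(width, block_i)
                .iter()
                .map(|&i| (i, pixels[i] as f32)) // placeholder "correction"
                .collect()
        })
        .collect();
    // Re-order phase: a single-threaded indexed scatter, O(pixels).
    let mut result: Vec<Vec<f32>> = vec![Vec::new(); pixels.len()];
    for pairs in per_block {
        for (i, v) in pairs {
            result[i].push(v);
        }
    }
    result
}

fn main() {
    let pixels: Vec<u16> = (0..24).collect();
    let result = correct_blocks(6, &pixels);
    assert_eq!(result[7], vec![7.0]); // pixel 7, written by block 0
}
```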

Any ideas? Is it possible to keep the above approach but convince the borrow checker in some way that the blocks are non-overlapping? Or maybe by using unsafe?

2 posts - 2 participants
