I'm trying to iterate over pixels in an image (a flat `Vec<u16>`) by blocks of 2x2 pixels. I have a function `block_to_indices()` that returns the indices of the pixels in the i-th block:
```rust
fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
    let row_floored = 2 * i / width;
    let row_i = row_floored * width;
    [
        row_i + 2 * i, row_i + 2 * i + 1,
        row_i + width + 2 * i, row_i + width + 2 * i + 1,
    ]
}
```
For example, for an image 6 pixels wide, the 0-th block gives `[0, 1, 6, 7]`.
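To make that concrete, here is a quick self-contained check of the mapping (copying the function above, and assuming an even width):

```rust
fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
    let row_floored = 2 * i / width;
    let row_i = row_floored * width;
    [
        row_i + 2 * i, row_i + 2 * i + 1,
        row_i + width + 2 * i, row_i + width + 2 * i + 1,
    ]
}

fn main() {
    // For a 6-pixel-wide image there are 3 blocks per pair of rows.
    assert_eq!(block_to_indices(6, 0), [0, 1, 6, 7]);
    assert_eq!(block_to_indices(6, 2), [4, 5, 10, 11]);
    // Block 3 wraps onto the next row pair.
    assert_eq!(block_to_indices(6, 3), [12, 13, 18, 19]);
}
```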
I'm iterating over the blocks like this, appending the pixel values to another array `result`, depending on whether any of the pixels in the 2x2 block are saturated:
```rust
let mut result: Vec<Vec<f32>> = vec![Vec::with_capacity(n_images); a[0].len()];
(0..vec.len() / 4).into_iter().for_each(|block_i| {
    let block = block_to_indices(width, block_i);
    if !is_saturated(vec, block, wl) {
        block.iter().for_each(|&i| {
            let corrected_val =
                vec[i].saturating_sub(bl) as f32 / 2f32.powi(evs[image_i]) + bl as f32;
            result[i].push(corrected_val);
        })
    }
})
```
Now, I'm trying to parallelise this loop using Rayon. I tried just replacing `into_iter()` with `into_par_iter()`, but of course the issue is that I then cannot write to `result` from multiple threads.
I tried to reshape this loop into something I can write using a `map()`, but that is a bit difficult. Instead of looping over `block_i`, I could loop over the individual pixels and cook up a function that calculates `block_i` from the pixel index (i.e. the inverse of `block_to_indices`), but that means I'd be calling `is_saturated` 4x as often as is required.
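To spell that out, the inverse I have in mind would be something like this (a sketch; `pixel_to_block` is a hypothetical name, and it assumes an even width):

```rust
// Hypothetical inverse of block_to_indices: pixel index -> block index.
// Assumes width is even, so every row splits cleanly into 2-wide blocks.
fn pixel_to_block(width: usize, pixel_i: usize) -> usize {
    let row = pixel_i / width;
    let col = pixel_i % width;
    let blocks_per_row = width / 2;
    (row / 2) * blocks_per_row + col / 2
}

fn main() {
    // All four pixels of block 0 (width 6) map back to block 0...
    for &p in &[0, 1, 6, 7] {
        assert_eq!(pixel_to_block(6, p), 0);
    }
    // ...and block 3's pixels map back to block 3.
    for &p in &[12, 13, 18, 19] {
        assert_eq!(pixel_to_block(6, p), 3);
    }
}
```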
Rayon does have `par_chunks()`, which lets you iterate by chunks, but the problem is that in my case the chunks are non-contiguous, so I cannot use it. My blocks/chunks are non-overlapping, though, so in principle writing to `result[i]` should be sound (i.e. no two threads ever write to the same element of `result`), even though I don't know how to prove that to the compiler.
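One way I can picture the non-overlap: a block only ever touches two adjacent rows, so splitting `result` into two-row chunks gives each chunk exclusive access to its blocks' pixels. A single-threaded sketch of that idea (`fill_by_row_pairs` is a hypothetical name; it assumes even width and height, and with Rayon the `chunks_mut` call would become `par_chunks_mut`):

```rust
// Sketch: each two-row chunk of `result` owns its blocks' pixels exclusively,
// which is exactly the disjointness the compiler would need to see.
fn fill_by_row_pairs(width: usize, height: usize) -> Vec<Vec<f32>> {
    let mut result: Vec<Vec<f32>> = vec![Vec::new(); width * height];
    for (pair_i, rows) in result.chunks_mut(2 * width).enumerate() {
        // `rows` covers pixel rows 2*pair_i and 2*pair_i + 1, nothing else.
        for block_col in 0..width / 2 {
            // Block indices are relative to the chunk, so they never overlap
            // with any other chunk's indices.
            let idx = [
                2 * block_col, 2 * block_col + 1,
                width + 2 * block_col, width + 2 * block_col + 1,
            ];
            for &i in &idx {
                rows[i].push(pair_i as f32); // placeholder for corrected_val
            }
        }
    }
    result
}

fn main() {
    // Every pixel is written exactly once: the chunks are provably disjoint.
    assert!(fill_by_row_pairs(6, 4).iter().all(|v| v.len() == 1));
}
```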
Another method could be to `map()` and `collect()` the results per block, and then re-order them afterwards, but I suspect the cost of the re-ordering (which needs to be single-threaded) would negate the benefit of parallelising the loop.
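For completeness, the map-and-collect version I have in mind looks roughly like this (a sketch: `map_then_scatter` is a hypothetical name, and the saturation test and per-pixel value are stand-ins for `is_saturated` and `corrected_val`; with Rayon, `into_par_iter()` on a range collects in the same block order, since ranges are indexed parallel iterators):

```rust
fn block_to_indices(width: usize, i: usize) -> [usize; 4] {
    let row_floored = 2 * i / width;
    let row_i = row_floored * width;
    [
        row_i + 2 * i, row_i + 2 * i + 1,
        row_i + width + 2 * i, row_i + width + 2 * i + 1,
    ]
}

// Sketch of map-then-scatter: the map stage could run in parallel, and only
// the final scatter loop stays single-threaded.
fn map_then_scatter(vec: &[u16], width: usize, saturation: u16) -> Vec<Vec<f32>> {
    let per_block: Vec<Option<([usize; 4], [f32; 4])>> = (0..vec.len() / 4)
        .map(|block_i| {
            let block = block_to_indices(width, block_i);
            // Stand-in for is_saturated(vec, block, wl):
            if block.iter().any(|&i| vec[i] >= saturation) {
                return None;
            }
            // Stand-in for the corrected_val computation:
            Some((block, block.map(|i| vec[i] as f32)))
        })
        .collect();
    // Single-threaded scatter; no re-sort needed because collect kept order.
    let mut result: Vec<Vec<f32>> = vec![Vec::new(); vec.len()];
    for (idx, vals) in per_block.into_iter().flatten() {
        for (&i, v) in idx.iter().zip(vals) {
            result[i].push(v);
        }
    }
    result
}

fn main() {
    let vec: Vec<u16> = (0..24).collect();
    let result = map_then_scatter(&vec, 6, 1000);
    // No block is saturated, so every pixel gets exactly one value.
    assert!(result.iter().all(|v| v.len() == 1));
}
```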
Any ideas? Is it possible to keep the above approach, but convince the borrow checker in some way that the blocks are non-overlapping? Or maybe by using `unsafe {}`?