Although this can be addressed with a 16 word padding this is wasteful of precious local memory which might mean the code can't run in parallel to the extent it might otherwise, or simply cannot fit.
A simple trick which still keeps the addressing quite simple is to shift every odd line to the second half of the data which is then offset by 16 words. In effect bit 0 of the y address is shifted to the top of the addressing index, with an offset.
For example, if a kernel is working on a 16x16 region of memory but requires some data either side of the target tile, it might do something like:
local float ldata[32*16];
int lx = get_local_id(0);
int ly = get_local_id(1);
int i = lx + ly * 32;
// load data
ldata[i] = ..read data block 8 to the left ...;
ldata[i+16] = ..read data 8 to the right...;
// work using i+8 as the centre pixel for this thread
By only changing the calculation of
iand padding the storage with only 16 words, the contention is easily removed without changing any other code:
local float ldata[32*16+16];
int i = lx + ( ly >> 1 ) * 32 + (32*8+16)*(y & 1);
Assuming one is only working in the X direction, for Y the addressing is slightly more complex of course. But this could come at no extra run-time cost once the loops are unwound.