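It is usually caused by bad indexing: a thread reads or writes outside the bounds of an array.
To find the offending access, run "set cuda memcheck on" in cuda-gdb before starting the program.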
When the data size is not a multiple of the block size (BLOCK_SIZE), there are generally two ways to fix it:
Pad the input with extra values up to a round multiple of your block size, or add a bounds check to the kernel so that only threads in the valid index range do the calculations, and the others skip them or return early. Be aware of the implications for block-level and warp-level synchronization primitives if you choose the second option.
From avidday, NVIDIA Forum:
http://forums.nvidia.com/index.php?showtopic=179190
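As a rough illustration of the second option (the bounds check), here is a minimal sketch. The kernel name scale, the helper blocksFor, and the element count of 1000 are made up for this example and do not come from the forum thread; BLOCK_SIZE is assumed to be 256.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed value, pick whatever your kernel uses

// Hypothetical kernel: doubles every element of data in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // bounds check: threads past the end do nothing
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: number of blocks needed to cover n elements
// (integer ceiling of n / blockSize).
static int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

int main()
{
    const int n = 1000;   // deliberately not a multiple of BLOCK_SIZE
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    if (h == NULL) return 1;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = NULL;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return 1;
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // The last block is only partly used; the bounds check covers the rest.
    scale<<<blocksFor(n, BLOCK_SIZE), BLOCK_SIZE>>>(d, n);

    cudaError_t err = cudaGetLastError();              // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[%d] = %f\n", n - 1, h[n - 1]);

    cudaFree(d);
    free(h);
    return 0;
}
```

If the kernel also calls __syncthreads(), it is safer to wrap only the memory accesses in the check, as above, rather than returning early, so that every thread in the block still reaches the barrier.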