Saturday, November 26, 2011

Programming Massively Parallel Processors

Block Synchronization?

Once a grid is launched, its blocks are assigned to streaming multiprocessors
in arbitrary order, resulting in transparent scalability of CUDA
applications. The transparent scalability comes with the limitation that
threads in different blocks cannot synchronize with each other. The only
safe way for threads in different blocks to synchronize with each other
is to terminate the kernel and start a new kernel for the activities after the
synchronization point.
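A minimal sketch of that pattern (the kernel names and the work they do are hypothetical): the only grid-wide barrier is the kernel boundary itself.

```cuda
// Grid-wide synchronization via kernel termination: phase2 must not start
// until every block of phase1 has finished, so the work is split into two
// kernel launches. phase1/phase2 are illustrative kernels.
__global__ void phase1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;           // work before the sync point
}

__global__ void phase2(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;           // work after the sync point
}

void run(float *d_data, int n) {
    dim3 block(256), grid((n + 255) / 256);
    phase1<<<grid, block>>>(d_data, n);
    // Kernel launches on the same stream are serialized: phase2 does not
    // begin until all blocks of phase1 have terminated. That launch
    // boundary is the grid-wide barrier.
    phase2<<<grid, block>>>(d_data, n);
}
```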

Threads are assigned to SMs for execution on a block-by-block basis.
For GT200 processors, each SM can accommodate up to 8 blocks or 1024
threads, whichever becomes a limitation first. Once a block is assigned to
an SM, it is further partitioned into warps. At any time, the SM issues
instructions from only a subset of its resident warps; warps waiting on
long-latency operations are simply skipped over, so those stalls do not
reduce the overall throughput of the execution units.
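The two GT200 limits interact with block size; a small host-side sketch of the arithmetic (the helper function is illustrative, the 8-block / 1024-thread numbers are the ones quoted above):

```cuda
// Host-side arithmetic for the GT200 limits: at most 8 resident blocks
// or 1024 resident threads per SM, whichever is hit first.
#include <stdio.h>

int residentBlocks(int threadsPerBlock) {
    int byThreads = 1024 / threadsPerBlock;   // thread-count limit
    int byBlocks  = 8;                        // block-count limit
    return byThreads < byBlocks ? byThreads : byBlocks;
}

int main(void) {
    // 256-thread blocks: 1024/256 = 4 resident blocks (thread limit binds).
    printf("256 threads/block -> %d blocks, %d threads\n",
           residentBlocks(256), residentBlocks(256) * 256);
    // 64-thread blocks: the 8-block limit binds first, so only 8*64 = 512
    // threads are resident and half the SM's thread capacity sits idle.
    printf("64 threads/block  -> %d blocks, %d threads\n",
           residentBlocks(64), residentBlocks(64) * 64);
    return 0;
}
```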

Friday, November 25, 2011

A discussion of memory allocation

A discussion of CUDA 2D memory allocation:
http://stackoverflow.com/questions/5029920/how-to-use-2d-arrays-in-cuda

It is said that cudaMallocPitch (2D) / cudaMalloc3D are optimized for multi-dimensional data access;
see, for instance, the CUDA SDK example dct8x8 (which uses pitched 2D allocation).
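A minimal sketch of pitched 2D allocation (kernel and sizes are illustrative): cudaMallocPitch pads each row so that rows start at aligned addresses, which is what the "optimized for multi-dimensional access" claim refers to.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t pitch, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Row y starts at (char*)data + y * pitch, NOT data + y * width:
        // pitch is the padded row size in bytes, not the element count.
        float *row = (float *)((char *)data + y * pitch);
        row[x] *= 2.0f;
    }
}

void example(int width, int height) {
    float *d_data;
    size_t pitch;  // filled in by the runtime with the padded row size
    cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale<<<grid, block>>>(d_data, pitch, width, height);
    cudaFree(d_data);
}
```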

A straightforward guide to integrating the CUDA compiler (nvcc) into NetBeans

Google translate from Spanish to English:
http://translate.google.es/translate?js=n&prev=_t&hl=es&ie=UTF-8&layout=2&eotf=1&sl=es&tl=en&u=http%3A%2F%2Fblog.carpente.es%2Fconfigurar-netbeans-para-programar-en-cuda-c%2F&act=url

Uses too much local data?

ptxas error   : Entry function '_Z11Kernel_NamePfS_S_S_S_i' uses too much local data (0x4530 bytes, 0x4000 max)

According to the links below:
http://kirschp.blogspot.com/2008/02/aircrack-speed-up-with-cuda.html
http://forums.nvidia.com/index.php?showtopic=196742

(avidday again):
There is a 16 KB per-thread local memory limit.
After checking my kernel, with BLOCK_SIZE = 52:
(2 * (BLOCK_SIZE/2) + BLOCK_SIZE * (BLOCK_SIZE/2) * 3) * 4 bytes = 16432

Checking under MATLAB:
hex2dec('4530') = 17712 bytes, which goes over the 16 KB (16384-byte) per-thread limit.
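The same check can be done in a few lines of host code (a sketch; BLOCK_SIZE and the array breakdown are taken from the formula above). Note the hand formula evaluates to 16432 bytes, somewhat under the 17712 (0x4530) that ptxas reports, presumably because the compiler adds alignment or spill overhead; either way, both figures exceed the 16384-byte limit.

```cuda
// Reproducing the local-memory estimate: total per-thread float count from
// the kernel's local arrays, times 4 bytes, compared with the 16 KB
// (16384-byte) per-thread limit. ptxas itself reported 0x4530 = 17712 bytes.
#include <stdio.h>

#define BLOCK_SIZE 52

int main(void) {
    int floats = 2 * (BLOCK_SIZE / 2)                 // two half-size arrays
               + BLOCK_SIZE * (BLOCK_SIZE / 2) * 3;   // three half-matrices
    int bytes  = floats * 4;                          // sizeof(float) == 4
    printf("estimated local data: %d bytes (limit 16384)\n", bytes);  // 16432
    printf("ptxas reported:       %d bytes (0x4530)\n", 0x4530);      // 17712
    return 0;
}
```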

Thursday, November 24, 2011

CUDA_EXCEPTION_1: Lane Illegal Address

It is usually caused by bad indexing.
Turn on memcheck in cuda-gdb: (cuda-gdb) set cuda memcheck on

When the data size is not evenly divisible by the block size (BLOCK_SIZE), there are generally two ways to fix this:
Pad the input with extra values to reach a round multiple of your block size, or add a bounds check to the kernel so that only threads in the valid index range do the calculations, and the others just skip them or return early. Be aware of the implications for block- and warp-level synchronization primitives if you choose the second option.

From avidday, NVIDIA Forum:
http://forums.nvidia.com/index.php?showtopic=179190
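A sketch of the second fix (kernel and names are illustrative): a bounds check so that out-of-range threads skip the work, plus the synchronization caveat from the quote above.

```cuda
// Option 2: guard the valid index range. The grid is rounded up, so the
// last block may contain threads with i >= n.
__global__ void saxpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                    // guard: grid may be larger than n
        y[i] = a * x[i] + y[i];
    }
    // Caution: if this kernel used __syncthreads(), the early-return
    // variant ("if (i >= n) return;") would be unsafe, because every
    // thread of a block must reach each __syncthreads() call.
}

void launch(float *d_y, const float *d_x, float a, int n) {
    int block = 256;
    int grid = (n + block - 1) / block;   // round up to cover all n elements
    // Option 1 would instead pad the input to a multiple of block,
    // making the in-kernel guard unnecessary.
    saxpy<<<grid, block>>>(d_y, d_x, a, n);
}
```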
