Once a grid is launched, its blocks are assigned to streaming multiprocessors (SMs) in arbitrary order; this freedom is what gives CUDA applications their transparent scalability. The trade-off is that threads in different blocks cannot synchronize with one another. The only safe way to achieve a grid-wide synchronization is to terminate the kernel at the synchronization point and launch a new kernel for the work that follows it.
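This kernel-boundary synchronization can be sketched as two back-to-back launches on the same stream; the kernel names, data size, and block size below are illustrative assumptions, not taken from the text:

```cuda
#include <cuda_runtime.h>

// Hypothetical two-phase computation: phase1 does the work before the
// synchronization point, phase2 the work after it.
__global__ void phase1(float *data, int n) { /* ... pre-sync work ... */ }
__global__ void phase2(float *data, int n) { /* ... post-sync work ... */ }

int main() {
    const int N = 1 << 20;          // assumed problem size
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    dim3 block(256), grid(N / 256);
    phase1<<<grid, block>>>(d_data, N);
    // Launches on the same stream are ordered: phase2 cannot begin until
    // every block of phase1 has completed, which acts as a grid-wide
    // synchronization point between the two phases.
    phase2<<<grid, block>>>(d_data, N);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```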
Threads are assigned to SMs for execution on a block-by-block basis.
For GT200 processors, each SM can accommodate up to 8 blocks or 1024
threads, whichever becomes a limitation first. Once a block is assigned to
an SM, it is further partitioned into warps. At any point in time, the SM executes only a subset of its resident warps. The remaining warps can wait on long-latency operations, such as global memory accesses, without tying up the execution units, so the overall execution throughput of the SM's many execution units stays high.