Sunday, June 16, 2013

CUDA shared memory array variable


You have two choices for declaring shared memory inside a kernel - static or dynamic. I presume what you are doing at the moment looks something like this:
#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

}
and you would like to be able to easily change BLOCK_SIZE.
One possibility is to continue to use static shared memory allocation, but make the allocation size a template parameter, like this:
template<int blocksize=16>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
    __shared__ float As[blocksize][blocksize];

}
template __global__ void sgemm1<16>(const float *, const float *, float *);
Then you can instantiate as many different block size variants at compile time as you need.
If you want to dynamically allocate the memory, define it like this:
__global__ void sgemm2(const float *A, const float *B, float *C)
{
    extern __shared__ float As[];

} 
and then add the size of the allocation as an argument to the kernel call:
size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);
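One thing to note: the dynamically declared array is one dimensional, so the 2D As[row][col] indexing from the static version has to be done by hand. A sketch of how the body might look (the blocksize argument is my addition here, because the size is no longer a compile-time constant):

```cuda
// Sketch only: dynamic shared memory is a flat array, so 2D indexing
// becomes manual arithmetic. blocksize is passed in at runtime.
__global__ void sgemm2(const float *A, const float *B, float *C, int blocksize)
{
    extern __shared__ float As[];

    int row = threadIdx.y;
    int col = threadIdx.x;

    // Equivalent of As[row][col] in the statically sized version
    As[row * blocksize + col] = A[row * blocksize + col];
    __syncthreads();

    // ... rest of the kernel ...
}
```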
If you have multiple statically declared arrays which you wish to replace with dynamically allocated shared memory, be aware that there is only ever one dynamic shared memory allocation per kernel: every extern __shared__ declaration aliases the same memory segment, so multiple arrays must share it and be laid out within it by hand. So if you had something like:
#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

}
you could replace it with:
#define BLOCK_SIZE (16)

__global__ void sgemm3(const float *A, const float *B, float *C)
{
    extern __shared__ float buffer[];

    float *As = &buffer[0];
    float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];

}
and launch the kernel like this:
size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);
All are equally valid, although I personally favour the template version, because it allows other compiler optimisations, such as automatic loop unrolling, which the dynamic version cannot benefit from without extra work.
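To make the unrolling point concrete, here is a sketch (the switch-based launcher is my own illustration, not part of the original code) of how the template version lets the compiler unroll a loop whose trip count is the block size, while still letting you pick the size at runtime:

```cuda
// Sketch: because blocksize is a compile-time constant inside the
// kernel, #pragma unroll can fully unroll the inner loop.
template<int blocksize>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
    __shared__ float As[blocksize][blocksize];

    #pragma unroll
    for (int k = 0; k < blocksize; k++) {
        // ... inner product work ...
    }
}

// Illustrative runtime dispatch between the instantiated variants
void launch_sgemm(int blocksize, dim3 grid, dim3 block,
                  const float *A, const float *B, float *C)
{
    switch (blocksize) {
        case 8:  sgemm1<8> <<<grid, block>>>(A, B, C); break;
        case 16: sgemm1<16><<<grid, block>>>(A, B, C); break;
        case 32: sgemm1<32><<<grid, block>>>(A, B, C); break;
    }
}
```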

Rapid Problem Solving Using Thrust

http://on-demand.gputechconf.com/gtc-express/2011/presentations/Rapid-Problem-Solving-Using-Thrust.pdf

CUDA Libraries - Thrust
http://www.bu.edu/pasi/files/2011/07/Lecture6.pdf

Sunday, June 9, 2013

Array of Vectors using Thrust

thrust::device_vector<float> vectors[3];
//thrust::device_vector<float> *vectors = new thrust::device_vector<float>[3];

vectors[0] = thrust::device_vector<float>(10);
vectors[1] = thrust::device_vector<float>(10);
vectors[2] = thrust::device_vector<float>(10);
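Each element of the array then behaves like any other device_vector. A quick sketch of using one (the fill/reduce calls are just an illustrative example, not from the original snippet):

```cuda
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <iostream>

int main(void)
{
    // Three device vectors of ten floats each
    thrust::device_vector<float> vectors[3];
    for (int i = 0; i < 3; i++)
        vectors[i] = thrust::device_vector<float>(10);

    // Operate on one of them like any other device_vector
    thrust::fill(vectors[1].begin(), vectors[1].end(), 2.0f);
    float sum = thrust::reduce(vectors[1].begin(), vectors[1].end());

    std::cout << sum << std::endl;
    return 0;
}
```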

Generating a random number vector between 0 and 1.0 using Thrust

http://stackoverflow.com/questions/12614164/generating-a-random-number-vector-between-0-and-1-0-using-thrust


#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

struct prg
{
    float a, b;

    __host__ __device__
    prg(float _a=0.f, float _b=1.f) : a(_a), b(_b) {}

    __host__ __device__
    float operator()(const unsigned int n) const
    {
        thrust::default_random_engine rng;
        thrust::uniform_real_distribution<float> dist(a, b);
        rng.discard(n);

        return dist(rng);
    }
};


int main(void)
{
    const int N = 20;

    thrust::device_vector<float> numbers(N);
    thrust::counting_iterator<unsigned int> index_sequence_begin(0);

    thrust::transform(index_sequence_begin,
            index_sequence_begin + N,
            numbers.begin(),
            prg(1.f,2.f));

    for(int i = 0; i < N; i++)
    {
        std::cout << numbers[i] << std::endl;
    }

    return 0;
}

Friday, June 7, 2013

CUDA 5 visual studio configuration

http://code.msdn.microsoft.com/windowsdesktop/CUDA-50-and-Visual-Studio-20e71aa1

Hello World CUDA NSight VS2008 Thrust

http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/


http://stackoverflow.com/questions/7303633/cannot-build-cuda-v3-2-project-with-thrust-1-3-in-vs2008
Solution:
1- Right-click on the project name in the Solution Explorer window
2- Click "Properties"
3- In the left pane, click Configuration Properties -> Linker
4- Set the value of "Additional Library Directories" to
"$(CUDA_PATH)/lib/$(PlatformName)";"$(NVSDKCOMPUTE_ROOT)/C/common/lib"

5- Right-click on your .cu file
6- Click Properties
7- Click on Cuda Runtime API
8- Set Additional Include Directories to:
$(CUDA_PATH)/include;./;$(NVSDKCOMPUTE_ROOT)/C/common/inc;$(NVSDKCOMPUTE_ROOT)/shared/inc