Wednesday, June 27, 2012

CUDA Basics

  • Allocate memory:
    int *dev_a, *dev_b, *dev_c;
    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );
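
    The HANDLE_ERROR macro used above is not part of the CUDA runtime; it is an error-checking helper (the version below follows the book.h header that ships with CUDA by Example, so treat it as a sketch rather than a required API). It prints the error string and aborts if a call fails:

    #include <stdio.h>
    #include <stdlib.h>

    static void HandleError( cudaError_t err, const char *file, int line ) {
        if (err != cudaSuccess) {
            printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
            exit( EXIT_FAILURE );
        }
    }
    #define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))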

  • Copy data to GPU

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

  • Run the kernel ‘add()’ and copy the data back from the GPU to the CPU; a complete sketch tying the steps together follows at the end of this bullet

    add<<<N,1>>>( dev_a, dev_b, dev_c );
    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );
    
    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
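
    Putting the three steps above together, a minimal end-to-end program might look like the sketch below (it assumes the HANDLE_ERROR macro and headers sketched earlier and uses the add() kernel from the next bullet; the sample values in a and b are arbitrary):

    #define N 10

    __global__ void add( int *a, int *b, int *c ) {
        int tid = blockIdx.x;
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

    int main( void ) {
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;

        // allocate the memory on the GPU
        HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
        HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
        HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

        // fill the host input arrays with some sample values
        for (int i = 0; i < N; i++) {
            a[i] = i;
            b[i] = i * i;
        }

        // copy the inputs to the GPU, run the kernel, copy the result back
        HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ) );
        HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice ) );
        add<<<N,1>>>( dev_a, dev_b, dev_c );
        HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ) );

        for (int i = 0; i < N; i++)
            printf( "%d + %d = %d\n", a[i], b[i], c[i] );

        // free the memory allocated on the GPU
        cudaFree( dev_a );
        cudaFree( dev_b );
        cudaFree( dev_c );
        return 0;
    }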

  • In ‘add<<<N,1>>>’, N represents the number of parallel blocks in which we would like the device to execute the kernel. No dimension of the grid of blocks may exceed 65,535. The second parameter represents the number of threads per block we want the CUDA runtime to create; it cannot exceed the value reported in the maxThreadsPerBlock field of the device properties (a query is sketched at the end of this bullet). How do we tell which block is currently running?

    __global__ void add( int *a, int *b, int *c ) {
        int tid = blockIdx.x;    // handle the data at this index
        if (tid < N)
            c[tid] = a[tid] + b[tid]; 
    }

    There is no need to define the variable blockIdx; it is one of the built-in variables that the CUDA runtime defines.
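
    To check those limits on a given card, you can query the device properties. A minimal sketch (device 0 is an assumption; pick whichever device you are using):

    cudaDeviceProp prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    printf( "max threads per block: %d\n", prop.maxThreadsPerBlock );
    printf( "max grid dimensions: (%d, %d, %d)\n",
            prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );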

  • Thread: parallel threads within a block have the ability to do things that parallel blocks cannot do. With both blocks and threads, a thread's global data index is computed as:

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    [Figure: cuda_thread_block]


    Because CPU thread management and scheduling must be done in software, it simply cannot scale to the number of threads that a GPU can. Because we can simply create a thread for each data element we want to process, parallel programming on a GPU can be far simpler than on a CPU.
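
    A version of the vector add that uses both blocks and threads might look like the sketch below (128 threads per block is an arbitrary choice; the rounded-up block count covers the case where N is not a multiple of the block size):

    __global__ void add( int *a, int *b, int *c ) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < N)              // guard against the extra threads in the last block
            c[tid] = a[tid] + b[tid];
    }

    // launch enough 128-thread blocks to cover all N elements
    add<<<(N + 127) / 128, 128>>>( dev_a, dev_b, dev_c );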

  • CUDA C keyword __shared__ makes a variable resident in shared memory. The CUDA C compiler treats variables in shared memory differently than typical variables: it creates a copy of the variable for each block that you launch on the GPU. Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared memory buffers reside physically on the GPU as opposed to residing in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than that of typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.
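
    A minimal sketch of declaring and using a per-block shared buffer (the kernel name and the fixed size of 256 are assumptions for illustration, and the launch must use 256 threads per block with an input length that is a multiple of 256; real communication between threads through such a buffer also needs the __syncthreads() barrier described in the next bullet):

    __global__ void scale( float *in, float *out ) {
        __shared__ float cache[256];           // one copy of this array per block
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        cache[threadIdx.x] = in[i];            // each thread fills its own slot
        out[i] = 2.0f * cache[threadIdx.x];    // and reads back only its own slot
    }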


  • Synchronization: __syncthreads(). This call guarantees that every thread in the block has completed the instructions prior to the __syncthreads() before the hardware will execute the next instruction on any thread. When the first thread executes the first instruction after our __syncthreads(), every other thread in the block has also finished executing up to the __syncthreads(). In the following example, we use multiple threads to calculate the pixel values in a block and display the result; the first image is rendered without using __syncthreads(). This was asked in summer 2009. A minimal kernel using the barrier is sketched after the images.


    [Figure: No_Sync (output without __syncthreads())]

    [Figure: good_sync (output with __syncthreads())]
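
    A minimal sketch of the barrier in action: a hypothetical kernel that reverses each 256-element chunk of an array in place using a shared buffer. Without the __syncthreads() call, a thread could read a slot that another thread has not written yet:

    __global__ void reverse_each_block( int *d ) {
        __shared__ int s[256];                  // assumes 256 threads per block
        int t      = threadIdx.x;
        int offset = blockIdx.x * blockDim.x;
        s[t] = d[offset + t];                   // each thread writes its own slot
        __syncthreads();                        // wait until every slot has been filled
        d[offset + t] = s[blockDim.x - t - 1];  // now it is safe to read another thread's slot
    }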

  • An event in CUDA is a GPU time stamp that is recorded at a user-specified point in time.

    cudaEvent_t start, stop;
    float       elapsedTime;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );
    // do some work on the GPU
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );                        // wait until the stop event has actually been recorded
    cudaEventElapsedTime( &elapsedTime, start, stop );   // elapsed time in milliseconds
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
