Definitions

In CUDA, x, y, and z follow a Cartesian-style convention: x is the horizontal direction, y is the vertical direction…

myKernel<<<gridSize, blockSize, sharedMemBytes, stream>>>(...

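Parameters:

  • gridSize: [gridDim.x, gridDim.y, gridDim.z]
    gridDim.x : 1 to 2^31 - 1 (compute capability 3.0 and later)
    gridDim.y : 1 to 65535
    gridDim.z : 1 to 65535
  • blockSize: [blockDim.x, blockDim.y, blockDim.z]
    blockDim.x ≤ 1024
    blockDim.y ≤ 1024
    blockDim.z ≤ 64
    blockDim.x × blockDim.y × blockDim.z ≤ 1024
  • sharedMemBytes
    Dynamic shared memory allocated per block, in bytes (optional, defaults to 0)
  • stream
    CUDA stream the kernel is launched into (optional, defaults to 0, the default stream)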
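
As a rough illustration of how these limits are used on the host, the sketch below computes a 1D grid size by rounding up and checks a block shape against the limits listed above. The helper names blocks_for and is_valid_block are hypothetical. The code is plain host-side C++ (it also compiles under nvcc), so it only mirrors the arithmetic and does not launch a kernel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of blocks needed so that
// blocks * threads_per_block >= n (ceiling division).
inline std::int64_t blocks_for(std::int64_t n, std::int64_t threads_per_block) {
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: checks a block shape against the per-dimension
// limits (1024, 1024, 64) and the 1024 threads-per-block limit.
inline bool is_valid_block(unsigned x, unsigned y, unsigned z) {
    if (x == 0 || y == 0 || z == 0) return false;
    if (x > 1024 || y > 1024 || z > 64) return false;
    return static_cast<unsigned long long>(x) * y * z <= 1024ULL;
}
```

With 1000 elements and 256 threads per block this gives 4 blocks, so the last block carries 24 threads with no element to process.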

Global index calculation

Assume the tensor has dimensions [depth, height, width].

1D

Global thread index:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
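
The sketch below emulates the 1D formula on the host to show why it is usually paired with a bounds check: when the grid is rounded up, some threads compute a tid past the end of the data. The function name emulate_1d is hypothetical, and the two loops merely stand in for blockIdx.x and threadIdx.x.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side emulation of a 1D launch. Returns how many times
// each of the n elements is visited when out-of-range threads are skipped.
inline std::vector<int> emulate_1d(int n, int grid_dim_x, int block_dim_x) {
    assert(n >= 0 && grid_dim_x > 0 && block_dim_x > 0);
    std::vector<int> hits(n, 0);
    for (int block_idx = 0; block_idx < grid_dim_x; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim_x; ++thread_idx) {
            // Same formula as the global thread index above.
            int tid = block_idx * block_dim_x + thread_idx;
            if (tid < n) {  // guard: the grid may cover more threads than elements
                ++hits[tid];
            }
        }
    }
    return hits;
}
```

Each element is visited exactly once as long as grid_dim_x * block_dim_x is at least n; elements beyond that product are never visited.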

2D

Global coordinates:

int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;

Convert to a linear index:

int index = tidy * width + tidx;

3D

Global coordinates:

int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
int tidz = blockIdx.z * blockDim.z + threadIdx.z;

Convert to a linear index:

int index = tidz * (height * width) + tidy * width + tidx;
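
To make the row-major layout concrete, the sketch below flattens a (z, y, x) coordinate with the same formula and then inverts it. The Coord3 type and the flatten and unflatten names are hypothetical, introduced only for illustration. The code uses int for brevity; a very large tensor may need a 64-bit index.

```cpp
#include <cassert>

// Hypothetical 3D coordinate type used only for this illustration.
struct Coord3 { int z, y, x; };

// Row-major flattening for a [depth, height, width] tensor; x varies fastest.
inline int flatten(Coord3 c, int height, int width) {
    return c.z * (height * width) + c.y * width + c.x;
}

// Inverse of flatten: recovers (z, y, x) from a linear index.
inline Coord3 unflatten(int index, int height, int width) {
    assert(index >= 0 && height > 0 && width > 0);
    Coord3 c;
    c.z = index / (height * width);
    int rem = index % (height * width);  // offset inside the z-th plane
    c.y = rem / width;
    c.x = rem % width;
    return c;
}
```

For height = 4 and width = 5, the coordinate (z = 1, y = 2, x = 3) maps to 1 * 20 + 2 * 5 + 3 = 33, and unflatten(33, 4, 5) returns the same coordinate.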

Linear thread index within a block:

int local_thread_id = threadIdx.z * (blockDim.y * blockDim.x) + threadIdx.y * blockDim.x + threadIdx.x;

Warp calculations

Equivalent arithmetic:

x (mod n) = x & (n - 1); // n is a power of two, x is non-negative
x / 32 = x >> 5;         // x is non-negative

Thread id within a warp (lane id)

int lane_id = threadIdx.x % warpSize;
// equivalent, because warpSize (32) is a power of two:
int lane_id = threadIdx.x & (warpSize - 1);

Warp id within a block

int warp_id = threadIdx.x / warpSize;
// equivalent when warpSize is 32:
int warp_id = threadIdx.x >> 5;

Number of warps per block

int warps_per_block = (blockDim.x + warpSize - 1) / warpSize; // rounds up to count a partial last warp
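
The threadIdx.x forms above fit 1D blocks. For 2D or 3D blocks, warps are formed from consecutive linear thread ids, so the in-block linear index (local_thread_id) is the value to split. The sketch below shows that split on the host. The names kWarpSize, WarpPos, warp_pos, and warps_per_block are hypothetical, and the warp size of 32 is an assumption that matches current NVIDIA GPUs.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumption: equals CUDA's warpSize

struct WarpPos { int warp_id; int lane_id; };

// Hypothetical helper: splits a linear in-block thread id into warp and lane
// using the shift and mask forms of division and modulo by 32.
inline WarpPos warp_pos(int local_thread_id) {
    assert(local_thread_id >= 0);
    return WarpPos{ local_thread_id >> 5, local_thread_id & (kWarpSize - 1) };
}

// Hypothetical helper: warps needed for a block, rounding up so that a
// trailing partial warp is counted.
inline int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

For example, linear thread id 70 falls in warp 2 at lane 6, and a block of 100 threads occupies 4 warps, the last of which is only partly filled.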
