Definitions
CUDA uses a Cartesian coordinate system for x, y, and z: x is the horizontal direction, y is the vertical direction…
myKernel<<<gridSize, blockSize, sharedMemBytes, stream>>>(...)
Parameters:
- gridSize: [gridDim.x, gridDim.y, gridDim.z]
  gridDim.x: [1, 2^31 - 1]
  gridDim.y: [1, 65535]
  gridDim.z: [1, 65535]
- blockSize: [blockDim.x, blockDim.y, blockDim.z]
  blockDim.x ≤ 1024
  blockDim.y ≤ 1024
  blockDim.z ≤ 64
  blockDim.x × blockDim.y × blockDim.z ≤ 1024
- sharedMemBytes: size in bytes of the dynamically allocated shared memory available to each block
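The limits above can be checked on the host before a launch. The following is a minimal sketch in plain C++, not CUDA API code: `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and `blocks_for`, `block_is_valid`, `grid_is_valid`, and `grid_1d` are illustrative names. The numeric limits assume compute capability 3.0 or later.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3, so the sketch compiles without nvcc.
struct Dim3 {
    unsigned x, y, z;
};

// Number of blocks needed to cover n elements (ceiling division).
constexpr std::uint64_t blocks_for(std::uint64_t n, std::uint64_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Per-dimension block limits plus the 1024 threads-per-block limit.
constexpr bool block_is_valid(const Dim3& b) {
    return b.x >= 1u && b.y >= 1u && b.z >= 1u &&
           b.x <= 1024u && b.y <= 1024u && b.z <= 64u &&
           static_cast<std::uint64_t>(b.x) * b.y * b.z <= 1024u;
}

// Grid limits: x up to 2^31 - 1, y and z up to 65535.
constexpr bool grid_is_valid(const Dim3& g) {
    return g.x >= 1u && g.y >= 1u && g.z >= 1u &&
           g.x <= 2147483647u && g.y <= 65535u && g.z <= 65535u;
}

// Builds a 1D grid for n >= 1 elements and asserts that the shape is legal.
inline Dim3 grid_1d(std::uint64_t n, const Dim3& block) {
    assert(block_is_valid(block));
    const std::uint64_t blocks = blocks_for(n, block.x);
    assert(blocks >= 1u && blocks <= 2147483647ull);
    const Dim3 grid = {static_cast<unsigned>(blocks), 1u, 1u};
    return grid;
}
```

For example, `grid_1d(1000000, Dim3{256u, 1u, 1u})` yields 3907 blocks, so the last block is only partly used and the kernel needs a bounds check on its global index.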
Global index computation
Assume the tensor has dimensions [depth, height, width].
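To make the [depth, height, width] layout concrete, the index arithmetic used in this section can be emulated on the host. This is a sketch in plain C++ rather than kernel code: the function names are illustrative, and the block and thread indices are passed as ordinary arguments instead of being read from `blockIdx`, `blockDim`, and `threadIdx`. The `in_bounds` check is an addition not stated in the text; it is usually needed because the grid often overshoots the tensor edge.

```cpp
#include <cassert>
#include <cstddef>

// Global coordinate along one axis: blockIdx * blockDim + threadIdx.
constexpr unsigned global_coord(unsigned block_idx, unsigned block_dim, unsigned thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Row-major linear index into a [depth, height, width] tensor:
// x runs along width (fastest), y along height, z along depth (slowest).
constexpr std::size_t linear_index(unsigned tidx, unsigned tidy, unsigned tidz,
                                   std::size_t height, std::size_t width) {
    return tidz * (height * width) + tidy * width + tidx;
}

// True when the global coordinate falls inside the tensor.
constexpr bool in_bounds(unsigned tidx, unsigned tidy, unsigned tidz,
                         std::size_t depth, std::size_t height, std::size_t width) {
    return tidx < width && tidy < height && tidz < depth;
}

// Small usage check for a [2, 4, 5] tensor.
inline void check_index_examples() {
    // Block 2 of size 256, thread 5 -> global x coordinate 517.
    assert(global_coord(2u, 256u, 5u) == 517u);
    // (x=3, y=2, z=1) -> 1 * 20 + 2 * 5 + 3 = 33.
    assert(linear_index(3u, 2u, 1u, 4u, 5u) == 33u);
    // The last valid element maps to depth * height * width - 1.
    assert(linear_index(4u, 3u, 1u, 4u, 5u) == 39u);
    // x = 5 is one past the last column and must be skipped.
    assert(!in_bounds(5u, 0u, 0u, 2u, 4u, 5u));
}
```

The 1D and 2D cases are the same formula with `tidz = 0` (and `tidy = 0` for 1D).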
1D
Global thread index:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
2D
Global coordinates:
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
Convert to a linear index:
int index = tidy * width + tidx;
3D
Global coordinates:
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
int tidy = blockIdx.y * blockDim.y + threadIdx.y;
int tidz = blockIdx.z * blockDim.z + threadIdx.z;
Convert to a linear index:
int index = tidz * (height * width) + tidy * width + tidx;
Linear thread index within a block:
int local_thread_id = threadIdx.z * (blockDim.y * blockDim.x) + threadIdx.y * blockDim.x + threadIdx.x;
Warp computation
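Mathematical equivalences (for non-negative integer x):
x mod n = x & (n - 1)   // n is a power of 2
x / 32 = x >> 5
The formulas below use threadIdx.x, which assumes a 1D block; for 2D or 3D blocks, use local_thread_id in its place.
Thread id within a warp (lane id), two equivalent forms:
int lane_id = threadIdx.x % warpSize;
int lane_id = threadIdx.x & (warpSize - 1);
Warp id within a block, two equivalent forms (the shift assumes warpSize is 32):
int warp_id = threadIdx.x / warpSize;
int warp_id = threadIdx.x >> 5;
Number of warps per block (rounded up, so a partially filled last warp is counted):
int warps_per_block = (blockDim.x + warpSize - 1) / warpSize;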
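The warp arithmetic can be verified on the host. The sketch below is plain C++, not kernel code: `kWarpSize` is a hypothetical constant standing in for the built-in `warpSize` and assumes a warp of 32 threads, and the function names are illustrative. It works on the linear thread index within a block, so it also covers 2D and 3D blocks, and it uses ceiling division for the warp count so that a partially filled last warp is included.

```cpp
#include <cassert>

// Assumption: 32 threads per warp, standing in for CUDA's built-in warpSize.
constexpr unsigned kWarpSize = 32u;

// Linear thread index within a block (x fastest, z slowest).
constexpr unsigned local_thread_id(unsigned tx, unsigned ty, unsigned tz,
                                   unsigned block_dim_x, unsigned block_dim_y) {
    return tz * (block_dim_y * block_dim_x) + ty * block_dim_x + tx;
}

// Lane id: position of the thread inside its warp (bit-mask form of % 32).
constexpr unsigned lane_id(unsigned linear_tid) {
    return linear_tid & (kWarpSize - 1u);
}

// Warp id: which warp of the block the thread belongs to (shift form of / 32).
constexpr unsigned warp_id(unsigned linear_tid) {
    return linear_tid >> 5;
}

// Warps per block, rounded up.
constexpr unsigned warps_per_block(unsigned threads_per_block) {
    return (threads_per_block + kWarpSize - 1u) / kWarpSize;
}

// Checks that the bitwise forms agree with % and / for every thread of a
// maximum-size block (1024 threads).
inline void check_warp_math() {
    for (unsigned t = 0u; t < 1024u; ++t) {
        assert(lane_id(t) == t % kWarpSize);
        assert(warp_id(t) == t / kWarpSize);
    }
}
```

With a block of 100 threads, `warps_per_block(100)` is 4, while plain integer division would report 3 and miss the last 4 threads.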