Learning to use the Tensor Memory Accelerator (TMA) in the Hopper architecture to load data from global to shared memory.
We want to reduce global memory accesses when loading data. To do so, large matrices can be divided into subtiles and loaded into shared memory, where threads can access them faster. However, loading these subtiles into shared memory is repetitive and requires extra computation (index calculation, striding and tiling patterns). This is where TMA helps: it offloads these address computations and efficiently loads multi-dimensional arrays from global to shared memory.
TMA can be used to transfer from gmem to smem, smem to gmem, as well as smem to distributed smem. Another cool feature is that we can use TMA to multicast data - which means we can load the same data from global memory once and deliver it to the shared memory of multiple thread blocks (within a cluster). This multicast feature is targeted at the sm_90a architecture though.
The data transfers are also typically asynchronous, so threads can continue computing while the data is transferred from source to destination.
To signal that the data has finished transferring, TMA has completion mechanisms. One of them is the shared memory barrier.
A tensor map encodes the dimensions and layout of the array. We can use the CUtensorMap object and the cuTensorMapEncodeTiled function to create the tensor map. This describes the address, strides, swizzle pattern etc., which are passed in as parameters. The API from the official docs is here
We need to create the tensor map first on the host side. Here's an example of how to create one:
//Creating a tensor map on the host
//This example is taken from the official docs
CUtensorMap tensor_map{};
// rank is the number of dimensions of the array.
constexpr uint32_t rank = 2;
uint64_t size[rank] = {GMEM_WIDTH, GMEM_HEIGHT};
// The stride is the number of bytes to traverse from the first element of one row to the next.
// It must be a multiple of 16.
uint64_t stride[rank - 1] = {GMEM_WIDTH * sizeof(int)};
// The box_size is the size of the shared memory buffer that is used as the
// destination of a TMA transfer.
uint32_t box_size[rank] = {SMEM_WIDTH, SMEM_HEIGHT};
// The distance between elements in units of sizeof(element). A stride of 2
// can be used to load only the real component of a complex-valued tensor, for instance.
uint32_t elem_stride[rank] = {1, 1};
// Create the tensor descriptor and pass in the values as params
CUresult res = cuTensorMapEncodeTiled(
&tensor_map, // CUtensorMap *tensorMap,
CUtensorMapDataType::CU_TENSOR_MAP_DATA_TYPE_INT32,
rank, // cuuint32_t tensorRank,
tensor_ptr, // void *globalAddress,
size, // const cuuint64_t *globalDim,
stride, // const cuuint64_t *globalStrides,
box_size, // const cuuint32_t *boxDim,
elem_stride, // const cuuint32_t *elementStrides,
// Interleave patterns can be used to accelerate loading of values that
// are less than 4 bytes long.
CUtensorMapInterleave::CU_TENSOR_MAP_INTERLEAVE_NONE,
// Swizzling can be used to avoid shared memory bank conflicts.
CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE,
// L2 promotion can be used to widen the effect of a cache-policy to a wider
// set of L2 cache lines.
CUtensorMapL2promotion::CU_TENSOR_MAP_L2_PROMOTION_NONE,
// Any element that is outside of bounds will be set to zero by the TMA transfer.
CUtensorMapFloatOOBfill::CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
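Note that cuTensorMapEncodeTiled is a CUDA driver API function. One way to call it from a runtime-API program is to fetch its entry point at runtime; here is a sketch, assuming a CUDA 12 toolkit (cudaGetDriverEntryPointByVersion):

```cuda
#include <cassert>
#include <cudaTypedefs.h>   // PFN_cuTensorMapEncodeTiled
#include <cuda_runtime.h>

// Fetch the driver-API entry point for cuTensorMapEncodeTiled at runtime.
PFN_cuTensorMapEncodeTiled get_cuTensorMapEncodeTiled() {
    void* func_ptr = nullptr;
    cudaDriverEntryPointQueryResult driver_status;
    // 12000 requests the CUDA 12.0 version of the symbol.
    cudaError_t err = cudaGetDriverEntryPointByVersion(
        "cuTensorMapEncodeTiled", &func_ptr, 12000,
        cudaEnableDefault, &driver_status);
    assert(err == cudaSuccess && driver_status == cudaDriverEntryPointSuccess);
    return reinterpret_cast<PFN_cuTensorMapEncodeTiled>(func_ptr);
}
```

The returned function pointer can then be called exactly like cuTensorMapEncodeTiled above.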
After creating the tensor map, we can just pass the tensor map object to the kernel (device) as a parameter annotated with __grid_constant__:
// On the device
__global__ void kernel(const __grid_constant__ CUtensorMap tensor_map)
{
// Use tensor_map here.
}
//host
int main() {
CUtensorMap map;
// [ ..Initialize map.. ]
kernel<<<1, 1>>>(map);
}
The PTX instructions are device-side instructions that initiate the transfer of data. Some PTX instructions are:
cp.async.bulk.tensor
cp.async.bulk.prefetch.tensor
cp.async.bulk.commit_group
cp.async.bulk.wait_group N
See more on usage here
But you can also use those PTX instructions through the CUDA C++ wrappers, such as cde::cp_async_bulk_tensor_2d_global_to_shared().
Barriers are a completion mechanism that waits for the data to be transferred before continuing execution. For this we can use the cuda::barrier class. First we declare the barrier in shared memory, then have a single thread initialize it, and then make the barrier visible to the async engine. Hopper has multiple memory proxies (paths through which memory operations flow), so the line cde::fence_proxy_async_shared_cta() is used to make the initialized barrier visible to the async proxy. We then pass this barrier to the cp_async_bulk_tensor instruction as a parameter, so that when the TMA copy completes it can signal the barrier.
Next, each thread arrives on the barrier. We want a way to know when the TMA load has actually completed, so we create an arrival token and set the expected number of bytes that have to be transferred in order to consider the transfer complete. That is done using the barrier_arrive_tx() function.
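Putting these steps together, here is a device-side sketch of a 2D TMA load. It follows the pattern in the programming guide; SMEM_HEIGHT and SMEM_WIDTH are assumed to match the tensor map's box size, and the kernel must be compiled for sm_90a:

```cuda
#include <cuda/barrier>
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

__global__ void load_tile(const __grid_constant__ CUtensorMap tensor_map,
                          int x, int y) {
    // Destination buffer; TMA requires 128-byte alignment.
    __shared__ alignas(128) int smem_buffer[SMEM_HEIGHT][SMEM_WIDTH];

    // 1. Declare the barrier in shared memory ...
    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier bar;

    if (threadIdx.x == 0) {
        // 2. ... initialize it from a single thread ...
        init(&bar, blockDim.x);
        // 3. ... and make it visible to the async proxy.
        cde::fence_proxy_async_shared_cta();
    }
    __syncthreads();

    barrier::arrival_token token;
    if (threadIdx.x == 0) {
        // One thread issues the bulk tensor copy, passing the barrier.
        cde::cp_async_bulk_tensor_2d_global_to_shared(
            &smem_buffer, &tensor_map, x, y, bar);
        // Arrive and tell the barrier how many bytes to expect.
        token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_buffer));
    } else {
        token = bar.arrive();
    }
    // Wait until the expected bytes have landed in smem_buffer.
    bar.wait(std::move(token));

    // smem_buffer is now ready to use.
}
```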
Another thing to note is that when transferring from global to shared memory, the completion mechanism used is the shared memory barrier; however, when transferring from shared memory back to global memory, we need to use the bulk async-group completion mechanism, i.e. cde::cp_async_bulk_wait_group_read<0>(). This is stated in this table from the docs here
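A sketch of the shared-to-global direction, assuming a smem_buffer the block has finished writing and a matching tensor_map (names and dimensions are illustrative):

```cuda
#include <cuda/barrier>
namespace cde = cuda::device::experimental;

// Called inside a kernel after all threads have written their results
// to smem_buffer.
__device__ void write_back_tile(const CUtensorMap& tensor_map, int x, int y,
                                int (&smem_buffer)[SMEM_HEIGHT][SMEM_WIDTH]) {
    // Make the shared memory writes visible to the async proxy.
    cde::fence_proxy_async_shared_cta();
    __syncthreads();

    if (threadIdx.x == 0) {
        // Issue the bulk copy back to global memory.
        cde::cp_async_bulk_tensor_2d_shared_to_global(
            &tensor_map, x, y, &smem_buffer);
        // Completion mechanism for smem -> gmem: the bulk async-group.
        cde::cp_async_bulk_commit_group();
        // Wait until the group has finished reading shared memory.
        cde::cp_async_bulk_wait_group_read<0>();
    }
    __syncthreads();
}
```

Note the fence before the copy: it orders the ordinary shared memory writes with the async proxy that performs the TMA store.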
A full code example of how to use this in practice can be found in the CUDA programming guide on pages 248 and 249.
To explore next: swizzling in TMA, prefetching, examples.