This post walks through a simple matrix multiplication exercise. The matrix product function uses multiple thread blocks to compute the product of two matrices.
A simple matrix multiplication
The most important part is the kernel function, which is given below:
kernel_code_template = """
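Note that the assignment to C must sit inside the conditional (the bounds check). When the matrix size is not a multiple of the block size, more threads are launched than there are elements in C, and the check guarantees that these surplus threads do not read or write outside the matrices.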
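The bounds check can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the kernel from this post: it emulates a launch with one thread per element of C and counts how many surplus threads the guard turns away. The names `grid_dim` and `matmul_emulated` are hypothetical, and the row/column indexing scheme is an assumption about how such a kernel is typically written.

```python
def grid_dim(n, block=32):
    """Blocks needed per axis to cover an n x n matrix (ceiling division)."""
    return (n + block - 1) // block


def matmul_emulated(a, b, n, block=4):
    """Emulate a one-thread-per-element launch for C = A * B (hypothetical helper).

    Returns the product and the number of surplus threads that the
    bounds check prevented from touching memory.
    """
    c = [[0.0] * n for _ in range(n)]
    blocks = grid_dim(n, block)
    skipped = 0
    for by in range(blocks):              # blockIdx.y
        for bx in range(blocks):          # blockIdx.x
            for ty in range(block):       # threadIdx.y
                for tx in range(block):   # threadIdx.x
                    row = by * block + ty
                    col = bx * block + tx
                    # The guard: only in-range threads compute and write C.
                    if row < n and col < n:
                        acc = 0.0
                        for k in range(n):
                            acc += a[row][k] * b[k][col]
                        c[row][col] = acc
                    else:
                        skipped += 1
    return c, skipped


if __name__ == "__main__":
    # 320 fits exactly into 32-wide blocks; 321 needs one extra block per axis.
    print(grid_dim(320), grid_dim(321))
```

With a block size of 32, a 320*320 matrix is covered exactly by a 10*10 grid, while a 321*321 matrix needs an 11*11 grid in which most threads of the last row and column of blocks fall outside the matrix. Without the guard, those threads would index past the end of the arrays.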
An odd bug
The code works well for matrices up to 320*320 with a block size of 32*32. But once the matrix size exceeds 320, for example 321, the matrix product computed on the GPU is no longer equal to the CPU result. The difference between them is tiny, on the order of 1e-5. So far, I don’t quite understand where this bug comes from. Probably it is due to the limit on the number of blocks in one grid?
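One candidate explanation worth ruling out, which this post does not confirm, is ordinary single-precision rounding. An 11*11 grid is far below CUDA's per-dimension grid limits (at least 65535), and a discrepancy around 1e-5 is the size one would expect when the same float32 dot product is accumulated in a different order or with different intermediate rounding. The sketch below is pure Python and emulates float32 with the `struct` module; `f32` and `dot_f32` are hypothetical helper names, and the random inputs are made up for illustration.

```python
import math
import random
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(xs, ys, reverse=False):
    """Dot product with every product and partial sum rounded to float32."""
    pairs = list(zip(xs, ys))
    if reverse:
        pairs.reverse()  # same terms, different accumulation order
    acc = 0.0
    for x, y in pairs:
        acc = f32(acc + f32(x * y))
    return acc


random.seed(0)
n = 321  # one row-times-column product for a 321*321 matrix
xs = [f32(random.random()) for _ in range(n)]
ys = [f32(random.random()) for _ in range(n)]

forward = dot_f32(xs, ys)
backward = dot_f32(xs, ys, reverse=True)
reference = math.fsum(x * y for x, y in zip(xs, ys))  # float64 reference

# Absolute gaps between the orderings and against the float64 reference.
print(abs(forward - backward), abs(forward - reference))
# A tolerance-based comparison instead of exact equality.
print(math.isclose(forward, reference, rel_tol=1e-4))
```

If rounding is the cause, comparing the GPU and CPU results with a tolerance rather than exact equality should pass; for NumPy arrays that could be `numpy.allclose(c_gpu, c_cpu, rtol=1e-4)`, where `c_gpu` and `c_cpu` are placeholder names. This does not explain why the mismatch would start exactly at 321, so it is a check to run, not a diagnosis.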
Full code
The repository contains the example code and the speed test. Clone it here.