自己写的cuda函数和cublas和ispc的对比
!!!代码在文末 !!!
开文废话
憋了这么久,终于开始写了。
预备知识
因为cublas的数据存储是按照列优先的,而c/c++是按行存储的。
行优先还是列优先
首先了解“行优先”和“列优先”的知识,这两种方式在数学上的直观描述如下,给定如下矩阵: 矩阵在逻辑上表达为2维的矩阵,M行K列,但存储到内存的时候都是按一维布局,其中按行优先存储和按列优先存储的差异如上图所示 如上图所示,当矩阵按行优先存储然后又按相反的列优先读取的话,就会得到原矩阵转置的结果;同理适用于按列优先存储然后按行优先读取。
例 cublasSgemm 函数
cublasStatus_t cublasSgemm(cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
const float *beta,
float *C, int ldc)
cublasSgemm的官方API说明文档链接cublas
-
根据文档说可以知道,cublasSgemm完成了 C = alpha * op ( A ) * op ( B ) + beta * C 的矩阵乘、加运算。 -
其中alpha和beta是标量, A、 B、 C是以列优先存储的矩阵,A称为乘法左矩阵、B称为乘法右矩阵、C称为结果矩阵,当alpha = 1.0f 并且 beta =0.0f 的时候 cublasSgemm完成了计算: 结果矩阵= op (乘法左矩阵) * op ( 乘法右矩阵) 。 -
cublasOperation_t 该类型表明输入的密集矩阵的形式,其值有 CUBLAS_OP_N(非转置);CUBLAS_OP_T(转置); CUBLAS_OP_C(共轭转置)。该函数对应于BLAS(FORTRAN版)的变量字符’N’或’n’(非转置,即正常形式的矩阵),T’或’t’(转置矩阵);‘C’或’c’(共轭转置矩阵,对应的是复数矩阵。
求解C=AxB
其中(A为A_ROW行A_COL列 B为B_ROW行B_COL列 所以 C为A_ROW行B_COL列)
不使用cublasSgemm transa与transb参数
由于C/C++程序中输入的A和B是按行存储,所以在的情况下,cublas其实读取到的是A和B的转置矩阵AT和BT.
根据线性代数的规则可知CT = (A x B)T = BT x AT 所以cublasSgemm API中几个参数设置如下:
- 设置了cublasSgemm的transa与transb参数=CUBLAS_OP_N
- 乘法左矩阵为BT=参数设置为B,乘法右矩阵为AT=参数设置为A
- 结果矩阵的行数为CT的行数(正常c/c++矩阵的列数)= 参数设置为:B_COL
- 结果矩阵的列数为CT的列数(正常c/c++矩阵的行数)= 参数设置为:A_ROW
- 乘法左矩阵列与乘法右矩阵的行=参数设置为:B_ROW
- 按列优先读取乘法左矩阵B的主维度(即BT有几行、正常c/c++矩阵的列数)=参数设置为B_COL
- 按列优先读取乘法右矩阵A的主维度(即AT有几行、正常c/c++矩阵的列数)=参数设置为A_COL
- 结果矩阵存储在参数C中,它的主维度(即结果矩阵c有几行)= 参数设置为B_COL
cublasSgemm( handle, CUBLAS_OP_N, //矩阵A的属性参数,不转置,按列优先 CUBLAS_OP_N, //矩阵B的属性参数,不转置,按列优先 B_COL, //矩阵BT、CT的行数 A_ROW, //矩阵AT、CT的列数 B_ROW, //BT的列数,AT的行数,此处也可为A_COL,一样的 &a, //alpha的值 d_B, //左矩阵,为B^T B_COL, //B^T的leading dimension,按列优先,则leading dimension为B^T的行数(B的列数) d_A, //右矩阵,为A^T A_COL, //A^T的leading dimension,按列优先,则leading dimension为A^T的行数(A的列数) &b, //beta的值 d_C, //结果矩阵C B_COL //C^T的leading dimension,C^T矩阵一定按列优先,则leading dimension为C^T的行数(C的列数) );//此处的h_C是按列存储的CT
按上面的参数调用cublasSgemm API (矩阵A按行存储在指针d_A, 矩阵B按行存储在指针d_B, 矩阵C的存储空间指针d_C) 最后结果d_C传输到host端的h_C,从结果矩阵的存储空间h_C中按行读取到的就是C=AxB的结果,整个cublasSgemm的计算过程如下图所示。
部分示例代码:
template <typename T1, typename T2>
void cublas_kernel()
{
cublasHandle_t handle;
cublasCreate(&handle);
cudaEventCreate(&start);
cudaEventCreate(&stop);
T1* h_A, * h_B;
T2* h_C, * h_CC;
cudaHostAlloc((void**)&h_A, sizeof(T1) * A_ROW * A_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_B, sizeof(T1) * B_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_C, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_CC, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
MatrixINIT<T1>(A_ROW, A_COL, h_A);
MatrixINIT<T1>(B_ROW, B_COL, h_B);
#if defined(USE_FLOAT_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_FLOAT_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_DOUBLE_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_DOUBLE_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#endif
T1* d_A, * d_B;
T2* d_C;
cudaMalloc((void**)&d_A, sizeof(T1) * A_ROW * A_COL);
cudaMalloc((void**)&d_B, sizeof(T1) * B_ROW * B_COL);
cudaMalloc((void**)&d_C, sizeof(T2) * A_ROW * B_COL);
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cublasSetStream(handle, stream);
const T2 a = 1.0f, b = 0.0f;
TIMER_START(_X);
for (int i = 0; i < N ; i++)
{
cublasSetMatrixAsync(A_ROW, A_COL, sizeof(*h_A), h_A, A_ROW, d_A, A_ROW, stream);
cublasSetMatrixAsync(B_ROW, B_COL, sizeof(*h_B), h_B, B_ROW, d_B, B_ROW, stream);
cudaEventRecord(start, 0);
#if defined(USE_FLOAT_N)
cublasSgemm(
handle,
CUBLAS_OP_N,
CUBLAS_OP_N,
B_COL,
A_ROW,
B_ROW,
&a,
d_B,
B_COL,
d_A,
A_COL,
&b,
d_C,
B_COL
);
#elif defined(USE_DOUBLE_N)
cublasDgemm(
handle,
CUBLAS_OP_N,
CUBLAS_OP_N,
B_COL,
A_ROW,
B_ROW,
&a,
d_B,
B_COL,
d_A,
A_COL,
&b,
d_C,
B_COL
);
#endif
cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cublasGetMatrixAsync(A_ROW, B_COL, sizeof(*h_C), d_C, A_ROW, h_C, A_ROW, stream);
}
TIMER_STOP(_X);
cout << "cublas_kernel GPU传输、计算花费了: " << TIMER_MSEC(_X) << " ms " << "\n";
std::cout << "cublas_kernel GPU计算花费了:" << elapsedTime * N<< " ms" << std::endl<< std::endl;
#if defined(USE_FLOAT_N)
Matrixshow<T2>("计算结果C的值 ( C^T = (B^T*A^T) = (B*A)^T )", A_ROW, B_COL, h_C, 1, 0);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
#elif defined(USE_DOUBLE_N)
Matrixshow<T2>("计算结果C的值 ( C^T = (B^T*A^T) = (B*A)^T )", A_ROW, B_COL, h_C, 1, 0);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
#endif
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
cudaFreeHost(h_C);
cudaFreeHost(h_CC);
cublasDestroy(handle);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
}
使用cublasSgemm transa与transb参数
由于C/C++程序中输入的A和B是按行存储,所以在的情况下,cublas其实读取到的是A和B的转置矩阵A^T 和 B^T 设置了cublasSgemm的transa与transb参数后,可以在做矩阵运算前对读取到的A^T、 B^T矩阵做一次转置,获得A和B根据线性代数的规则可知C = A x B 所以cublasSgemm API中几个参数设置如下:
- 设置了cublasSgemm的transa与transb参数=CUBLAS_OP_T,在进行矩阵运算前对读取的矩阵做一次转置
- 乘法左矩阵为A=参数设置为A,乘法右矩阵为B=参数设置为B
- 结果矩阵的行数为C的行数=参数设置为 A_ROW
- 结果矩阵的列数为C的列数=参数设置为 B_COL
- 乘法左矩阵列与乘法右矩阵的行=参数设置为 A_COL
- 按列优先读取乘法左矩阵A的主维度(就变成了类似CUBLAS_OP_N的参数情况)(即A^T有几行)=参数设置为 A_COL
- 按列优先读取乘法右矩阵B的主维度(就变成了类似CUBLAS_OP_N的参数情况)(即B^T有几行)=参数设置为 B_COL
- 结果矩阵存储在参数C中,它的主维度(即结果矩阵c有几行)= 参数设置为 A_ROW
cublasSgemm( handle, CUBLAS_OP_T , //矩阵A的属性参数,还是按列优先读取,但是在计算前,转置,变成正常c/c++的方式 CUBLAS_OP_T , //矩阵B的属性参数,还是按列优先读取,但是在计算前,转置,变成正常c/c++的方式 A_ROW , //矩阵A、C的行数 B_COL , //矩阵B、C的列数 A_COL, //A的列数,B的行数,此处也可为B_ROW一样的 &a, //alpha的值 d_A , //左矩阵,为A A_COL , //A的leading dimension,按列优先,则leading dimension为(A^T的行数)A的列数 d_B , //右矩阵,为B B_COL , //B的leading dimension,按列优先,则leading dimension为(B^T的行数)A的列数 &b, //beta的值 d_C, //结果矩阵C A_ROW //C的leading dimension,C矩阵一定按列优先,则leading dimension为C的行数 );//此处的h_C是按列存储的C
红色的参数标记出与“不使用cublasSgemm transa与transb参数”例子中的不同,按上面的参数调用cublasSgemm API (矩阵A按行存储在指针d_a, 矩阵B按行存储在指针d_b, 矩阵C的存储空间指针d_c) 最后从结果矩阵的存储空间d_c中按行读取到的就是C=AxB后C^T 的结果,所以在C/C++程序中还需要对读取的结果C^T做一次矩阵转置操作才能获得最终正确的C。整个cublasSgemm的计算过程如下图所示: 结果:
部分示例代码
template <typename T1, typename T2>
void cublas_kernel()
{
cublasHandle_t handle;
cublasCreate(&handle);
cudaEventCreate(&start);
cudaEventCreate(&stop);
T1* h_A, * h_B;
T2* h_C, * h_CC;
cudaHostAlloc((void**)&h_A, sizeof(T1) * A_ROW * A_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_B, sizeof(T1) * B_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_C, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_CC, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
MatrixINIT<T1>(A_ROW, A_COL, h_A);
MatrixINIT<T1>(B_ROW, B_COL, h_B);
#if defined(USE_FLOAT_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 1);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 1);
#elif defined(USE_FLOAT_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_DOUBLE_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_DOUBLE_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#endif
T1* d_A, * d_B;
T2* d_C;
cudaMalloc((void**)&d_A, sizeof(T1) * A_ROW * A_COL);
cudaMalloc((void**)&d_B, sizeof(T1) * B_ROW * B_COL);
cudaMalloc((void**)&d_C, sizeof(T2) * A_ROW * B_COL);
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cublasSetStream(handle, stream);
const T2 a = 1.0f, b = 0.0f;
TIMER_START(_X);
for (int i = 0; i < N ; i++)
{
cublasSetMatrixAsync(A_ROW, A_COL, sizeof(*h_A), h_A, A_ROW, d_A, A_ROW, stream);
cublasSetMatrixAsync(B_ROW, B_COL, sizeof(*h_B), h_B, B_ROW, d_B, B_ROW, stream);
cudaEventRecord(start, 0);
#if defined(USE_FLOAT_T)
cublasSgemm(
handle,
CUBLAS_OP_T,
CUBLAS_OP_T,
A_ROW,
B_COL,
A_COL,
&a,
d_A,
A_COL,
d_B,
B_COL,
&b,
d_C,
A_ROW
);
#elif defined(USE_DOUBLE_T)
cublasDgemm(
handle,
CUBLAS_OP_T,
CUBLAS_OP_T,
A_ROW,
B_COL,
A_COL,
&a,
d_A,
A_COL,
d_B,
B_COL,
&b,
d_C,
A_ROW
);
#endif
cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cublasGetMatrixAsync(A_ROW, B_COL, sizeof(*h_C), d_C, A_ROW, h_C, A_ROW, stream);
}
TIMER_STOP(_X);
cout << "cublas_kernel GPU传输、计算花费了: " << TIMER_MSEC(_X) << " ms " << "\n";
std::cout << "cublas_kernel GPU计算花费了:" << elapsedTime * N<< " ms" << std::endl<< std::endl;
#if defined(USE_FLOAT_T)
Matrixshow<T2>("计算结果C的转置的值 ( C = A*B )", A_ROW, B_COL, h_C, 0);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 1);
#endif
#elif defined(USE_DOUBLE_T)
Matrixshow<T2>("计算结果C的转置的值 ( C = A*B )", A_ROW, B_COL, h_C, 0);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 1);
#endif
#endif
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
cudaFreeHost(h_C);
cudaFreeHost(h_CC);
cublasDestroy(handle);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
}
cublasGemmEx的使用
但是GPU的内存一般比较有限,实际项目中,为了能够做更大的矩阵乘法,使用FP32型数据占用空间会很大,所以可以压缩为int8型数据进行矩阵乘法运算,这时使用的接口模型为:
cublasStatus_t cublasGemmEx(cublasHandle_t handle, //句柄 cublasOperation_t transa, //a矩阵转置选项 cublasOperation_t transb, //b矩阵转置选项 int m, //a矩阵行数 int n, //b矩阵列数 int k, //a矩阵列数兼b矩阵行数 const void *alpha, //乘法因子alpha const void *A, //A矩阵 cudaDataType_t Atype, //A矩阵的数据类型 int lda, //A矩阵的列数 const void *B, //B矩阵 cudaDataType_t Btype, //B矩阵的数据类型 int ldb, //B矩阵的行数 const void *beta, //乘法因子beta void *C, //C结果矩阵 cudaDataType_t Ctype, //C结果矩阵数据类型 int ldc, //C矩阵的行数 cudaDataType_t computeType, //计算模式 cublasGemmAlgo_t algo)
该矩阵接口和cublasSgemm几乎相同,但多了几个参数:cudaDataType_t Atype A矩阵数据类型、cudaDataType_t Btype B矩阵数据类型、cudaDataType_t Ctype C矩阵数据类型以及cudaDataType_t computeType 计算模式和cublasGemmAlgo_t参数。首先来说,这个接口可以完成的运算表达式如下:
C=αop(A)op(B)+βC 其中,α和β为标量,A、B、C分别是m×k、k×n、m×n的矩阵,当α = 1, β = 0时,即为:
C=op(A)op(B)
也就是做矩阵乘法。α和β分别对应cublasGemmEx中的const void *alpha、const void *beta,这一点与前面cublasSegmm做FP32类型的接口是一样的。
在接口中,cublasOperation_t transa矩阵转置选项有如下三种:
-
transa == CUBLAS_OP_N表示A矩阵,无任何转置。 -
transa == CUBLAS_OP_T表示A^T,A的转置。 -
transa == CUBLAS_OP_C表示A^H,A的共轭转置。
同理,B矩阵也是一样。
在接口中,新增的cudaDataType_t Atype A矩阵数据类型、cudaDataType_t Btype B矩阵数据类型、cudaDataType_t Ctype C矩阵数据类型以及cudaDataType_t computeType这几个参数的取值情况如下表所示: 我们这里选择的int8型矩阵乘法,最好选择Atype/Btype为CUDA_R_8I型,计算模式CUDA_R_32I,乘出的结果Ctype为CUDA_R_32I类型。最后的cublasGemmAlgo_t参数可以选择的值如下表所示: 这里可以选择CUBLAS_GEMM_DEFAULT,应用启发式选择GEMM算法。
部分示例代码
cublasGemmEX函数做int8型矩阵乘法实现:
template <typename T1, typename T2>
void cublas_kernel()
{
cublasHandle_t handle;
cublasCreate(&handle);
cudaEventCreate(&start);
cudaEventCreate(&stop);
T1* h_A, * h_B;
T2* h_C, * h_CC;
cudaHostAlloc((void**)&h_A, sizeof(T1) * A_ROW * A_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_B, sizeof(T1) * B_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_C, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_CC, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
MatrixINIT<T1>(A_ROW, A_COL, h_A);
MatrixINIT<T1>(B_ROW, B_COL, h_B);
#if defined(USE_INT8_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0, 0, "char");
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0, 0,"char");
#elif defined(USE_INT8_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0, 0, "char");
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0, 0, "char");
#endif
T1* d_A, * d_B;
T2* d_C;
cudaMalloc((void**)&d_A, sizeof(T1) * A_ROW * A_COL);
cudaMalloc((void**)&d_B, sizeof(T1) * B_ROW * B_COL);
cudaMalloc((void**)&d_C, sizeof(T2) * A_ROW * B_COL);
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cublasSetStream(handle, stream);
const T2 a = 1.0f, b = 0.0f;
TIMER_START(_X);
for (int i = 0; i < N ; i++)
{
cublasSetMatrixAsync(A_ROW, A_COL, sizeof(*h_A), h_A, A_ROW, d_A, A_ROW, stream);
cublasSetMatrixAsync(B_ROW, B_COL, sizeof(*h_B), h_B, B_ROW, d_B, B_ROW, stream);
cudaEventRecord(start, 0);
#if defined(USE_INT8_T)
cublasGemmEx(handle,
CUBLAS_OP_T,
CUBLAS_OP_T,
A_ROW,
B_COL,
A_COL,
&a,
d_A,
CUDA_R_8I,
A_COL,
d_B,
CUDA_R_8I,
B_COL,
&b,
d_C,
CUDA_R_32I,
A_ROW,
CUDA_R_32I,
CUBLAS_GEMM_DFALT
);
#endif
cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cublasGetMatrixAsync(A_ROW, B_COL, sizeof(*h_C), d_C, A_ROW, h_C, A_ROW, stream);
}
TIMER_STOP(_X);
cout << "cublas_kernel GPU传输、计算花费了: " << TIMER_MSEC(_X) << " ms " << "\n";
std::cout << "cublas_kernel GPU计算花费了:" << elapsedTime * N<< " ms" << std::endl<< std::endl;
#if defined(USE_INT8_T)
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 1);
#endif
#elif defined(USE_INT8_N)
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
#endif
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
cudaFreeHost(h_C);
cudaFreeHost(h_CC);
cublasDestroy(handle);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
}
当运行有误时,可以打印cublasGemmEx的返回值来确认是什么错误,官方定义的错误码如下表所示: 成功时会返回0,也就是CUBLAS_STATUS_SUCCESS。多数情况会返回CUBLAS_STATUS_INVALID_VALUE,这就是m,n,k参数传入不正确所致。
运行结果
注意
报错:
On entry to GEMM_EX parameter 12 had a illegal value
原因是:这个函数的 lda, ldb 参数(主维度)只能是 4 的整数倍 。
整合一下前边出现的代码
template <typename T1, typename T2>
void cublas_kernel()
{
cublasHandle_t handle;
cublasCreate(&handle);
cudaEventCreate(&start);
cudaEventCreate(&stop);
T1* h_A, * h_B;
T2* h_C, * h_CC;
cudaHostAlloc((void**)&h_A, sizeof(T1) * A_ROW * A_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_B, sizeof(T1) * B_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_C, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_CC, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
MatrixINIT<T1>(A_ROW, A_COL, h_A);
MatrixINIT<T1>(B_ROW, B_COL, h_B);
#if defined(USE_FLOAT_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 1);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 1);
#elif defined(USE_FLOAT_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_DOUBLE_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_DOUBLE_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0);
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0);
#elif defined(USE_INT8_T)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0, 0, "char");
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0, 0, "char");
#elif defined(USE_INT8_N)
Matrixshow<T1>("A", A_ROW, A_COL, h_A, 0, 0, "char");
Matrixshow<T1>("B", B_ROW, B_COL, h_B, 0, 0, "char");
#endif
T1* d_A, * d_B;
T2* d_C;
cudaMalloc((void**)&d_A, sizeof(T1) * A_ROW * A_COL);
cudaMalloc((void**)&d_B, sizeof(T1) * B_ROW * B_COL);
cudaMalloc((void**)&d_C, sizeof(T2) * A_ROW * B_COL);
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
cublasSetStream(handle, stream);
const T2 a = 1.0f, b = 0.0f;
TIMER_START(_X);
for (int i = 0; i < N ; i++)
{
cublasSetMatrixAsync(A_ROW, A_COL, sizeof(*h_A), h_A, A_ROW, d_A, A_ROW, stream);
cublasSetMatrixAsync(B_ROW, B_COL, sizeof(*h_B), h_B, B_ROW, d_B, B_ROW, stream);
cudaEventRecord(start, 0);
#if defined(USE_FLOAT_N)
cublasSgemm(
handle,
CUBLAS_OP_N,
CUBLAS_OP_N,
B_COL,
A_ROW,
B_ROW,
&a,
d_B,
B_COL,
d_A,
A_COL,
&b,
d_C,
B_COL
);
#elif defined(USE_FLOAT_T)
cublasSgemm(
handle,
CUBLAS_OP_T,
CUBLAS_OP_T,
A_ROW,
B_COL,
A_COL,
&a,
d_A,
A_COL,
d_B,
B_COL,
&b,
d_C,
A_ROW
);
#elif defined(USE_DOUBLE_N)
cublasDgemm(
handle,
CUBLAS_OP_N,
CUBLAS_OP_N,
B_COL,
A_ROW,
B_ROW,
&a,
d_B,
B_COL,
d_A,
A_COL,
&b,
d_C,
B_COL
);
#elif defined(USE_DOUBLE_T)
cublasDgemm(
handle,
CUBLAS_OP_T,
CUBLAS_OP_T,
A_ROW,
B_COL,
A_COL,
&a,
d_A,
A_COL,
d_B,
B_COL,
&b,
d_C,
A_ROW
);
#elif defined(USE_INT8_N)
cublasGemmEx(handle,
CUBLAS_OP_N,
CUBLAS_OP_N,
B_COL,
A_ROW,
B_ROW,
&a,
d_B,
CUDA_R_8I,
B_COL,
d_A,
CUDA_R_8I,
A_COL,
&b,
d_C,
CUDA_R_32I,
B_COL,
CUDA_R_32I,
CUBLAS_GEMM_DFALT
);
#elif defined(USE_INT8_T)
cublasGemmEx(handle,
CUBLAS_OP_T,
CUBLAS_OP_T,
A_ROW,
B_COL,
A_COL,
&a,
d_A,
CUDA_R_8I,
A_COL,
d_B,
CUDA_R_8I,
B_COL,
&b,
d_C,
CUDA_R_32I,
A_ROW,
CUDA_R_32I,
CUBLAS_GEMM_DFALT
);
#endif
cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cublasGetMatrixAsync(A_ROW, B_COL, sizeof(*h_C), d_C, A_ROW, h_C, A_ROW, stream);
}
TIMER_STOP(_X);
#if defined(USE_FLOAT_T)
Matrixshow<T2>("计算结果C的转置的值 ( C = A*B )", A_ROW, B_COL, h_C, 0, 1);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 1);
#endif
#elif defined(USE_FLOAT_N)
Matrixshow<T2>("计算结果C的值 ( C^T = (B^T*A^T) = (B*A)^T )", A_ROW, B_COL, h_C, 0, 0);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
#elif defined(USE_DOUBLE_T)
Matrixshow<T2>("计算结果C的转置的值 ( C = A*B )", A_ROW, B_COL, h_C, 0, 1);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 1);
#endif
#elif defined(USE_DOUBLE_N)
Matrixshow<T2>("计算结果C的值 ( C^T = (B^T*A^T) = (B*A)^T )", A_ROW, B_COL, h_C, 0, 0);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
#elif defined(USE_INT8_T)
Matrixshow<T2>("计算结果C的转置的值 ( C = A*B )", A_ROW, B_COL, h_C, 0, 1);
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 1);
#endif
#elif defined(USE_INT8_N)
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
#endif
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFreeHost(h_A);
cudaFreeHost(h_B);
cudaFreeHost(h_C);
cudaFreeHost(h_CC);
cublasDestroy(handle);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
}
ISPC的矩阵代码
ISPC的一些简单介绍和配置和使用请参考之前记录的ISPC的专栏文章。 下面的代码是根据ISPC提供的参考代码修改的。
这个好像不支持c++的模板函数,所以写起来感觉可能很多。
SGEMM_kernels_ispc.ispc
#define TILE_SIZE 32
export uniform int SGEMM_get_program_count() {
return programCount;
}
export uniform int SGEMM_get_tile_size() {
return TILE_SIZE;
}
#define TILE_HEIGHT 2
#define TILE_WIDTH 32
export void SGEMM_tileBlockNoSIMDIntrin_2_int(uniform int matrixA[], uniform int matrixB[], uniform int matrixC[],
uniform int M, uniform int N, uniform int K) {
uniform int sumTile[TILE_HEIGHT][TILE_WIDTH];
uniform int oneAVal[TILE_HEIGHT];
for (uniform unsigned int m = 0; m < M; m += TILE_HEIGHT) {
for (uniform unsigned int k0 = 0; k0 < K; k0 += TILE_WIDTH) {
foreach (ki = 0 ... TILE_WIDTH) {
sumTile[0][ki] = 0;
sumTile[1][ki] = 0;
}
for (uniform unsigned int n = 0; n < N; n++) {
uniform unsigned int mTimesNPlusN = (m + 0)*N + n;
uniform unsigned int mPlusOneTimesNPlusN = (m + 1)*N + n;
prefetch_nt(&matrixA[mPlusOneTimesNPlusN]);
prefetch_nt(&matrixA[mTimesNPlusN]);
oneAVal[0] = matrixA[mTimesNPlusN];
oneAVal[1] = matrixA[mPlusOneTimesNPlusN];
uniform unsigned int nTimesKPlusk0 = n*K + k0;
foreach (kt = 0 ... TILE_WIDTH) {
varying int matB1 = matrixB[nTimesKPlusk0 + kt];
sumTile[0][kt] += oneAVal[0] * matB1;
sumTile[1][kt] += oneAVal[1] * matB1;
}
}
uniform unsigned int mTimesKPlusK0 = (m + 0)*K + k0;
uniform unsigned int mPlusOneTimesKPlusK0 = (m + 1)*K + k0;
foreach (ki = 0 ... TILE_WIDTH) {
matrixC[mTimesKPlusK0 + ki] = sumTile[0][ki];
matrixC[mPlusOneTimesKPlusK0 + ki] = sumTile[1][ki];
}
}
}
}
task void SGEMM_tileBlockNoSIMDIntrin_2_task_int(uniform int matrixA[], uniform int matrixB[], uniform int matrixC[],
uniform int M, uniform int N, uniform int K) {
uniform int sumTile[TILE_HEIGHT][TILE_WIDTH];
uniform int oneAVal[TILE_HEIGHT];
uniform unsigned int uNumRowsPerTask = M / taskCount;
uniform unsigned int uRowStart = uNumRowsPerTask * taskIndex;
uniform unsigned int uRowEnd = uRowStart + uNumRowsPerTask;
for (uniform unsigned int m = uRowStart; m < uRowEnd; m += TILE_HEIGHT) {
for (uniform unsigned int k0 = 0; k0 < K; k0 += TILE_WIDTH) {
foreach (ki = 0 ... TILE_WIDTH) {
sumTile[0][ki] = 0;
sumTile[1][ki] = 0;
}
for (uniform unsigned int n = 0; n < N; n++) {
uniform unsigned int mTimesNPlusN = (m + 0)*N + n;
uniform unsigned int mPlusOneTimesNPlusN = (m + 1)*N + n;
prefetch_nt(&matrixA[mPlusOneTimesNPlusN]);
prefetch_nt(&matrixA[mTimesNPlusN]);
oneAVal[0] = matrixA[mTimesNPlusN];
oneAVal[1] = matrixA[mPlusOneTimesNPlusN];
uniform unsigned int nTimesKPlusk0 = n*K + k0;
foreach (kt = 0 ... TILE_WIDTH) {
varying int matB1 = matrixB[nTimesKPlusk0 + kt];
sumTile[0][kt] += oneAVal[0] * matB1;
sumTile[1][kt] += oneAVal[1] * matB1;
}
}
uniform unsigned int mTimesKPlusK0 = (m + 0)*K + k0;
uniform unsigned int mPlusOneTimesKPlusK0 = (m + 1)*K + k0;
foreach (ki = 0 ... TILE_WIDTH) {
matrixC[mTimesKPlusK0 + ki] = sumTile[0][ki];
matrixC[mPlusOneTimesKPlusK0 + ki] = sumTile[1][ki];
}
}
}
}
export void SGEMM_tileBlockNoSIMDIntrin_2_withTasks_int(uniform int matA[], uniform int matB[], uniform int matC[],
uniform int M, uniform int N, uniform int K) {
uniform int numTasks = M / programCount;
launch[numTasks] SGEMM_tileBlockNoSIMDIntrin_2_task_int(matA, matB, matC, M, N, K);
}
export void SGEMM_tileBlockNoSIMDIntrin_2_float(uniform float matrixA[], uniform float matrixB[], uniform float matrixC[],
uniform int M, uniform int N, uniform int K) {
uniform float sumTile[TILE_HEIGHT][TILE_WIDTH];
uniform float oneAVal[TILE_HEIGHT];
for (uniform unsigned int m = 0; m < M; m += TILE_HEIGHT) {
for (uniform unsigned int k0 = 0; k0 < K; k0 += TILE_WIDTH) {
foreach (ki = 0 ... TILE_WIDTH) {
sumTile[0][ki] = 0.0f;
sumTile[1][ki] = 0.0f;
}
for (uniform unsigned int n = 0; n < N; n++) {
uniform unsigned int mTimesNPlusN = (m + 0)*N + n;
uniform unsigned int mPlusOneTimesNPlusN = (m + 1)*N + n;
prefetch_nt(&matrixA[mPlusOneTimesNPlusN]);
prefetch_nt(&matrixA[mTimesNPlusN]);
oneAVal[0] = matrixA[mTimesNPlusN];
oneAVal[1] = matrixA[mPlusOneTimesNPlusN];
uniform unsigned int nTimesKPlusk0 = n*K + k0;
foreach (kt = 0 ... TILE_WIDTH) {
varying float matB1 = matrixB[nTimesKPlusk0 + kt];
sumTile[0][kt] += oneAVal[0] * matB1;
sumTile[1][kt] += oneAVal[1] * matB1;
}
}
uniform unsigned int mTimesKPlusK0 = (m + 0)*K + k0;
uniform unsigned int mPlusOneTimesKPlusK0 = (m + 1)*K + k0;
foreach (ki = 0 ... TILE_WIDTH) {
matrixC[mTimesKPlusK0 + ki] = sumTile[0][ki];
matrixC[mPlusOneTimesKPlusK0 + ki] = sumTile[1][ki];
}
}
}
}
task void SGEMM_tileBlockNoSIMDIntrin_2_task_float(uniform float matrixA[], uniform float matrixB[], uniform float matrixC[],
uniform int M, uniform int N, uniform int K) {
uniform float sumTile[TILE_HEIGHT][TILE_WIDTH];
uniform float oneAVal[TILE_HEIGHT];
uniform unsigned int uNumRowsPerTask = M / taskCount;
uniform unsigned int uRowStart = uNumRowsPerTask * taskIndex;
uniform unsigned int uRowEnd = uRowStart + uNumRowsPerTask;
for (uniform unsigned int m = uRowStart; m < uRowEnd; m += TILE_HEIGHT) {
for (uniform unsigned int k0 = 0; k0 < K; k0 += TILE_WIDTH) {
foreach (ki = 0 ... TILE_WIDTH) {
sumTile[0][ki] = 0.0f;
sumTile[1][ki] = 0.0f;
}
for (uniform unsigned int n = 0; n < N; n++) {
uniform unsigned int mTimesNPlusN = (m + 0)*N + n;
uniform unsigned int mPlusOneTimesNPlusN = (m + 1)*N + n;
prefetch_nt(&matrixA[mPlusOneTimesNPlusN]);
prefetch_nt(&matrixA[mTimesNPlusN]);
oneAVal[0] = matrixA[mTimesNPlusN];
oneAVal[1] = matrixA[mPlusOneTimesNPlusN];
uniform unsigned int nTimesKPlusk0 = n*K + k0;
foreach (kt = 0 ... TILE_WIDTH) {
varying float matB1 = matrixB[nTimesKPlusk0 + kt];
sumTile[0][kt] += oneAVal[0] * matB1;
sumTile[1][kt] += oneAVal[1] * matB1;
}
}
uniform unsigned int mTimesKPlusK0 = (m + 0)*K + k0;
uniform unsigned int mPlusOneTimesKPlusK0 = (m + 1)*K + k0;
foreach (ki = 0 ... TILE_WIDTH) {
matrixC[mTimesKPlusK0 + ki] = sumTile[0][ki];
matrixC[mPlusOneTimesKPlusK0 + ki] = sumTile[1][ki];
}
}
}
}
export void SGEMM_tileBlockNoSIMDIntrin_2_withTasks_float(uniform float matA[], uniform float matB[], uniform float matC[],
uniform int M, uniform int N, uniform int K) {
uniform int numTasks = M / programCount;
launch[numTasks] SGEMM_tileBlockNoSIMDIntrin_2_task_float(matA, matB, matC, M, N, K);
}
export void SGEMM_tileBlockNoSIMDIntrin_2_double(uniform double matrixA[], uniform double matrixB[], uniform double matrixC[],
uniform int M, uniform int N, uniform int K) {
uniform double sumTile[TILE_HEIGHT][TILE_WIDTH];
uniform double oneAVal[TILE_HEIGHT];
for (uniform unsigned int m = 0; m < M; m += TILE_HEIGHT) {
for (uniform unsigned int k0 = 0; k0 < K; k0 += TILE_WIDTH) {
foreach (ki = 0 ... TILE_WIDTH) {
sumTile[0][ki] = 0.0d;
sumTile[1][ki] = 0.0d;
}
for (uniform unsigned int n = 0; n < N; n++) {
uniform unsigned int mTimesNPlusN = (m + 0)*N + n;
uniform unsigned int mPlusOneTimesNPlusN = (m + 1)*N + n;
prefetch_nt(&matrixA[mPlusOneTimesNPlusN]);
prefetch_nt(&matrixA[mTimesNPlusN]);
oneAVal[0] = matrixA[mTimesNPlusN];
oneAVal[1] = matrixA[mPlusOneTimesNPlusN];
uniform unsigned int nTimesKPlusk0 = n*K + k0;
foreach (kt = 0 ... TILE_WIDTH) {
varying double matB1 = matrixB[nTimesKPlusk0 + kt];
sumTile[0][kt] += oneAVal[0] * matB1;
sumTile[1][kt] += oneAVal[1] * matB1;
}
}
uniform unsigned int mTimesKPlusK0 = (m + 0)*K + k0;
uniform unsigned int mPlusOneTimesKPlusK0 = (m + 1)*K + k0;
foreach (ki = 0 ... TILE_WIDTH) {
matrixC[mTimesKPlusK0 + ki] = sumTile[0][ki];
matrixC[mPlusOneTimesKPlusK0 + ki] = sumTile[1][ki];
}
}
}
}
task void SGEMM_tileBlockNoSIMDIntrin_2_task_double(uniform double matrixA[], uniform double matrixB[], uniform double matrixC[],
uniform int M, uniform int N, uniform int K) {
uniform double sumTile[TILE_HEIGHT][TILE_WIDTH];
uniform double oneAVal[TILE_HEIGHT];
uniform unsigned int uNumRowsPerTask = M / taskCount;
uniform unsigned int uRowStart = uNumRowsPerTask * taskIndex;
uniform unsigned int uRowEnd = uRowStart + uNumRowsPerTask;
for (uniform unsigned int m = uRowStart; m < uRowEnd; m += TILE_HEIGHT) {
for (uniform unsigned int k0 = 0; k0 < K; k0 += TILE_WIDTH) {
foreach (ki = 0 ... TILE_WIDTH) {
sumTile[0][ki] = 0.0d;
sumTile[1][ki] = 0.0d;
}
for (uniform unsigned int n = 0; n < N; n++) {
uniform unsigned int mTimesNPlusN = (m + 0)*N + n;
uniform unsigned int mPlusOneTimesNPlusN = (m + 1)*N + n;
prefetch_nt(&matrixA[mPlusOneTimesNPlusN]);
prefetch_nt(&matrixA[mTimesNPlusN]);
oneAVal[0] = matrixA[mTimesNPlusN];
oneAVal[1] = matrixA[mPlusOneTimesNPlusN];
uniform unsigned int nTimesKPlusk0 = n*K + k0;
foreach (kt = 0 ... TILE_WIDTH) {
varying double matB1 = matrixB[nTimesKPlusk0 + kt];
sumTile[0][kt] += oneAVal[0] * matB1;
sumTile[1][kt] += oneAVal[1] * matB1;
}
}
uniform unsigned int mTimesKPlusK0 = (m + 0)*K + k0;
uniform unsigned int mPlusOneTimesKPlusK0 = (m + 1)*K + k0;
foreach (ki = 0 ... TILE_WIDTH) {
matrixC[mTimesKPlusK0 + ki] = sumTile[0][ki];
matrixC[mPlusOneTimesKPlusK0 + ki] = sumTile[1][ki];
}
}
}
}
export void SGEMM_tileBlockNoSIMDIntrin_2_withTasks_double(uniform double matA[], uniform double matB[], uniform double matC[],
uniform int M, uniform int N, uniform int K) {
uniform int numTasks = M / programCount;
launch[numTasks] SGEMM_tileBlockNoSIMDIntrin_2_task_double(matA, matB, matC, M, N, K);
}
ispc.hpp
#include <device_launch_parameters.h>
#include "cuda_runtime.h"
#include "cublas_v2.h"
#include <iostream>
#include <stdlib.h>
#include <math.h>
#include "SGEMM_kernels_ispc.h"
#include "matrix.hpp"
bool tasks;
using namespace std;
using namespace ispc;
typedef void (*SGEMMFuncPtr)(void);
typedef void (*SGEMMFuncPtr_SingleThreaded_int)(int* matrixA, int* matrixB, int* matrixC, unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL);
typedef void (*SGEMMFuncPtr_MultiThreaded_int)(int* matrixA, int* matrixB, int* matrixC, unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL);
typedef void (*SGEMMFuncPtr_SingleThreaded_float)(float* matrixA, float* matrixB, float* matrixC, unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL);
typedef void (*SGEMMFuncPtr_MultiThreaded_float)(float* matrixA, float* matrixB, float* matrixC, unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL);
typedef void (*SGEMMFuncPtr_SingleThreaded_double)(double* matrixA, double* matrixB, double* matrixC, unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL);
typedef void (*SGEMMFuncPtr_MultiThreaded_double)(double* matrixA, double* matrixB, double* matrixC, unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL);
template <typename T1>
void Test_SGEMM(SGEMMFuncPtr SGEMMFunc, T1* matrixA, T1* matrixB, T1* matrixC,
unsigned int A_ROW, unsigned int A_COL, unsigned int B_COL, bool task) {
unsigned int i;
if (task) {
#if defined(USE_ISPC_INT)
auto SGEMMFunc_MT = (SGEMMFuncPtr_MultiThreaded_int)SGEMMFunc;
#elif defined(USE_ISPC_FLOAT)
auto SGEMMFunc_MT = (SGEMMFuncPtr_MultiThreaded_float)SGEMMFunc;
#elif defined(USE_ISPC_DOUBLE)
auto SGEMMFunc_MT = (SGEMMFuncPtr_MultiThreaded_double)SGEMMFunc;
#endif
for (i = 0; i < N; i++) {
SGEMMFunc_MT(matrixA, matrixB, matrixC, A_ROW, A_COL, B_COL);
}
}
else {
#if defined(USE_ISPC_INT)
auto SGEMMFunc_ST = (SGEMMFuncPtr_SingleThreaded_int)SGEMMFunc;
#elif defined(USE_ISPC_FLOAT)
auto SGEMMFunc_ST = (SGEMMFuncPtr_SingleThreaded_float)SGEMMFunc;
#elif defined(USE_ISPC_DOUBLE)
auto SGEMMFunc_ST = (SGEMMFuncPtr_SingleThreaded_double)SGEMMFunc;
#endif
for (i = 0; i < N; i++)
{
SGEMMFunc_ST(matrixA, matrixB, matrixC, A_ROW, A_COL, B_COL);
}
}
}
template <typename T1, typename T2>
void cpu_ispc() {
int programCount = SGEMM_get_program_count();
int tileSize = SGEMM_get_tile_size();
if (B_COL % programCount != 0 || B_COL % tileSize != 0) {
printf("\nNumber of columns in Matrix B (K) must be a multiple of %d (target width) and %d (tile size)!\n",
programCount, tileSize);
exit(-1);
}
if (A_ROW % programCount != 0) {
printf("\nNumber of rows in Matrix A (M) must be a multiple of %d (target width)!\n", programCount);
exit(-1);
}
if (A_COL % programCount != 0) {
printf("\nNumber of columns in Matrix A (N), which is also number of rows in Matrix B, "
"must be a multiple of %d (target width)!\n",
programCount);
exit(-1);
}
T1* h_A, * h_B;
T2* h_C, * h_CC;
h_A = (T1*)malloc(sizeof(T1) * A_ROW * A_COL);
h_B = (T1*)malloc(sizeof(T1) * B_ROW * B_COL);
h_C = (T2*)malloc(sizeof(T2) * A_ROW * B_COL);
h_CC = (T2*)malloc(sizeof(T2) * A_ROW * B_COL);
MatrixINIT<T1>(A_ROW, A_COL, h_A);
MatrixINIT<T1>(B_ROW, B_COL, h_B);
tasks = false;
TIMER_START(t);
#if defined(USE_ISPC_INT)
Test_SGEMM<T1>((SGEMMFuncPtr)SGEMM_tileBlockNoSIMDIntrin_2_int, h_A, h_B,
h_C, A_ROW, A_COL, B_COL, tasks);
#elif defined(USE_ISPC_FLOAT)
Test_SGEMM<T1>((SGEMMFuncPtr)SGEMM_tileBlockNoSIMDIntrin_2_float, h_A, h_B,
h_C, A_ROW, A_COL, B_COL, tasks);
#elif defined(USE_ISPC_DOUBLE)
Test_SGEMM<T1>((SGEMMFuncPtr)SGEMM_tileBlockNoSIMDIntrin_2_double, h_A, h_B,
h_C, A_ROW, A_COL, B_COL, tasks);
#endif
TIMER_STOP(t);
cout << "ISPC单任务花费了:" << TIMER_MSEC(t) << " ms " << endl << endl;
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
tasks = true;
TIMER_START(X);
#if defined(USE_ISPC_INT)
Test_SGEMM<T1>((SGEMMFuncPtr)SGEMM_tileBlockNoSIMDIntrin_2_withTasks_int,
h_A, h_B, h_C, A_ROW, A_COL, B_COL, tasks);
#elif defined(USE_ISPC_FLOAT)
Test_SGEMM<T1>((SGEMMFuncPtr)SGEMM_tileBlockNoSIMDIntrin_2_withTasks_float,
h_A, h_B, h_C, A_ROW, A_COL, B_COL, tasks);
#elif defined(USE_ISPC_DOUBLE)
Test_SGEMM<T1>((SGEMMFuncPtr)SGEMM_tileBlockNoSIMDIntrin_2_withTasks_double,
h_A, h_B, h_C, A_ROW, A_COL, B_COL, tasks);
#endif
TIMER_STOP(X);
cout << "ISPC多任务花费了:" << TIMER_MSEC(X) << " ms " << endl << endl;
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_A, h_B, A_ROW, A_COL, B_COL, h_C, h_CC, 0);
#endif
free(h_A);
free(h_B);
free(h_C);
free(h_CC);
}
自己写的cuda矩阵计算核函数
使用了共享内存进行优化,一些优化的介绍查看之前记录的cuda。
template <typename T1, typename T2>
__global__ void MatrixMulCUDA(const T1* A, const T1 * B, T2* C,
const int ROW_A, const int COL_A, const int ROW_B, const int COL_B)
{
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = blockIdx.y * BLOCK_SIZE + ty;
int col = blockIdx.x * BLOCK_SIZE + tx;
int I = (COL_A + BLOCK_SIZE - 1) / BLOCK_SIZE;
T2 t=0.0f, Csub = 0.0f, comp = 0.0f;
__shared__ float As[BLOCK_SIZE+1][BLOCK_SIZE+1];
__shared__ float Bs[BLOCK_SIZE+1][BLOCK_SIZE+1];
for (int i = 0; i < I; i++) {
if (row < ROW_A && i * BLOCK_SIZE + tx < COL_A)
As[ty][tx] = A[row * COL_A + i * BLOCK_SIZE + tx];
else
As[ty][tx] = 0;
if (col < COL_B && i * BLOCK_SIZE + ty < ROW_B)
Bs[ty][tx] = B[(i * BLOCK_SIZE + ty) * COL_B + col];
else
Bs[ty][tx] = 0;
__syncthreads();
for (int j = 0; j < BLOCK_SIZE; ++j)
{
comp -= As[ty][j] * Bs[j][tx];
t = Csub - comp;
comp = (t - Csub) + comp;
Csub = t;
}
__syncthreads();
}
if (row < ROW_A && col < COL_B)
{
C[row * COL_B + col] = Csub;
}
}
template <typename T1, typename T2>
void My_kernel()
{
cudaEventCreate(&start);
cudaEventCreate(&stop);
T1* h_a, * h_b, * h_c, * h_cc;
cudaHostAlloc((void**)&h_a, sizeof(T1) * A_ROW * A_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_b, sizeof(T1) * B_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_c, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
cudaHostAlloc((void**)&h_cc, sizeof(T2) * A_ROW * B_COL, cudaHostAllocDefault);
MatrixINIT<T1>(A_ROW, A_COL, h_a);
MatrixINIT<T1>(B_ROW, B_COL, h_b);
T1* d_a, * d_b;
T2* d_c;
cudaMalloc((void**)&d_a, sizeof(T1) * A_ROW * A_COL);
cudaMalloc((void**)&d_b, sizeof(T1) * B_ROW * B_COL);
cudaMalloc((void**)&d_c, sizeof(T2) * A_ROW * B_COL);
unsigned int grid_rows = (A_ROW + BLOCK_SIZE - 1) / BLOCK_SIZE;
unsigned int grid_cols = (B_COL + BLOCK_SIZE - 1) / BLOCK_SIZE;
dim3 grid(grid_rows, grid_cols);
dim3 blocks(BLOCK_SIZE, BLOCK_SIZE);
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
TIMER_START(_X);
for (int i = 0; i < N; ++i)
{
cudaMemcpyAsync(d_a, h_a, sizeof(T1) * A_ROW * A_COL, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_b, h_b, sizeof(T1) * B_ROW * B_COL, cudaMemcpyHostToDevice, stream);
cudaEventRecord(start, 0);
MatrixMulCUDA<T1, T2> << < grid, blocks >> > (d_a, d_b, d_c, A_ROW, A_COL, B_ROW, B_COL);
cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime_mykernel, start, stop);
cudaMemcpyAsync(h_c, d_c, sizeof(T2) * A_ROW * B_COL, cudaMemcpyDeviceToHost, stream);
}
TIMER_STOP(_X);
cout <<"mykernel GPU传输、计算花费了: " << TIMER_MSEC(_X) << " ms " << "\n";
std::cout <<"mykernel GPU计算花费了:"<<elapsedTime_mykernel * N<< " ms" << std::endl;
cout << endl;
#if defined(USE_CPU_COST)
cpu_matrix_mult<T1, T2>(h_a, h_b, A_ROW, A_COL, B_COL, h_c, h_cc, 0);
#endif
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_b);
cudaFreeHost(h_a);
cudaFreeHost(h_b);
cudaFreeHost(h_c);
cudaFreeHost(h_cc);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaStreamDestroy(stream);
}
性能对比平台
CPU 英特尔八代i7-8550U, GPU: NVIDIA GeForce MAX150; 均为500个epoch的测试,
FLAOT
32*32的矩阵:
64*64的矩阵:
128*128的矩阵:
256*256的矩阵:
512*512 的矩阵:
小结
当矩阵正在64 左右,ISPC单任务的代码是运行的最快的。当矩阵数量在64 以上是ISPC的多任务快于单核,当矩阵达到256 以上时,cublas的多流运行的最快。相比自己手撸的cuda函数运行速度和cublas的完全不是一个数量级的,优化之路还长啊。
INT8
512*512 的矩阵:
小结:
使用INT8 计算时比FLOAT32 的速度提升有三倍之多,没有达到理论的四倍速度比,受到带宽和其他一些资源调度的影响。不过已经快到飞起来。
其他注意事项
注意我这cublas_.cpp是CPP结尾的文件,不是cu结尾的文件,选中cublas_.cpp,然后点击右键配置属性,使用时记得把它改为 CUDA C/C++的。
更多(代码)
也不知道写点啥了,代码都有注释,有关更多的比较测试,读者们可自行尝试。 所有的资源代码已上传至github;
|