hanhan博客 - 内存传输与数组求和

内存传输与数组求和

2024-6-16

CUDA

在本篇文章中，我们围绕数组求和这个例子进行。我们将会介绍内存传输、错误处理和时间统计。

1. 内存传输

此处介绍的内存传输，主要是涉及设备内存相关的。我们可以结合主机上的相关函数进行类比理解。

cudaMalloc 用于在 GPU 内存中分配空间；cudaMemset 用于设置 GPU 内存内容；cudaFree 用于释放 GPU 内存。

C	CUDA
malloc	cudaMalloc
memset	cudaMemset
free	cudaFree

内存传输我们使用 cudaMemcpy，涉及主机到设备、设备到主机和设备到设备。

不涉及主机到主机，主机到主机就用 memcpy。

cudaMemcpy 的函数原型为：

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);

其中，dst 是目标地址；src 是源地址；count 是要复制的字节数；kind 是要复制的类型，cudaMemcpyHostToHost 主机到主机、cudaMemcpyHostToDevice 主机到设备、cudaMemcpyDeviceToHost 设备到主机、cudaMemcpyDeviceToDevice 设备到设备。

我们看一个最基本的样例。如代码清单 1 所示，我们先在主机上初始化一个随机内容的数组。接着使用 cudaMalloc 申请一块设备内存，然后使用 cudaMemcpy 把初始化的主机内容传输到申请的这块设备内存上。这样就能把这块设备内存传递给核函数使用。最后不要忘记使用 cudaFree 释放设备内存。

代码清单 1 内存传输

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <random>
#include <vector>
__global__ void mem_trs_test(int* input, int size)
{
int gid = blockIdx.x * blockDim.x + threadIdx.x;
if (gid < size)
printf("id=%d, value=%d\n", gid, input[gid]);
}
void fill_vector_with_random(std::vector<int>& vec, int min, int max)
{
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> distrib(min, max);
for (int i = 0; i < vec.size(); i++)
vec[i] = distrib(gen);
}
int main()
{
int size = 150;
int byte_size = size * sizeof(int);
std::vector<int> h_input(size);
fill_vector_with_random(h_input, 0, 0xff);
int* d_input;
cudaMalloc(&d_input, byte_size);
cudaMemcpy(d_input, h_input.data(), byte_size, cudaMemcpyHostToDevice);
dim3 block(64);
dim3 grid(3);
mem_trs_test << <grid, block>> > (d_input, size);
cudaDeviceSynchronize();
cudaFree(d_input);
cudaDeviceReset();
return 0;
}

执行配置可能会设置的比数组规模大，所以核函数参数中多增加一个 size 参数，用于指示数组规模。

试了一下“越界”访问，也没引起什么“异常”。访问到的数据内容都是 0。

2. 数组求和

我们首先实现 GPU 版本的数组求和。如代码清单 2.1 所示，我们传入三个数组，把前两个数组的各个元素的和，存放到第三个数组中。各元素计算没有依赖，可以直接并行。

代码清单 2.1 数组求和

__global__ void sum_array_gpu(int* a, int* b, int* c, int size)
{
int gid = blockIdx.x * blockDim.x + threadIdx.x;
if (gid < size)
{
c[gid] = a[gid] + b[gid];
}
}

计算的再快，如果计算结果是错的，那也没用。所以如代码清单 2.2 所示，我们也实现 CPU 版本的数组求和。并实现数组比较函数。

代码清单 2.2 核对

void sum_array_cpu(const std::vector<int>& a, const std::vector<int>& b, std::vector<int>& c)
{
for (int i = 0; i < c.size(); i++)
c[i] = a[i] + b[i];
}
void compare_arrays(const std::vector<int>& a, const std::vector<int>& b)
{
if (a == b)
printf("Arrays are same\n");
else
fprintf(stderr, "Arrays are different\n");
}

代码清单 2.3 是具体的求和以及核对流程。如第 7 至 15 行所示，我们初始化两个随机数组，并使用 CPU 计算它们的数组和。

第 18 至 25 行，我们初始化核函数的输入参数。我们把两个主机数组内容复制到设备内存上。

第 28 至 33 行，我们设置执行配置，运行核函数，并等待执行完毕。

第 35 行，我们把设备内存上计算得到的结果，复制到主机内存上，用于后续核对。

第 38 行，我们核对 CPU 计算的结果与 GPU 计算的结果，比较是否相同。

代码清单 2.3 数组求和并核对

int main()
{
int size = 10000;
int byte_size = size * sizeof(int);
// host pointers
std::vector<int> h_a(size);
std::vector<int> h_b(size);
std::vector<int> h_c(size);
std::vector<int> h_gpu_results(size);
fill_vector_with_random(h_a, 0, 0xff);
fill_vector_with_random(h_b, 0, 0xff);
sum_array_cpu(h_a, h_b, h_c);
// device pointers
int* d_a, * d_b, * d_c;
cudaMalloc(&d_a, byte_size);
cudaMalloc(&d_b, byte_size);
cudaMalloc(&d_c, byte_size);
// memory transfer from host to device
cudaMemcpy(d_a, h_a.data(), byte_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b.data(), byte_size, cudaMemcpyHostToDevice);
// launch the grid
int block_size = 128;
dim3 block(block_size);
dim3 grid((size + block_size - 1) / block_size);
sum_array_gpu << <grid, block>> > (d_a, d_b, d_c, size);
cudaDeviceSynchronize();
cudaMemcpy(h_gpu_results.data(), d_c, byte_size, cudaMemcpyDeviceToHost);
// array comparison
compare_arrays(h_c, h_gpu_results);
cudaFree(d_c);
cudaFree(d_b);
cudaFree(d_a);
cudaDeviceReset();
return 0;
}

注意这边的 CUDA 内存相关操作单位都是字节。不要写错了。

3. 错误处理

CUDA API 的调用可能会因为各种原因失败，比如内存分配失败、非法操作、设备不支持等。所以生产环境上，规范的做法是检查 CUDA API 调用的返回值。

和平时的 CPU 编程要求是一样的。比如，生产环境上也要检查 malloc 是否分配成功。

CUDA 错误码通过 cudaError_t 枚举类型返回。可以使用 cudaGetErrorString 函数，将错误码转成描述性字符串，方便调试。

比如，在代码清单 3.1 中，我们对 cudaMalloc 进行了错误处理。

代码清单 3.1 错误处理

cudaError error;
// device pointers
int* d_a, * d_b, * d_c;
error = cudaMalloc(&d_a, byte_size);
if (error != cudaSuccess)
{
fprintf(stderr, "Error: %s\n", cudaGetErrorString(error));
}

如果仅是错误打印逻辑，并且嫌代码太冗长的话，我们可以使用代码清单 3.2 中的宏，用它“包裹”调用的 CUDA API 函数。

代码清单 3.2 错误打印宏

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}

4. 时间统计

最后，我们对代码中的关键执行过程进行时间统计。以此感受一下 CPU 和 GPU 的执行时间差异。

如代码清单 4 所示，我们对以下流程进行统计：

1. CPU 数组求和时间。对应第 16 行的过程。

2. 主机内存传输到设备内存的时间。对应第 27 至 28 行的过程。

3. GPU 数组求和时间。对应第 37 至 38 行的过程。

4. 设备内存传输到主机内存的时间。对应第 42 行的过程。

GPU 编程要算上内存传输的额外消耗，所以 2 + 3 + 4 就是 GPU 进行数组求和的总时间。

代码清单 4 时间统计

int main()
{
int size = 10000;
int byte_size = size * sizeof(int);
// host pointers
std::vector<int> h_a(size);
std::vector<int> h_b(size);
std::vector<int> h_c(size);
std::vector<int> h_gpu_results(size);
fill_vector_with_random(h_a, 0, 0xff);
fill_vector_with_random(h_b, 0, 0xff);
auto start_cpu = std::chrono::high_resolution_clock::now();
sum_array_cpu(h_a, h_b, h_c);
auto end_cpu = std::chrono::high_resolution_clock::now();
// device pointers
int* d_a, * d_b, * d_c;
cudaMalloc(&d_a, byte_size);
cudaMalloc(&d_b, byte_size);
cudaMalloc(&d_c, byte_size);
// memory transfer from host to device
auto start_memcpy_h2d = std::chrono::high_resolution_clock::now();
cudaMemcpy(d_a, h_a.data(), byte_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b.data(), byte_size, cudaMemcpyHostToDevice);
auto end_memcpy_h2d = std::chrono::high_resolution_clock::now();
// launch the grid
int block_size = 128;
dim3 block(block_size);
dim3 grid((size + block_size - 1) / block_size);
auto start_gpu = std::chrono::high_resolution_clock::now();
sum_array_gpu << <grid, block>> > (d_a, d_b, d_c, size);
cudaDeviceSynchronize();
auto end_gpu = std::chrono::high_resolution_clock::now();
auto start_memcpy_d2h = std::chrono::high_resolution_clock::now();
cudaMemcpy(h_gpu_results.data(), d_c, byte_size, cudaMemcpyDeviceToHost);
auto end_memcpy_d2h = std::chrono::high_resolution_clock::now();
// array comparison
compare_arrays(h_c, h_gpu_results);
printf("CPU sum time: %lld us\n",
std::chrono::duration_cast<std::chrono::microseconds>(end_cpu - start_cpu).count());
auto gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_gpu - start_gpu);
printf("GPU kernel execution time sum time: %lld us\n", gpu_duration.count());
auto h2d_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_memcpy_h2d - start_memcpy_h2d);
printf("Mem transfer host to device: %lld us\n", h2d_duration.count());
auto d2h_duration = std::chrono::duration_cast<std::chrono::microseconds>(end_memcpy_d2h - start_memcpy_d2h);
printf("Mem transfer device to host: %lld us\n", d2h_duration.count());
printf("Total GPU time: %lld us\n",
(gpu_duration + h2d_duration + d2h_duration).count());
cudaFree(d_c);
cudaFree(d_b);
cudaFree(d_a);
cudaDeviceReset();
return 0;
}

运行程序，打印的内容如下，可以看到此时 CPU 的执行速度是快于 GPU 的。

CPU sum time: 123 us
GPU kernel execution time sum time: 4147 us
Mem transfer host to device: 287 us
Mem transfer device to host: 48 us
Total GPU time: 4482 us

应该是现在的操作简单，且数据规模不大。我们把数组的大小扩大 100 倍，可以发现此时 GPU 就比 CPU 快了。同时可以看到 GPU 执行的时间基本稳定，主要是数据传输的耗时变大。

CPU sum time: 10069 us
GPU kernel execution time sum time: 4399 us
Mem transfer host to device: 1591 us
Mem transfer device to host: 737 us
Total GPU time: 6727 us

Tongtong

内存传输与数组求和

1. 内存传输

2. 数组求和

3. 错误处理

4. 时间统计