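While a program runs, every access to main memory is made in units of one CPU cache line, which is 64 B on most mainstream CPUs. A hit in the L1 cache costs about 1 ns, a hit in L2 about 3 ns, a hit in L3 about 12 ns, and an access that goes all the way to main memory about 50-100 ns. These figures all come from articles found online.

How can we write some code to measure the memory access time ourselves? The result should come out close to 50-100 ns. This post describes one test method:

1. Allocate several contiguous regions of virtual memory, 4 GB in total;
2. Walk the 4 GB at 4 KB (page size) intervals and write a value to the first byte of every page, so that the system actually allocates physical memory;
3. Access the 4 GB sequentially with strides of 64 B, 128 B, ... 4096 B, and time each run.

Note: the test machine should be as idle as possible, with no service programs running, so that the CPU is not switched away to other work during the test and the timing is not distorted.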
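/* Memory access time test: times strided writes and reads over 4 GB. */
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
/* Current wall-clock time in microseconds. */
long long ustime(void)
{
    struct timeval tv;
    long long ust;
    gettimeofday(&tv, NULL);
    /* Cast to long long so the multiplication cannot overflow where long is 32-bit. */
    ust = ((long long)tv.tv_sec)*1000000;
    ust += tv.tv_usec;
    return ust;
}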
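#define LOOPLEN (4*1024)      /* number of 1 MB blocks under test: 4 GB in total */
#define STEPLEN 64            /* access stride in bytes: 64, 128, ... 4096 */
#define LOOPLENREFRSH 512     /* number of 1 MB blocks written to flush the caches */
typedef struct bigData{
    char buf[1024*1024];      /* one 1 MB block */
} bigData;
struct bigData * apstData[LOOPLEN];
struct bigData * apstDataTmp[LOOPLENREFRSH];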
/* Time one strided write pass and one strided read pass over all of apstData. */
void test(void)
{
    long long t1;
    long long t2;
    long long diff;
    int i;
    size_t j;
    int count = 0;
    struct bigData * tmpData;
    int iN = 0;
    printf("------------------------\n");
    /* Write pass: store one int every STEPLEN bytes. */
    t1 = ustime();
    for (i = 0; i < LOOPLEN; i++)
    {
        tmpData = apstData[i];
        for (j = 0; j < sizeof(struct bigData); j += STEPLEN)
        {
            *((int *)((char *)tmpData + j)) = 1;
            count ++;
        }
    }
    t2 = ustime();
    diff = t2-t1;
    /* diff is in microseconds; multiply by 1000 to report nanoseconds. */
    printf("diff time:%lldns, count:%d, memory write time: %lldns\n",diff*1000, count, diff*1000/count);
    /* Read pass: load one int every STEPLEN bytes. The sum is printed so the loads cannot be optimized away. */
    count = 0;
    t1 = ustime();
    for (i = 0; i < LOOPLEN; i++)
    {
        tmpData = apstData[i];
        for (j = 0; j < sizeof(struct bigData); j += STEPLEN)
        {
            iN += *((int *)((char *)tmpData + j));
            count ++;
        }
    }
    t2 = ustime();
    diff = t2-t1;
    printf("diff time:%lldns, count:%d, memory read time: %lldns, result:%d\n",diff*1000, count, diff*1000/count, iN);
    return;
}
/* Write to every cache line of the 512 MB scratch buffer to push the test data out of the caches. */
void refreshCache(void)
{
    int i;
    size_t j;
    struct bigData * tmpData;
    for (i = 0; i < LOOPLENREFRSH; i++)
    {
        tmpData = apstDataTmp[i];
        for (j = 0; j < sizeof(struct bigData); j += 64)
        {
            *((int *)((char *)tmpData + j)) = 1;
        }
    }
    return;
}
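/* Allocate iLoop blocks of 1 MB, then touch one byte per 4 KB page so the system backs them with physical memory. */
void alloc(struct bigData ** apstDataParam, int iLoop)
{
    int i;
    size_t j;
    struct bigData * tmpData;
    char *pcTmp;
    printf("start alloc virtual memory set value\n");
    for (i = 0; i < iLoop; i++)
    {
        apstDataParam[i] = malloc(sizeof(struct bigData));
        if (apstDataParam[i] == NULL)
        {
            /* Stop early rather than write through a null pointer below. */
            perror("malloc");
            exit(1);
        }
    }
    printf("start physical memory set value\n");
    for (i = 0; i < iLoop; i++)
    {
        tmpData = apstDataParam[i];
        for (j = 0; j < sizeof(struct bigData); j += 4096)
        {
            pcTmp = (char *)tmpData + j;
            *pcTmp = '1';
        }
    }
}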
int main(void)
{
    int round;
    alloc(apstData, LOOPLEN);
    alloc(apstDataTmp, LOOPLENREFRSH);
    /* Run the test four times, flushing the caches before each run. */
    for (round = 0; round < 4; round++)
    {
        refreshCache();
        test();
    }
    return 0;
}
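ARM test:
1. STEPLEN set to 64, Raspberry Pi 4B, arm64 (a complete machine for about 600 yuan, and with 8 GB of RAM at that). Result: writes about 64 ns, reads 15 ns. The read time does not look accurate.
2. STEPLEN set to 4096, Raspberry Pi 4B, arm64. Result: writes about 42 ns, reads 40 ns, somewhat closer to what was expected.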
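Intel test:
1. STEPLEN set to 64, Tencent Cloud virtual machine. Result: writes 7 ns, reads 6 ns. There are probably a large number of cache hits.
2. STEPLEN set to 4096, Tencent Cloud virtual machine. Result: writes 45 ns, reads 20 ns.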
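Overall conclusion:
arm64: write about 65 ns, read about 40 ns.
x86_64: write about 45 ns, read about 20 ns.
These results are not wildly off. The L1 and L2 cache mechanisms differ between architectures, so the access stride STEPLEN should be set to several different multiples of 64 and each one tested, to reduce cache hits and make the result more accurate.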