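15 FPGA Implementation of the RISC-V CPU
Previously we implemented a RISC-V CPU in software. Although it was modeled with HDL4SE, it was not RTL and could not run directly on hardware; at best it was a CModel of a RISC-V CPU. This time we implement a RISC-V CPU written in Verilog that runs on an FPGA. The emphasis, however, is on the development path from the software CModel to the FPGA. As we will see, this path works better than writing Verilog directly: many design iterations are carried out in the software model, and once the software model reaches RTL, re-implementing it in a hardware description language follows naturally. Conversely, looking only at the final Verilog code, it is hard to imagine arriving at that design directly.
15.1 Goals
15.1.1 FPGA development board
FPGA application development requires a fair amount of background knowledge: besides the Verilog language there is also hardware-related material to learn. We chose the DE1-SoC, a board that makes few demands on hardware skills. It uses an Altera (Intel) Cyclone V SoC FPGA, which contains an ARM processor; we do not use the ARM this time, only the FPGA fabric.
The DE1-SoC provides CLK (50 MHz external input), GPIO (two headers, about 70 inout pins in total), KEY (4 inputs, push buttons that return automatically when released), SW (10 inputs, two-state slide switches), 7-segment displays (common anode, 6 digits), LEDs (10), 64 MB of SDRAM, video input, video output (VGA DAC), as well as PS/2, IR and audio interfaces; on the CPU side it can also connect to Ethernet, USB, an SD card and so on. On the software side there is a tool that creates the project automatically, which saves most of the work of assigning FPGA pins. This makes the board a good entry point for software engineers and quite suitable for CPU development.
Our goal this time is to get the RISC-V CPU running on this board and to implement a counter in software, displaying the count on the board and using KEY or SW to control the counter.
[Figure: external interface diagram of the DE1-SoC, from the Terasic website]
We use the SystemBuilder tool shipped with the DE1-SoC to generate the FPGA top-level module and the Altera project files. The generated project already has the FPGA pin assignments configured. The top-level module is as follows: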
module de1_riscv(
output ADC_CONVST,
output ADC_DIN,
input ADC_DOUT,
output ADC_SCLK,
input AUD_ADCDAT,
inout AUD_ADCLRCK,
inout AUD_BCLK,
output AUD_DACDAT,
inout AUD_DACLRCK,
output AUD_XCK,
input CLOCK2_50,
input CLOCK3_50,
input CLOCK4_50,
input CLOCK_50,
output [12:0] DRAM_ADDR,
output [1:0] DRAM_BA,
output DRAM_CAS_N,
output DRAM_CKE,
output DRAM_CLK,
output DRAM_CS_N,
inout [15:0] DRAM_DQ,
output DRAM_LDQM,
output DRAM_RAS_N,
output DRAM_UDQM,
output DRAM_WE_N,
output FPGA_I2C_SCLK,
inout FPGA_I2C_SDAT,
output [6:0] HEX0,
output [6:0] HEX1,
output [6:0] HEX2,
output [6:0] HEX3,
output [6:0] HEX4,
output [6:0] HEX5,
input IRDA_RXD,
output IRDA_TXD,
input [3:0] KEY,
output [9:0] LEDR,
inout PS2_CLK,
inout PS2_CLK2,
inout PS2_DAT,
inout PS2_DAT2,
input [9:0] SW,
input TD_CLK27,
input [7:0] TD_DATA,
input TD_HS,
output TD_RESET_N,
input TD_VS,
output VGA_BLANK_N,
output [7:0] VGA_B,
output VGA_CLK,
output [7:0] VGA_G,
output VGA_HS,
output [7:0] VGA_R,
output VGA_SYNC_N,
output VGA_VS,
inout [35:0] GPIO
);
endmodule
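We first modify it to verify that the board runs correctly:
module de1_riscv(
/* same port list as in the generated top-level module above */
);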
wire wClk = CLOCK_50;
wire nwReset = KEY[3];
reg [6:0] led0;
reg [6:0] led1;
reg [6:0] led2;
reg [6:0] led3;
reg [6:0] led4;
reg [6:0] led5;
assign HEX0 = ~led0;
assign HEX1 = ~led1;
assign HEX2 = ~led2;
assign HEX3 = ~led3;
assign HEX4 = ~led4;
assign HEX5 = ~led5;
always @(posedge wClk) begin
if (!nwReset) begin
led0 <= 8'h3f;
led1 <= 8'h3f;
led2 <= 8'h3f;
led3 <= 8'h3f;
led4 <= 8'h3f;
led5 <= 8'h3f;
end else begin
if (SW[8]) begin
led0 <= 8'h06;
led1 <= 8'h06;
led2 <= 8'h06;
led3 <= 8'h07;
led4 <= 8'h07;
led5 <= 8'h07;
end
else if (SW[9]) begin
led0 <= 8'h3f;
led1 <= 8'h06;
led2 <= 8'h5b;
led3 <= 8'h4f;
led4 <= 8'h66;
led5 <= 8'h6d;
end
end
end
endmodule
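After synthesizing this code with Quartus II and downloading it to the FPGA board (these steps are not described in detail here), the 7-segment displays show digits. Pressing KEY[3] shows 000000; after releasing it, flipping switches SW[9] and SW[8] shows 543210 and 777111 respectively (SW[8] takes priority when both are on). This confirms that the FPGA board is working.
15.1.2 Design goals
Our goal this time is to rewrite the earlier RISC-V CPU core in Verilog, run it on the FPGA development board, and run the earlier counter software on it, so that the count is shown on the 7-segment displays and the key state is read back to control the counting (clear, pause, resume). The 7-segment display is attached as a hardware device to the CPU's external read/write port: writing to addresses 0xF0000010 and 0xF0000014 controls what is displayed, and reading 0xF0000000 returns the input state. The counter software is as follows: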
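The constants in the test module are 7-segment patterns, and the `~` in the `assign HEXn = ~ledn` lines is needed because the displays are common-anode, so a segment lights when its pin is driven low. The following C sketch illustrates that encoding. The helper names `seg_pattern` and `hex_pins` are invented for illustration, and the bit order (segment a in bit 0 up to segment g in bit 6) is an assumption that is consistent with the constants used above.

```c
#include <assert.h>
#include <stdint.h>

/* Active-high segment patterns for the digits 0..9.
   Assumed bit order: bit 0 = segment a, ..., bit 6 = segment g. */
static const uint8_t SEGCODE[10] = {
    0x3f, 0x06, 0x5b, 0x4f, 0x66, 0x6d, 0x7d, 0x07, 0x7f, 0x6f
};

/* Active-high pattern for one decimal digit (hypothetical helper). */
uint8_t seg_pattern(unsigned digit)
{
    assert(digit < 10);
    return SEGCODE[digit];
}

/* Level driven on the 7 HEX pins of a common-anode display:
   a lit segment is driven low, so the pattern is inverted
   and truncated to 7 bits. */
uint8_t hex_pins(unsigned digit)
{
    return (uint8_t)(~seg_pattern(digit) & 0x7f);
}
```

With this encoding the digit 8 (all segments lit) drives every pin low, and the digit 0 leaves only the pin of segment g high.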
const unsigned int segcode[10] =
{
0x3F,
0x06,
0x5B,
0x4F,
0x66,
0x6d,
0x7d,
0x07,
0x7f,
0x6f,
};
unsigned int num2seg(unsigned int num)
{
return segcode[num % 10];
}
int main(int argc, char* argv[])
{
unsigned long long count, ctemp;
int countit = 1;
unsigned int* ledkey = (unsigned int*)0xF0000000;
unsigned int* leddata = (unsigned int*)0xf0000010;
count = 0;
leddata[0] = 0x6f7f077d;
leddata[1] = 0x6d664f5b;
do {
unsigned int key;
key = *ledkey;
if (key & 1) {
count = 0;
}
else if (key & 2) {
countit = 0;
}
else if (key & 4) {
countit = 1;
}
if (countit)
count++;
ctemp = count;
leddata[0] = num2seg(ctemp) |
((num2seg(ctemp / 10ll)) << 8) |
((num2seg(ctemp / 100ll)) << 16) |
((num2seg(ctemp / 1000ll)) << 24);
ctemp /= 10000ll;
leddata[1] = num2seg(ctemp) |
((num2seg(ctemp / 10ll)) << 8) |
((num2seg(ctemp / 100ll)) << 16) |
((num2seg(ctemp / 1000ll)) << 24);
ctemp /= 10000ll;
leddata[2] = num2seg(ctemp) |
((num2seg(ctemp / 10ll)) << 8);
} while (1);
return 1;
}
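This code is compiled and linked with the RISC-V toolchain prepared earlier, producing an ELF file. The toolchain's objcopy then converts it to a format the FPGA tools can read. (The ihex format is nominally supported, but for reasons we could not determine the Altera ModelSim simulation never read it correctly, so we emit the verilog format instead and convert it to a MIF file when the software simulation starts.) We intend to keep both code and data in FPGA RAM: the FPGA RAM IP can be initialized with data, so the data file generated from the ELF file is loaded into one FPGA RAM.
This also involves the memory image layout chosen when the ELF file is produced. With the default toolchain link settings the code starts at 0x00010074, leaving the first 64 KB empty. For this application we want to use as few FPGA resources as possible. For example, 8 KB of RAM is enough to run it: 4 KB for code and read-only data, and 4 KB for the program's data area and stack (this application does not call dynamic memory management APIs such as malloc, so no heap is implemented).
We therefore modified the default linker script so that the program starts at 0x00000000 and runs entirely within an 8 KB address space. See the git repository for the modified linker script.
15.2 From CModel to RTL
Like SystemC, HDL4SE is used to model digital circuits, and its biggest advantage is access to C/C++ resources. This makes modeling very convenient: a model of the digital circuit can be built quickly and simulated, and the software toolchain and the algorithms can be verified. When working this way we assume that computing and storage resources are unlimited, and we describe algorithms with whatever C/C++ constructs are convenient. A model built like this is usually not RTL and can only serve as a CModel of the circuit. Moving from CModel to RTL means gradually replacing the C/C++ constructs used during modeling until everything is expressed with HDL4SE modeling constructs. For this RISC-V model we do it in several steps.
15.2.1 Memory implementation
In the earlier RISC-V implementation, memory was implemented with a C pointer. This is the model definition from the previous section: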
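The conversion tool itself is not shown in the text. As a rough illustration of the target format only, the following C sketch formats an array of 32-bit words as MIF text. The function name `mif_format` and its buffer-based interface are invented here, and the sketch assumes one MIF entry per 32-bit RAM word with hexadecimal address and data radix.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format `depth` 32-bit words as MIF text into buf.
   Returns the number of characters written (excluding the NUL),
   or -1 if the buffer is too small. Hypothetical helper. */
int mif_format(char *buf, size_t bufsize, const uint32_t *words, unsigned depth)
{
    size_t used;
    unsigned i;
    int n;

    assert(buf != NULL && words != NULL);

    /* Header: word width, number of words, and the radixes used below. */
    n = snprintf(buf, bufsize,
                 "WIDTH=32;\nDEPTH=%u;\nADDRESS_RADIX=HEX;\nDATA_RADIX=HEX;\n"
                 "CONTENT BEGIN\n", depth);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    used = (size_t)n;

    /* One "address : data;" line per word. */
    for (i = 0; i < depth; i++) {
        n = snprintf(buf + used, bufsize - used, "%X : %08X;\n",
                     i, (unsigned)words[i]);
        if (n < 0 || (size_t)n >= bufsize - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, bufsize - used, "END;\n");
    if (n < 0 || (size_t)n >= bufsize - used)
        return -1;
    used += (size_t)n;
    return (int)used;
}
```

For the 8 KB RAM described above, `depth` would be 2048 and the words would come from the image extracted from the ELF file.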
MODULE_DECLARE(riscv_core)
unsigned int *ram;
unsigned int regs[32];
unsigned int ramsize;
unsigned int dstreg;
unsigned int dstvalue;
unsigned int ramdstaddr;
unsigned int ramdstvalue;
unsigned int ramdstwidth;
END_MODULE_DECLARE(riscv_core)
......
MODULE_INIT(riscv_core)
int i;
pobj->ramsize = RAMSIZE * 4;
pobj->ram = malloc(pobj->ramsize);
loadExecImage(pobj->ram, pobj->ramsize);
......
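As you can see, the ram here is implemented with a C pointer, which has no counterpart in hardware. Our first step is therefore to move the RAM outside the model and access it through the model's read/write interface signals as well.
Concretely, we use Altera's IP generation tool to generate a RAM with a 32-bit word and 2048 words in total (8 KB). The generated RAM has a single read/write port, with the following interface: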
module ram8kb (
address,
byteena,
clock,
data,
wren,
q);
input [10:0] address;
input [3:0] byteena;
input clock;
input [31:0] data;
input wren;
output [31:0] q;
endmodule
See the Altera documentation for details of its use. To simulate it in the HDL4SE system, we build an HDL4SE model with the same interface:
#define riscv_ram_MODULE_VERSION_STRING "0.4.0-20210825.0610 RISCV RAM cell"
#define riscv_ram_MODULE_CLSID CLSID_HDL4SE_RISCV_RAM
#define M_ID(id) riscv_ram##id
IDLIST
VID(address),
VID(byteena),
VID(clock),
VID(data),
VID(wren),
VID(q),
VID(lastaddr),
END_IDLIST
MODULE_DECLARE(riscv_ram)
unsigned int* ram;
unsigned int ramaddr;
unsigned int ramwrdata;
unsigned int ramwren;
unsigned int rambyteena;
END_MODULE_DECLARE(riscv_ram)
DEFINE_FUNC(riscv_ram_gen_q, "address, byteena, data, wren, lastaddr") {
unsigned int lastaddr;
lastaddr = vget(lastaddr);
if (lastaddr < RAMSIZE)
vput(q, pobj->ram[vget(lastaddr)]);
else
vput(q, 0xdeadbeef);
} END_DEFINE_FUNC
DEFINE_FUNC(riscv_ram_clktick, "") {
pobj->ramwren = vget(wren);
pobj->ramwrdata = vget(data);
pobj->rambyteena = vget(byteena);
pobj->ramaddr = vget(address);
vput(lastaddr, vget(address));
} END_DEFINE_FUNC
DEFINE_FUNC(riscv_ram_deinit, "") {
if (pobj->ram != NULL)
free(pobj->ram);
} END_DEFINE_FUNC
DEFINE_FUNC(riscv_ram_setup, "") {
if (pobj->ramwren) {
unsigned int mask =
(pobj->rambyteena & 1 ? 0x000000ff : 0)
| (pobj->rambyteena & 2 ? 0x0000ff00 : 0)
| (pobj->rambyteena & 4 ? 0x00ff0000 : 0)
| (pobj->rambyteena & 8 ? 0xff000000 : 0);
pobj->ram[pobj->ramaddr] = (pobj->ram[pobj->ramaddr] & (~mask))
| (pobj->ramwrdata & mask);
}
pobj->ramwren = 0;
} END_DEFINE_FUNC
static int loadExecImage(unsigned char* data, int maxlen)
{
....
}
MODULE_INIT(riscv_ram)
pobj->ram = malloc(RAMSIZE * 4);
loadExecImage(pobj->ram, RAMSIZE * 4);
pobj->ramwren = 0;
PORT_IN(clock, 1);
PORT_IN(wren, 1);
PORT_IN(address, 11);
PORT_IN(data, 32);
PORT_IN(byteena, 4);
GPORT_OUT(q, 32, riscv_ram_gen_q);
REG(lastaddr, 11);
CLKTICK_FUNC(riscv_ram_clktick);
SETUP_FUNC(riscv_ram_setup);
DEINIT_FUNC(riscv_ram_deinit);
END_MODULE_INIT(riscv_ram)
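This model also relies heavily on C/C++ constructs, but since it stands in for an FPGA IP that Altera generates in the real design, the description here is only used for HDL4SE simulation, so that does not matter.
The problem we do face is that a read from this memory has one cycle of latency, and reads and writes cannot happen at the same time. The earlier modeling style, where C/C++ code read and wrote memory within the same cycle, therefore has to change. We describe this together with the register file changes below.
15.2.2 Register file
RISC-V (RV32) has 32 32-bit general-purpose registers, x0 to x31, of which x0 always reads as zero, plus a separate PC. These registers could of course be implemented with HDL4SE registers, but reading and writing a register by number is really a multiplexer circuit. To simplify the circuit we also implement the registers with a single-port RAM, again placed outside the CPU (the PC stays inside the core as a real register), and add a corresponding interface for register access. The register file is in effect just a RAM; the interface generated by the Altera IP tool is: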
module regfile (
address,
byteena,
clock,
data,
wren,
q);
input [4:0] address;
input [3:0] byteena;
input clock;
input [31:0] data;
input wren;
output [31:0] q;
endmodule
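For HDL4SE simulation we again build a model in the HDL4SE modeling language, shown below. We additionally expose an access interface for each register, so that the value of every register can be recorded in the VCD file during simulation: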
#define riscv_regfile_MODULE_VERSION_STRING "0.4.0-20210825.1540 RISCV REGFILE cell"
#define riscv_regfile_MODULE_CLSID CLSID_HDL4SE_RISCV_REGFILE
#define M_ID(id) riscv_regfile##id
IDLIST
VID(address),
VID(byteena),
VID(clock),
VID(data),
VID(wren),
VID(q),
VID(lastaddr),
VID(x1),
......
VID(x31),
END_IDLIST
#define REGCOUNT 32
MODULE_DECLARE(riscv_regfile)
unsigned int ram[REGCOUNT];
unsigned int ramaddr;
unsigned int ramwrdata;
unsigned int ramwren;
unsigned int rambyteena;
END_MODULE_DECLARE(riscv_regfile)
DEFINE_FUNC(riscv_regfile_gen_q, "address, byteena, data, wren, lastaddr") {
unsigned int lastaddr;
lastaddr = vget(lastaddr);
if (lastaddr == 0)
vput(q, 0);
else
if (lastaddr < REGCOUNT)
vput(q, pobj->ram[vget(lastaddr)]);
else {
printf("We have %d registers only, but you want to read %d\n", REGCOUNT, lastaddr);
}
} END_DEFINE_FUNC
DEFINE_FUNC(riscv_regfile_clktick, "") {
pobj->ramwren = vget(wren);
pobj->ramwrdata = vget(data);
pobj->rambyteena = vget(byteena);
pobj->ramaddr = vget(address);
vput(lastaddr, vget(address));
} END_DEFINE_FUNC
DEFINE_FUNC(riscv_regfile_setup, "") {
if (pobj->ramwren) {
unsigned int mask =
(pobj->rambyteena & 1 ? 0x000000ff : 0)
| (pobj->rambyteena & 2 ? 0x0000ff00 : 0)
| (pobj->rambyteena & 4 ? 0x00ff0000 : 0)
| (pobj->rambyteena & 8 ? 0xff000000 : 0);
pobj->ram[pobj->ramaddr] = (pobj->ram[pobj->ramaddr] & (~mask))
| (pobj->ramwrdata & mask);
}
pobj->ramwren = 0;
} END_DEFINE_FUNC
DEFINE_FUNC(riscv_regfile_register, "wren, data, byteena, address") {
int i;
for (i = 1; i < 32; i++)
vput_idx(VID(x1) + i - 1, pobj->ram[i]);
} END_DEFINE_FUNC
MODULE_INIT(riscv_regfile)
pobj->ramwren = 0;
PORT_IN(clock, 1);
PORT_IN(wren, 1);
PORT_IN(address, 5);
PORT_IN(data, 32);
PORT_IN(byteena, 4);
GPORT_OUT(q, 32, riscv_regfile_gen_q);
REG(lastaddr, 5);
GWIRE(x1, 32, riscv_regfile_register);
......
GWIRE(x31, 32, riscv_regfile_register);
CLKTICK_FUNC(riscv_regfile_clktick);
SETUP_FUNC(riscv_regfile_setup);
END_MODULE_INIT(riscv_regfile)
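Because this register file has a single read/write port and cannot read and write at the same time, register access inside the CPU is constrained: it is no longer possible to read two source registers and write one destination register in a single cycle. As a result the interface of the RISC-V CPU model changes somewhat, and its implementation changes considerably. Below is the model interface, with the added register read/write signals: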
(*
HDL4SE="LCOM",
CLSID="638E8BC3-B0E0-41DC-9EDD-D35A39FD8051",
softmodule="hdl4se"
*)
module riscv_core(
input wClk, nwReset,
output wWrite,
output [31:0] bWriteAddr,
output [31:0] bWriteData,
output [3:0] bWriteMask,
output reg wRead,
output reg [31:0] bReadAddr,
input [31:0] bReadData,
output reg [4:0] regno,
output reg [3:0] regena,
output reg [31:0] regwrdata,
output reg regwren,
input [31:0] regrddata
);
With this, the FPGA top-level module can be written as:
`define USECLOCK50_1
module de1_riscv(
/* same port list as in the generated top-level module above */
);
`ifdef USECLOCK50
wire wClk = CLOCK_50;
`else
wire clk100MHz, clk75MHz, clklocked;
clk100M clk100(.refclk(CLOCK_50),
.rst(~KEY[3]),
.outclk_0(clk100MHz),
.outclk_1(clk75MHz),
.locked(clklocked));
wire wClk = clk100MHz;
`endif
wire nwReset = KEY[3];
wire wWrite, wRead;
wire [31:0] bWriteAddr, bWriteData, bReadAddr, bReadData, bReadDataRam, bReadDataKey;
wire [3:0] bWriteMask;
assign bReadDataKey = {18'b0, KEY, SW};
reg readcmd;
reg [31:0] readaddr;
wire wRead_out = readcmd;
wire [31:0] bReadAddr_out = readaddr;
always @(posedge wClk) begin
if (!nwReset) begin
readcmd <= 1'b0;
readaddr <= 32'b0;
end else begin
readcmd <= wRead;
readaddr <= bReadAddr;
end
end
assign bReadData =
((bReadAddr_out & 32'hffffff00) == 32'hf0000000) ? bReadDataKey : (
((bReadAddr_out & 32'hffffc000) == 32'h00000000) ? bReadDataRam : (0)
);
wire [10:0] ramaddr;
assign ramaddr = wWrite?bWriteAddr[12:2]:bReadAddr[12:2];
wire [4:0] regno;
wire [3:0] regena;
wire [31:0] regwrdata;
wire regwren;
wire [31:0] regrddata;
regfile regs(regno, regena, wClk, regwrdata, regwren, regrddata);
ram8kb ram(ramaddr, ~bWriteMask, wClk, bWriteData,
((bWriteAddr & 32'hffffc000) == 0)?wWrite:1'b0, bReadDataRam);
riscv_core core(wClk, nwReset, wWrite, bWriteAddr, bWriteData, bWriteMask,
wRead, bReadAddr, bReadData,
regno, regena, regwrdata, regwren, regrddata);
reg [6:0] led0;
reg [6:0] led1;
reg [6:0] led2;
reg [6:0] led3;
reg [6:0] led4;
reg [6:0] led5;
assign HEX0 = ~led0;
assign HEX1 = ~led1;
assign HEX2 = ~led2;
assign HEX3 = ~led3;
assign HEX4 = ~led4;
assign HEX5 = ~led5;
always @(posedge wClk) begin
if (!nwReset) begin
led0 <= 8'h3f;
led1 <= 8'h3f;
led2 <= 8'h3f;
led3 <= 8'h3f;
led4 <= 8'h3f;
led5 <= 8'h3f;
end else begin
if (SW[8]) begin
led0 <= 8'h06;
led1 <= 8'h06;
led2 <= 8'h06;
led3 <= 8'h07;
led4 <= 8'h07;
led5 <= 8'h07;
end
else if (SW[9]) begin
led0 <= 8'h3f;
led1 <= 8'h06;
led2 <= 8'h5b;
led3 <= 8'h4f;
led4 <= 8'h66;
led5 <= 8'h6d;
end
else if (wWrite && ((bWriteAddr & 32'hffffff00) == 32'hf0000000)) begin
if (bWriteAddr[7:0] == 8'h10) begin
led0 <= bWriteData[6:0];
led1 <= bWriteData[14:8];
led2 <= bWriteData[22:16];
led3 <= bWriteData[30:24];
end else if (bWriteAddr[7:0] == 8'h14) begin
led4 <= bWriteData[6:0];
led5 <= bWriteData[14:8];
end
end
end
end
endmodule
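The address decoding done by this top-level module can be summarized in a few lines of C. This is only a sketch of the decode logic as written above; the names `bus_device`, `decode_addr`, `ram_word_index` and `key_word` are invented for illustration. Note that the RAM window mask selects 16 KB while the RAM holds 8 KB, so with this decode addresses 0x2000 to 0x3fff alias onto the same words.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DEV_NONE, DEV_RAM, DEV_IO } bus_device;

/* Which device answers a given byte address, using the same masks
   as the top-level module. */
bus_device decode_addr(uint32_t addr)
{
    if ((addr & 0xffffff00u) == 0xf0000000u)
        return DEV_IO;                 /* keys, switches, 7-segment displays */
    if ((addr & 0xffffc000u) == 0x00000000u)
        return DEV_RAM;
    return DEV_NONE;
}

/* Word index sent to the RAM: address bits [12:2]. */
unsigned ram_word_index(uint32_t addr)
{
    assert(decode_addr(addr) == DEV_RAM);
    return (unsigned)((addr >> 2) & 0x7ffu);
}

/* Value read from 0xF0000000: {18'b0, KEY[3:0], SW[9:0]}. */
uint32_t key_word(unsigned key, unsigned sw)
{
    return ((uint32_t)(key & 0xfu) << 10) | (uint32_t)(sw & 0x3ffu);
}
```

Since SW occupies the low ten bits of the word read from 0xF0000000, the tests `key & 1`, `key & 2` and `key & 4` in the counter program look at SW[0], SW[1] and SW[2].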
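15.2.3 Internal state machine of the RISC-V CPU
To accommodate moving the memory and the registers outside the core, we change the RISC-V CPU implementation from one instruction per cycle to several cycles per instruction, controlled internally by a state machine. Since instructions differ in nature, different instructions are allowed to take a different number of cycles and to pass through different states. The internal states of the RISC-V CPU are: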
enum riscv_core_state {
RISCVSTATE_INIT_REGX1,
RISCVSTATE_INIT_REGX2,
RISCVSTATE_READ_INST,
RISCVSTATE_READ_RS1,
RISCVSTATE_READ_RS2,
RISCVSTATE_STORE_RS2,
RISCVSTATE_EXEC_INST,
RISCVSTATE_WRITE_RD,
RISCVSTATE_WAIT_LD,
RISCVSTATE_WAIT_ST,
RISCVSTATE_WAIT_DIV,
};
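To get rid of the parts implemented in C, we added two register initialization states that initialize x1 (entry address) and x2 (memory size). The states are:
- RISCVSTATE_INIT_REGX1: the state after system reset. It uses the register write interface to set register x1 to the program entry address (0x8c), then moves to RISCVSTATE_INIT_REGX2.
- RISCVSTATE_INIT_REGX2: uses the register write interface to set register x2 (the stack pointer in the standard calling convention) to the memory size minus 16 bytes (2048 * 4 - 16), then moves to RISCVSTATE_READ_INST.
- RISCVSTATE_READ_INST: sends the PC value to the RAM read interface, starting a RAM read cycle that returns the instruction, then moves to RISCVSTATE_READ_RS1.
- RISCVSTATE_READ_RS1: stores the instruction just read into the register instr and at the same time extracts the rs1 number and sends it to the register file read port, then moves to RISCVSTATE_READ_RS2.
- RISCVSTATE_READ_RS2: stores the register value just read into the register rs1, then extracts the rs2 number from instr and sends it to the register file read port, then moves to RISCVSTATE_STORE_RS2.
- RISCVSTATE_STORE_RS2: stores the register value just read into the register rs2, then moves to RISCVSTATE_EXEC_INST.
- RISCVSTATE_EXEC_INST: executes the instruction; the next state depends on the instruction type. For a jump or branch it sets the new PC. For an alu/alui instruction it sets the write-back register and value and moves to RISCVSTATE_WRITE_RD. For the more complex DIV/MOD instructions it sets the number of wait cycles and moves to RISCVSTATE_WAIT_DIV. For a LOAD it issues a memory read request and moves to RISCVSTATE_WAIT_LD. For a STORE it asserts the RAM write signal and moves to RISCVSTATE_WAIT_ST.
- RISCVSTATE_WRITE_RD: starts a register write cycle using the write-back register number and value set earlier, then moves to RISCVSTATE_READ_INST.
- RISCVSTATE_WAIT_LD: places the value returned by the RAM read into the write-back register, then moves to RISCVSTATE_WRITE_RD.
- RISCVSTATE_WAIT_ST: moves to RISCVSTATE_READ_INST.
- RISCVSTATE_WAIT_DIV: decrements the counter that waits for the DIV result; when the counter reaches zero it places the result in the write-back register and moves to RISCVSTATE_WRITE_RD.
Several states in this machine are redundant: with this implementation an instruction needs at least 6 cycles to complete, which is wasteful. However, describing it this way keeps the implementation simple, easy to read, and approachable for beginners. Before using this core in earnest, the state machine should be optimized.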
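The cycle counts implied by this state machine can be checked with a small stand-alone C sketch. It is not the HDL4SE model; the names `core_state`, `instr_kind`, `next_state` and `cycles_per_instr` are invented here, the instruction classes are simplified to four kinds, and the DIV wait counter is loaded with 11 as in the code shown in this chapter.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ST_INIT_REGX1, ST_INIT_REGX2, ST_READ_INST, ST_READ_RS1, ST_READ_RS2,
    ST_STORE_RS2, ST_EXEC_INST, ST_WRITE_RD, ST_WAIT_LD, ST_WAIT_ST,
    ST_WAIT_DIV
} core_state;

typedef enum { KIND_ALU, KIND_LOAD, KIND_STORE, KIND_DIV } instr_kind;

/* Next state after one clock edge; divclk is the DIV wait counter. */
core_state next_state(core_state s, instr_kind kind, unsigned *divclk)
{
    assert(divclk != NULL);
    switch (s) {
    case ST_INIT_REGX1: return ST_INIT_REGX2;
    case ST_INIT_REGX2: return ST_READ_INST;
    case ST_READ_INST:  return ST_READ_RS1;
    case ST_READ_RS1:   return ST_READ_RS2;
    case ST_READ_RS2:   return ST_STORE_RS2;
    case ST_STORE_RS2:  return ST_EXEC_INST;
    case ST_EXEC_INST:
        if (kind == KIND_LOAD)  return ST_WAIT_LD;
        if (kind == KIND_STORE) return ST_WAIT_ST;
        if (kind == KIND_DIV) { *divclk = 11; return ST_WAIT_DIV; }
        return ST_WRITE_RD;
    case ST_WRITE_RD:   return ST_READ_INST;
    case ST_WAIT_LD:    return ST_WRITE_RD;
    case ST_WAIT_ST:    return ST_READ_INST;
    case ST_WAIT_DIV:
        if (*divclk == 0) return ST_WRITE_RD;
        (*divclk)--;
        return ST_WAIT_DIV;
    }
    return ST_INIT_REGX1;
}

/* Clock cycles from one instruction fetch to the next. */
unsigned cycles_per_instr(instr_kind kind)
{
    core_state s = ST_READ_INST;
    unsigned divclk = 0, cycles = 0;
    do {
        s = next_state(s, kind, &divclk);
        cycles++;
    } while (s != ST_READ_INST);
    return cycles;
}
```

Under these assumptions an ALU instruction or a store takes 6 cycles, a load 7, and a divide 18, which shows where an optimized state machine could save time, for example by skipping the rs2 read for instructions that do not use it.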
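15.2.4 Reworking the model functions
The RISC-V CPU implementation in the previous section used calls between C functions, which is also not straightforward to express in RTL. We therefore rework every function: the work done in the clktick and setup functions is split up into register update functions, each register is bound to its own update function, and calls between functions are removed. The reworked functions are small and each corresponds to one wire or register; in principle a wire and a register are never generated in the same function. This meets the requirements of RTL. Here are a few typical functions.
The state transition function: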
DEFINE_FUNC(riscv_core_gen_state, "state, instr, nwReset") {
if (vget(nwReset) == 0) {
vput(state, RISCVSTATE_INIT_REGX1);
}
else {
int state = vget(state);
switch (state) {
case RISCVSTATE_INIT_REGX1: {
vput(state, RISCVSTATE_INIT_REGX2);
}break;
case RISCVSTATE_INIT_REGX2: {
vput(state, RISCVSTATE_READ_INST);
}break;
case RISCVSTATE_READ_INST: {
vput(state, RISCVSTATE_READ_RS1);
}break;
case RISCVSTATE_READ_RS1: {
vput(state, RISCVSTATE_READ_RS2);
}break;
case RISCVSTATE_READ_RS2: {
vput(state, RISCVSTATE_STORE_RS2);
}break;
case RISCVSTATE_STORE_RS2: {
vput(state, RISCVSTATE_EXEC_INST);
}break;
case RISCVSTATE_WRITE_RD: {
vput(state, RISCVSTATE_READ_INST);
}break;
case RISCVSTATE_EXEC_INST: {
unsigned int instr = vget(instr);
unsigned int opcode = instr & 0x7f;
unsigned int func3 = (instr >> 12) & 0x7; /* funct3 field, bits [14:12] */
opcode >>= 2;
if (opcode == 0x00)
vput(state, RISCVSTATE_WAIT_LD);
else if (opcode == 0x08)
vput(state, RISCVSTATE_WAIT_ST);
else if (opcode == 0x0c && (instr & (1 << 25)) && (func3 & 4)) {
vput(state, RISCVSTATE_WAIT_DIV);
vput(divclk, 11);
}
else
vput(state, RISCVSTATE_WRITE_RD);
}break;
case RISCVSTATE_WAIT_LD: {
vput(state, RISCVSTATE_WRITE_RD);
}break;
case RISCVSTATE_WAIT_ST: {
vput(state, RISCVSTATE_READ_INST);
}break;
case RISCVSTATE_WAIT_DIV: {
if (vget(divclk) == 0)
vput(state, RISCVSTATE_WRITE_RD);
else
vput(divclk, vget(divclk) - 1);
}break;
}
}
} END_DEFINE_FUNC
We decode the immediate (imm) from instr during the READ_RS2 cycle:
DEFINE_FUNC(riscv_core_gen_imm, "instr, state") {
if (vget(state) == RISCVSTATE_READ_RS2) {
unsigned int instr;
unsigned int opcode;
instr = vget(instr);
opcode = instr & 0x7f;
opcode >>= 2;
switch (opcode) {
case 0x0d: {
vput(imm, instr & 0xfffff000);
}break;
case 0x05: {
vput(imm, instr & 0xfffff000);
}break;
case 0x1b: {
unsigned int imm;
imm = (instr & (1 << 20)) ? (1 << 11) : 0;
imm |= (instr >> 20) & 0x7fe;
imm |= instr & 0xff000;
imm |= instr & (1 << 31) ? 0x100000 : 0;
imm = sign_expand(imm, 20);
vput(imm, imm);
}break;
case 0x19: {
unsigned int imm;
imm = instr >> 20;
imm = sign_expand(imm, 11);
vput(imm, imm);
}break;
case 0x18: {
unsigned int imm;
unsigned int immh;
unsigned int immd;
immh = instr >> 25;
immd = (instr >> 7) & 0x1f;
imm = immd & 0x1e;
imm |= (immh & 0x3f) << 5;
imm |= (immd & 1) << 11;
imm |= (immh & 0x40) ? (1 << 12) : 0;
imm = sign_expand(imm, 12);
vput(imm, imm);
}break;
case 0x00: {
unsigned int imm;
imm = instr >> 20;
imm = sign_expand(imm, 11);
vput(imm, imm);
}break;
case 0x08: {
unsigned int imm;
imm = ((instr >> 20) & 0xfe0) | ((instr >> 7) & 0x1f);
imm = sign_expand(imm, 11);
vput(imm, imm);
}break;
case 0x04: {
unsigned int imm;
imm = instr >> 20;
imm = sign_expand(imm, 11);
vput(imm, imm);
}break;
}
}
} END_DEFINE_FUNC
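After this rework the HDL4SE model contains few C-style constructs and meets the requirements of RTL, so converting it to Verilog follows naturally.
15.3 From the HDL4SE model to Verilog
Rewriting the interfaces and functions of the RTL-compliant HDL4SE model one by one in Verilog is fairly simple. For convenience, the multiply and divide ALU instructions use Altera IP. Multiplication uses the DSP blocks of the Altera FPGA. Division is generated directly with the IP generator. Because division is complex, a single-cycle configuration reaches a clock frequency of only 10 MHz, so we use a multi-cycle pipeline: 3 stages reach 25 MHz and 4 stages reach 34 MHz. Since division is used rarely and it does not matter if it is slow, we simply configure a 12-stage pipeline (at the cost of more registers), after which the divider is no longer the clock frequency bottleneck. The synthesized design as a whole reaches 85 MHz in the worst case, and at room temperature the counter application still runs correctly at 100 MHz. The complete Verilog implementation of the RISC-V core follows: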
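The immediate decoding above relies on a helper `sign_expand` whose definition is not shown. The following stand-alone C sketch gives one possible definition, assuming from the calls above that the second argument is the index of the sign bit, together with plain C versions of the I, S, B, J and U immediate decoders following the RISC-V instruction formats. The `imm_*` function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend `value`, treating bit `signbit` as the sign bit.
   Assumed semantics of the sign_expand helper used in the model. */
uint32_t sign_expand(uint32_t value, unsigned signbit)
{
    uint32_t m;
    assert(signbit < 32);
    m = (uint32_t)1 << signbit;
    value &= (m << 1) - 1u;          /* keep bits [signbit:0] only */
    return (value ^ m) - m;
}

/* I-type: imm[11:0] = instr[31:20] (loads, jalr, alui). */
uint32_t imm_i(uint32_t instr)
{
    return sign_expand(instr >> 20, 11);
}

/* S-type: imm[11:5] = instr[31:25], imm[4:0] = instr[11:7]. */
uint32_t imm_s(uint32_t instr)
{
    return sign_expand(((instr >> 20) & 0xfe0u) | ((instr >> 7) & 0x1fu), 11);
}

/* B-type: imm[12|10:5] = instr[31|30:25], imm[4:1|11] = instr[11:8|7]. */
uint32_t imm_b(uint32_t instr)
{
    uint32_t imm = ((instr >> 19) & 0x1000u)   /* instr[31]    -> imm[12]   */
                 | ((instr << 4)  & 0x0800u)   /* instr[7]     -> imm[11]   */
                 | ((instr >> 20) & 0x07e0u)   /* instr[30:25] -> imm[10:5] */
                 | ((instr >> 7)  & 0x001eu);  /* instr[11:8]  -> imm[4:1]  */
    return sign_expand(imm, 12);
}

/* J-type: imm[20|10:1|11|19:12] = instr[31|30:21|20|19:12]. */
uint32_t imm_j(uint32_t instr)
{
    uint32_t imm = ((instr >> 11) & 0x100000u) /* instr[31]    -> imm[20]    */
                 | (instr         & 0x0ff000u) /* instr[19:12] -> imm[19:12] */
                 | ((instr >> 9)  & 0x000800u) /* instr[20]    -> imm[11]    */
                 | ((instr >> 20) & 0x0007feu);/* instr[30:21] -> imm[10:1]  */
    return sign_expand(imm, 20);
}

/* U-type: imm[31:12] = instr[31:12], low 12 bits zero (lui, auipc). */
uint32_t imm_u(uint32_t instr)
{
    return instr & 0xfffff000u;
}
```

For example, `addi x1, x0, -1` is encoded as 0xfff00093 and `imm_i` returns 0xffffffff for it, while `beq x0, x0, -4` is encoded as 0xfe000ee3 and `imm_b` returns 0xfffffffc.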
`define RISCVSTATE_INIT_REGX1 0
`define RISCVSTATE_INIT_REGX2 1
`define RISCVSTATE_READ_INST 2
`define RISCVSTATE_READ_RS1 3
`define RISCVSTATE_READ_RS2 4
`define RISCVSTATE_STORE_RS2 5
`define RISCVSTATE_WRITE_RD 6
`define RISCVSTATE_EXEC_INST 7
`define RISCVSTATE_WAIT_LD 8
`define RISCVSTATE_WAIT_ST 9
`define RISCVSTATE_WAIT_DIV 10
`define RAMSIZE 2048
(*
HDL4SE="LCOM",
CLSID="638E8BC3-B0E0-41DC-9EDD-D35A39FD8051",
softmodule="hdl4se"
*)
module riscv_core(
input wClk, nwReset,
output wWrite,
output [31:0] bWriteAddr,
output [31:0] bWriteData,
output [3:0] bWriteMask,
output reg wRead,
output reg [31:0] bReadAddr,
input [31:0] bReadData,
output reg [4:0] regno,
output reg [3:0] regena,
output reg [31:0] regwrdata,
output reg regwren,
input [31:0] regrddata
);
reg [31:0] pc;
reg [31:0] instr;
reg [31:0] rs1;
reg [31:0] rs2;
reg write;
reg [31:0] writeaddr;
reg [31:0] writedata;
reg [3:0] writemask;
reg [4:0] readreg;
reg [3:0] state;
reg [31:0] imm;
reg [4:0] dstreg;
reg [31:0] dstvalue;
reg [1:0] ldaddr;
reg [4:0] divclk;
assign wWrite = write;
assign bWriteAddr = writeaddr;
assign bWriteData = writedata;
assign bWriteMask = writemask;
wire [4:0] opcode = instr[6:2];
wire [4:0] rd = instr[11:7];
wire [2:0] func3 = instr[14:12];
reg cond;
wire signed [31:0] rs1_s = rs1;
wire signed [31:0] rs2_s = rs2;
wire signed [31:0] imm_s = imm;
wire [31:0] add_result;
wire [31:0] sub_result;
wire [63:0] mul_result;
wire [63:0] muls_result;
wire [71:0] mulsu_result;
wire [31:0] div_result_r, mod_result_r, divs_result_r, mods_result_r;
wire [31:0] div_result, mod_result, divs_result, mods_result;
adder add(rs1, rs2, add_result);
suber sub(rs1, rs2, sub_result);
mult mul(rs1, rs2, mul_result);
mult_s mul_s(rs1, rs2, muls_result);
mulsu mul_su(rs1, {8'b0, rs2}, mulsu_result);
div div(wClk, rs2, rs1, div_result_r, mod_result_r);
div_s divs(wClk, rs2, rs1, divs_result_r, mods_result_r);
assign div_result = (rs2 == 0) ? 32'hffffffff : div_result_r;
assign divs_result = (rs2 == 0) ? 32'hffffffff : divs_result_r;
assign mod_result = (rs2 == 0) ? rs1 : mod_result_r;
assign mods_result = (rs2 == 0) ? rs1 : mods_result_r;
always @(rs1 or rs2 or rs1_s or rs2_s or func3)
case(func3)
0: cond = rs1 == rs2;
1: cond = rs1 != rs2;
4: cond = rs1_s < rs2_s;
5: cond = rs1_s >= rs2_s;
6:cond = rs1 < rs2;
7:cond = rs1 >= rs2;
default: cond = 1'b0;
endcase
always @(posedge wClk)
if (!nwReset) begin
pc <= 32'h00000074;
end else begin
if (state == `RISCVSTATE_EXEC_INST) begin
case (opcode)
5'h1b: pc <= pc + imm;
5'h19: pc <= rs1 + imm;
5'h18: pc <= cond ? pc + imm : pc + 4;
default: pc <= pc + 4;
endcase
end
end
always @(posedge wClk)
if (state == `RISCVSTATE_READ_RS1)
instr <= bReadData;
always @(posedge wClk)
if (state == `RISCVSTATE_EXEC_INST)
if (opcode == 5'h00)
readreg <= rd;
always @(posedge wClk)
if (state == `RISCVSTATE_READ_RS2)
rs1 <= regrddata;
always @(posedge wClk)
if (state == `RISCVSTATE_STORE_RS2)
rs2 <= regrddata;
always @(posedge wClk)
if (!nwReset) begin
write <= 0;
end else if (state == `RISCVSTATE_EXEC_INST) begin
write <= 0;
if (opcode == 5'h08) begin
writeaddr <= rs1 + imm;
writemask <= 4'h0;
writedata <= rs2;
write <= 1'b1;
case (func3)
0: begin
case ((rs1 + imm) & 32'h3) // byte offset of the address being written, not the previous writeaddr
0: begin
writemask <= 4'he;
writedata <= rs2;
end
1: begin
writemask <= 4'hd;
writedata <= {rs2[23:0], 8'b0};
end
2: begin
writemask <= 4'hb;
writedata <= {rs2[15:0], 16'b0};
end
3: begin
writemask <= 4'h7;
writedata <= {rs2[7:0], 24'b0};
end
endcase
end
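1: begin
case ((rs1 + imm) & 32'h3) // byte offset of the address being written
0: begin
writemask <= 4'hc; // bytes 0 and 1 enabled (mask is active-low)
writedata <= rs2;
end
1: begin
writemask <= 4'h9; // bytes 1 and 2 enabled
writedata <= {rs2[23:0], 8'b0};
end
2: begin
writemask <= 4'h3; // bytes 2 and 3 enabled
writedata <= {rs2[15:0], 16'b0};
end
endcase
end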
endcase
end
end else begin
write <= 0;
end
always @(posedge wClk)
if (!nwReset) begin
state <= `RISCVSTATE_INIT_REGX1;
end else begin
case (state)
`RISCVSTATE_INIT_REGX1: state <= `RISCVSTATE_INIT_REGX2;
`RISCVSTATE_INIT_REGX2: state <= `RISCVSTATE_READ_INST;
`RISCVSTATE_READ_INST: state <= `RISCVSTATE_READ_RS1;
`RISCVSTATE_READ_RS1: state <= `RISCVSTATE_READ_RS2;
`RISCVSTATE_READ_RS2: state <= `RISCVSTATE_STORE_RS2;
`RISCVSTATE_STORE_RS2: state <= `RISCVSTATE_EXEC_INST;
`RISCVSTATE_WRITE_RD: state <= `RISCVSTATE_READ_INST;
`RISCVSTATE_EXEC_INST: begin
if (opcode == 5'h00)
state <= `RISCVSTATE_WAIT_LD;
else if (opcode == 5'h08)
state <= `RISCVSTATE_WAIT_ST;
else if (opcode == 5'h0c && instr[25] && func3[2]) begin
state <= `RISCVSTATE_WAIT_DIV;
divclk <= 11;
end else
state <= `RISCVSTATE_WRITE_RD;
end
`RISCVSTATE_WAIT_LD: state <= `RISCVSTATE_WRITE_RD;
`RISCVSTATE_WAIT_ST: state <= `RISCVSTATE_READ_INST;
`RISCVSTATE_WAIT_DIV: begin
if (divclk == 0)
state <= `RISCVSTATE_WRITE_RD;
else
divclk <= divclk - 1;
end
endcase
end
always @(posedge wClk)
if (state == `RISCVSTATE_READ_RS2) begin
case (opcode)
5'h0d: imm <= {instr[31:12], 12'b0};
5'h05: imm <= {instr[31:12], 12'b0};
5'h1b: imm <= {{12{instr[31]}}, instr[19:12], instr[20], instr[30:21], 1'b0};
5'h19: imm <= {{20{instr[31]}}, instr[31:20]};
5'h18: imm <= {{20{instr[31]}}, instr[7], instr[30:25], instr[11:8], 1'b0};
5'h00: imm <= {{20{instr[31]}}, instr[31:20]};
5'h08: imm <= {{20{instr[31]}}, instr[31:25], instr[11:7]};
5'h04: imm <= {{20{instr[31]}}, instr[31:20]};
endcase
end
always @(state or dstreg or dstvalue or bReadData or instr or regrddata or pc)
case (state)
`RISCVSTATE_READ_RS1: begin
regno = bReadData[19:15];
regwren = 0;
regena = 0;
regwrdata = 0;
end
`RISCVSTATE_READ_RS2: begin
regno = instr[24:20];
regwren = 0;
regena = 0;
regwrdata = 0;
end
`RISCVSTATE_WRITE_RD: begin
regwren = (dstreg != 0) ? 1 : 0;
regno = dstreg;
regena = 4'hf;
regwrdata = dstvalue;
end
`RISCVSTATE_INIT_REGX1: begin
regwren = 1;
regno = 1;
regena = 4'hf;
regwrdata = 32'h8c;
end
`RISCVSTATE_INIT_REGX2: begin
regwren = 1;
regno = 2;
regena = 4'hf;
regwrdata = `RAMSIZE * 4 - 16;
end
default: begin
regwren = 0;
regno = 0;
regena = 0;
regwrdata = 0;
end
endcase
always @(posedge wClk)
if (state == `RISCVSTATE_READ_INST) begin
ldaddr <= pc;
end else if (state == `RISCVSTATE_EXEC_INST) begin
if (opcode == 5'h00) begin
ldaddr <= rs1 + imm;
end
end
always @(posedge wClk)
case (state)
`RISCVSTATE_WAIT_LD: begin
dstreg <= readreg;
case (func3)
0: begin
case (ldaddr)
0: dstvalue <= {{24{bReadData[7]}}, bReadData[7:0]};
1: dstvalue <= {{24{bReadData[15]}}, bReadData[15:8]};
2: dstvalue <= {{24{bReadData[23]}}, bReadData[23:16]};
3: dstvalue <= {{24{bReadData[31]}}, bReadData[31:24]};
endcase
end
1: begin
case (ldaddr)
0: dstvalue <= {{16{bReadData[15]}}, bReadData[15:0]};
1: dstvalue <= {{16{bReadData[23]}}, bReadData[23:8]};
2: dstvalue <= {{16{bReadData[31]}}, bReadData[31:16]};
3: dstvalue <= 32'hdeadbeef;
endcase
end
2: dstvalue <= bReadData;
4: begin
case (ldaddr)
0: dstvalue <= {24'b0, bReadData[7:0]};
1: dstvalue <= {24'b0, bReadData[15:8]};
2: dstvalue <= {24'b0, bReadData[23:16]};
3: dstvalue <= {24'b0, bReadData[31:24]};
endcase
end
5: begin
case (ldaddr)
0: dstvalue <= {16'b0, bReadData[15:0]};
1: dstvalue <= {16'b0, bReadData[23:8]};
2: dstvalue <= {16'b0, bReadData[31:16]};
3: dstvalue <= 32'hdeadbeef;
endcase
end
endcase
end
`RISCVSTATE_WAIT_DIV: if (divclk == 0) begin
dstreg <= 0;
case (func3[1:0])
0: begin
dstreg <= rd;
if (rs2 == 0)
dstvalue <= 32'hffffffff;
else
dstvalue <= divs_result;
end
1: begin
dstreg <= rd;
if (rs2 == 0)
dstvalue <= 32'hffffffff;
else
dstvalue <= div_result;
end
2: begin
dstreg <= rd;
if (rs2 == 0)
dstvalue <= rs1;
else
dstvalue <= mods_result;
end
3: begin
dstreg <= rd;
if (rs2 == 0)
dstvalue <= rs1;
else
dstvalue <= mod_result;
end
endcase
end
`RISCVSTATE_EXEC_INST: begin
dstreg <= rd;
case (opcode)
5'h0d: begin
dstvalue <= imm;
end
5'h05: begin
dstvalue <= imm + pc;
end
5'h1b: begin
dstvalue <= pc + 4;
end
5'h19: begin
dstvalue <= pc + 4;
end
5'h04: begin
case (func3)
0: dstvalue <= rs1 + imm;
1: dstvalue <= rs1 << imm[4:0];
2: dstvalue <= (rs1_s < imm_s) ? 1 : 0;
3:dstvalue <= (rs1 < imm) ? 1 : 0;
4: dstvalue <= rs1 ^ imm;
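5: begin
// srai needs the arithmetic shift operator >>> on a signed operand;
// >> is a logical shift even when the operand is signed
if (instr[30])
dstvalue <= rs1_s >>> imm[4:0];
else
dstvalue <= rs1 >> imm[4:0];
end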
6: dstvalue <= rs1 | imm;
7: dstvalue <= rs1 & imm;
default: begin dstreg <= 0; dstvalue<=0; end
endcase
end
5'h0c: begin
if (instr[25]) begin
case (func3)
0: begin
dstvalue <= muls_result[31:0];
end
1: begin
dstvalue <= muls_result[63:32];
end
2: begin
dstvalue <= mulsu_result[63:32];
end
3: begin
dstvalue <= mul_result[63:32];
end
default: begin
dstreg <= 0;
dstvalue <= 0;
end
endcase
end else begin
case (func3)
0: begin
if (instr[30])
dstvalue <= sub_result;
else
dstvalue <= add_result;
end
1: begin
dstvalue <= rs1 << rs2[4:0];
end
2: begin
dstvalue <= (rs1_s < rs2_s) ? 1 : 0;
end
3: begin
dstvalue <= (rs1 < rs2) ? 1 : 0;
end
4: begin
dstvalue <= rs1 ^ rs2;
end
5: begin
// instr[30] set selects sra (arithmetic), clear selects srl (logical)
if (instr[30])
dstvalue <= rs1_s >>> rs2[4:0];
else
dstvalue <= rs1 >> rs2[4:0];
end
6: begin
dstvalue <= rs1 | rs2;
end
7: begin
dstvalue <= rs1 & rs2;
end
endcase
end
end
default: begin
dstreg <= 0;
dstvalue <= 0;
end
endcase
end
endcase
always @(state or pc or opcode or imm or rs1) begin
wRead = 0;
bReadAddr = 0;
if (state == `RISCVSTATE_READ_INST) begin
wRead = 1;
bReadAddr = pc;
end else if (state == `RISCVSTATE_EXEC_INST) begin
if (opcode == 5'h00) begin
bReadAddr = rs1 + imm;
wRead = 1;
end
end
end
endmodule
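In total this is only a little over 500 lines of Verilog, so it should be easy to read.
15.4 Conclusion
The result of synthesizing the RISC-V CPU core together with the DE1-SoC top-level module is as follows:
[Figure: synthesis resource usage report]
Register usage is somewhat high, mainly because the two dividers (signed and unsigned) are each implemented as a 12-stage pipeline. Logic utilization is 8% in total. Ten DSP blocks are used. Memory is mainly the 64 Kbit main RAM and the 1 Kbit register file. This is acceptable, and with some modification the core could be put to practical use. Below is the timing report: the worst case reaches 85 MHz, and after synthesizing for a 100 MHz clock and downloading to the FPGA board, no problems were observed in operation.
[Figure: timing report]
In later chapters we will make some changes to make the core more practical.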
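The divide-by-zero handling in the core (quotient of all ones, remainder equal to the dividend) follows the RISC-V M extension, which also defines the result of signed overflow. As a reference for checking the hardware, the following C sketch spells out those semantics; the function names are invented for illustration. Whether the Altera divider IP returns the specified values in the overflow case is not stated in the text and would need to be checked separately.

```c
#include <assert.h>
#include <stdint.h>

/* DIV: signed division with the RISC-V special cases. */
uint32_t rv_div(uint32_t rs1, uint32_t rs2)
{
    int32_t a = (int32_t)rs1, b = (int32_t)rs2;
    if (b == 0)
        return 0xffffffffu;              /* divide by zero: all ones (-1) */
    if (a == INT32_MIN && b == -1)
        return (uint32_t)INT32_MIN;      /* overflow: result is the dividend */
    assert(b != 0);                      /* guaranteed by the check above */
    return (uint32_t)(a / b);            /* C division truncates toward zero */
}

/* DIVU: unsigned division. */
uint32_t rv_divu(uint32_t rs1, uint32_t rs2)
{
    return (rs2 == 0) ? 0xffffffffu : rs1 / rs2;
}

/* REM: signed remainder, sign follows the dividend. */
uint32_t rv_rem(uint32_t rs1, uint32_t rs2)
{
    int32_t a = (int32_t)rs1, b = (int32_t)rs2;
    if (b == 0)
        return rs1;                      /* divide by zero: the dividend */
    if (a == INT32_MIN && b == -1)
        return 0u;                       /* overflow: remainder is zero */
    return (uint32_t)(a % b);
}

/* REMU: unsigned remainder. */
uint32_t rv_remu(uint32_t rs1, uint32_t rs2)
{
    return (rs2 == 0) ? rs1 : rs1 % rs2;
}
```

These functions can serve as a golden model when comparing simulation results of the core against expected register values.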
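[See also]
01. HDL4SE: Learning Verilog for Software Engineers (14)
02. HDL4SE: Learning Verilog for Software Engineers (13)
03. HDL4SE: Learning Verilog for Software Engineers (12)
04. HDL4SE: Learning Verilog for Software Engineers (11)
05. HDL4SE: Learning Verilog for Software Engineers (10)
06. HDL4SE: Learning Verilog for Software Engineers (9)
07. HDL4SE: Learning Verilog for Software Engineers (8)
08. HDL4SE: Learning Verilog for Software Engineers (7)
09. HDL4SE: Learning Verilog for Software Engineers (6)
10. HDL4SE: Learning Verilog for Software Engineers (5)
11. HDL4SE: Learning Verilog for Software Engineers (4)
12. HDL4SE: Learning Verilog for Software Engineers (3)
13. HDL4SE: Learning Verilog for Software Engineers (2)
14. HDL4SE: Learning Verilog for Software Engineers (1)
15. LCOM: Lightweight Component Object Model
16. LCOM: Interfaces with data
17. Tool download: bison 3.7 and flex 2.6.4 for 64-bit Windows
18. git: verilog-parser open source project
19. git: HDL4SE project
20. git: LCOM project
21. git: GLFW project
22. git: SystemC project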