Assignment 4
| ** Note: this assignment is finished.** |
|---|
In this assignment you will upgrade our single-cycle implementation to a scalar MIPS simulator with a constant latency for instruction execution.
All requirements remain the same as in the previous tasks.
You should create a new branch task_4 and create four new files:
perf_sim/perf_sim.h
perf_sim/perf_sim.cpp
perf_sim/main.cpp
perf_sim/Makefile
| Hint: We suggest that you start by copying the func_sim.h and func_sim.cpp files |
|---|
In the previous task you implemented a single-cycle MIPS simulator. However, all actions were already encapsulated into 5 stages:
- Fetch
- Decode, read sources
- Execute, calculate address
- Memory access
- Writeback, PC update, information dump
Now we are going to encapsulate each stage in a module, with the modules connected by ports.
For simplicity, all ports and modules will be stored in one class, PerfMIPS.
We will use ports for two purposes:
- A data port transfers data from one stage to the next one
- A stall port signals to the previous stages that the pipeline is stalled
In this task, we don't use a complicated port topology, so use the constants PORT_BW, PORT_FANOUT and PORT_LATENCY everywhere. They must be defined as 1.
Data ports must have the following syntax:
class PerfMIPS {
    ReadPort</*Type*/>* rp_/*source_module*/_2_/*dest_module*/;
    WritePort</*Type*/>* wp_/*source_module*/_2_/*dest_module*/;
    // examples (pointers, because the ports are created with new)
    ReadPort<FuncInstr>* rp_decode_2_execute;
    ReadPort<FuncInstr>* rp_execute_2_memory;
    WritePort<uint32>* wp_fetch_2_decode;
    WritePort<FuncInstr>* wp_decode_2_execute;
};
and be initialized in the following way:
PerfMIPS::PerfMIPS() {
    // example: the read port takes the latency, the write port takes the bandwidth and fanout
    rp_decode_2_execute = new ReadPort<FuncInstr>("DECODE_2_EXECUTE", PORT_LATENCY);
    wp_decode_2_execute = new WritePort<FuncInstr>("DECODE_2_EXECUTE", PORT_BW, PORT_FANOUT);
}
Each pair of data ports has to transmit a FuncInstr object. The only exception is the fetch->decode port, which transmits a raw uint32.
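To make the effect of a latency of 1 concrete, here is a minimal toy model of one connected write/read port pair. `ToyChannel` is a hypothetical stand-in written for illustration only; it is not the ports implementation provided with the course, and it ignores bandwidth and fanout.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy stand-in for a connected WritePort/ReadPort pair (illustrative only).
template <typename T>
class ToyChannel {
    std::map<uint64_t, T> in_flight;  // arrival cycle -> value
    uint64_t latency;
public:
    explicit ToyChannel(uint64_t lat) : latency(lat) {}

    // A value written at `cycle` becomes readable at `cycle + latency`.
    void write(const T& value, uint64_t cycle) {
        in_flight[cycle + latency] = value;
    }

    // Returns true and fills *out only if a value arrives exactly at `cycle`.
    bool read(T* out, uint64_t cycle) {
        typename std::map<uint64_t, T>::iterator it = in_flight.find(cycle);
        if (it == in_flight.end())
            return false;
        *out = it->second;
        in_flight.erase(it);
        return true;
    }
};
```

With a latency of 1, data written by one stage at cycle 5 is seen by the next stage at cycle 6, which is what lets all stages be clocked within the same loop iteration.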
A stall port is used to stop the previous stages when the current instruction cannot pass this stage and the stage has to be restarted.
These ports transmit a single bit of data, represented as a bool.
ReadPort<bool>* rp_decode_2_fetch_stall;
WritePort<bool>* wp_decode_2_fetch_stall;
rp_decode_2_fetch_stall = new ReadPort<bool>("DECODE_2_FETCH_STALL", /**/);
For consistency, we recommend naming the modules as follows:
- fetch
- decode
- execute
- memory
- writeback
Each module consists of the following objects:
- read port from the previous stage (*)
- write port to the next stage (**)
- stall read port from the next stage (**)
- stall write port to the previous stage (*)
- internal value on the latch: a FuncInstr object or data bytes (*)
- void clock_module(int cycle) function (where module is one of the names above)

(*) Is not needed on the fetch module.
(**) Is not needed on the writeback module.
void clock_module( int cycle) {
    bool is_stall = false;
    /* If the next module tells us to stall, we stop
       and send a stall signal to the previous module */
    rp_next_2_me_stall->read( &is_stall, cycle);
    if ( is_stall) {
        wp_me_2_previous_stall->write( true, cycle);
        return;
    }
    /* If nothing comes from the previous stage, the
       execute, memory and writeback modules have to jump out here
       (read() returns false when no data has arrived) */
    if ( !rp_previous_2_me->read( &module_data, cycle))
        return;
    /* But the decode stage doesn't jump out:
       it takes the non-updated bytes from module_data
       and re-decodes them */
    // rp_previous_2_me->read( &module_data, cycle);
    // Here we process data.
    if (...) {
        /* This branch is chosen if everything is OK and
           the instruction may proceed to the next pipeline stage */
        wp_me_2_next->write( module_data, cycle);
    }
    else {
        // Otherwise, nothing is done and we have to stall the pipeline
        wp_me_2_previous_stall->write( true, cycle);
    }
}
| Note: The decode stage's behavior is slightly different from that of the other modules; pay attention to the code options |
|---|
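The special behavior of the decode stage can be sketched in isolation. The following toy model assumes a hypothetical `ToyDecode` struct with no real ports: new bytes are passed in directly, and the readiness of the sources is given as a flag. It only illustrates how the latch keeps the old bytes while the stage is stalled.

```cpp
#include <cassert>
#include <cstdint>

// Toy decode stage (illustrative names, not part of PerfMIPS).
struct ToyDecode {
    uint32_t latch = 0;     // raw instruction bytes kept between cycles
    bool has_data = false;  // false means the stage holds a bubble

    // incoming: newly fetched bytes, or nullptr if nothing arrived this cycle.
    // sources_ready: whether the decoded instruction could read its sources.
    // Returns true if the instruction was passed on; sets *stall to true
    // if the previous stage must be stalled.
    bool clock(const uint32_t* incoming, bool sources_ready, bool* stall) {
        *stall = false;
        if (incoming != nullptr) {
            latch = *incoming;
            has_data = true;
        }
        if (!has_data)
            return false;      // nothing to decode
        if (!sources_ready) {
            *stall = true;     // keep the latch and re-decode on the next cycle
            return false;
        }
        has_data = false;      // the instruction leaves the stage
        return true;
    }
};
```

In this sketch the stage does not return early when nothing arrives, because the bytes left on the latch from a stalled cycle still have to be decoded again.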
In this assignment we assume that every instruction is executed in 1 cycle, so the only possible stalls are caused by data dependency and control dependency.
| Note: We DO NOT model "long" instructions, load/store misses in this task. Every instruction that reaches execution unit, leaves it on the next cycle! |
|---|
Our goal is to stall an instruction if its sources are not ready.
This can be checked with the following extension of the RF: each register is extended with 1 validity bit.
For an instruction's destination register, this bit is set to false at the decode stage and set back to true at the writeback stage.
Subsequent instructions must check the bits of their sources. An instruction can continue execution if and only if all of them are true; otherwise it is stalled.
| Note: Because the $zero register is never overwritten, its validity bit is always true! |
|---|
The code changes should look like this:
class RF {
    struct Reg {
        uint32 value;
        bool is_valid;
        Reg() : value(0), is_valid(true) { }
    } array[REG_MAX_NUM];
public:
    uint32 read( Reg_Num);
    bool check( Reg_Num num) const { return array[(size_t)num].is_valid; }
    void invalidate( Reg_Num num) { array[(size_t)num].is_valid = false; }
    void write ( Reg_Num num, uint32 val) {
        // ...
        assert( array[(size_t)num].is_valid == false);
        array[(size_t)num].is_valid = true;
    }
};
| Note: We ARE NOT going to model out-of-order execution, superscalar CPUs etc. Please do not invent "scalable" solutions; a working scalar MIPS will be more than enough |
|---|
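The way the decode stage might use check and invalidate can be sketched as follows. `ToyRF`, `ToyInstr` and `try_issue` are hypothetical names for this illustration, registers are plain indices, and treating register 0 as $zero inside invalidate and write is an assumption of the sketch rather than a requirement of the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy register file with validity bits (illustrative, not the course's RF).
struct ToyRF {
    static const std::size_t REG_COUNT = 32;
    uint32_t value[REG_COUNT];
    bool valid[REG_COUNT];

    ToyRF() {
        for (std::size_t i = 0; i < REG_COUNT; ++i) {
            value[i] = 0;
            valid[i] = true;
        }
    }

    bool check(std::size_t r) const { return valid[r]; }

    // Register 0 plays the role of $zero and always stays valid.
    void invalidate(std::size_t r) { if (r != 0) valid[r] = false; }

    void write(std::size_t r, uint32_t v) {
        if (r == 0)
            return;            // $zero is never overwritten
        assert(!valid[r]);     // must have been invalidated at decode
        value[r] = v;
        valid[r] = true;
    }
};

struct ToyInstr { std::size_t src1, src2, dst; };

// Returns true if the instruction may leave decode, false if it must stall.
bool try_issue(ToyRF& rf, const ToyInstr& instr) {
    if (!rf.check(instr.src1) || !rf.check(instr.src2))
        return false;          // a source is still being produced
    rf.invalidate(instr.dst);  // the destination is now pending
    return true;
}
```

An instruction that reads a register still pending from an earlier instruction is refused by try_issue until the writeback of the earlier instruction calls write.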
A control dependency can be represented as a data dependency via the PC register.
You have to add a validity bit for the PC register that is set to false by jumps and branches. They must be detected with the FuncInstr::is_jump() const method.
However, this bit has to be checked not at the decode stage but at the fetch stage.
| Note: Non-branch instructions must advance the PC by 4 at the decode stage so that fetching of the next instructions can continue! |
|---|
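A minimal sketch of the PC validity bit is shown below. All names (`ToyPC`, `can_fetch`, `on_decode`, `on_jump_resolved`) are hypothetical, and the point at which the bit becomes true again is an assumption: by analogy with the register file, it is taken to be the moment the new PC is written.

```cpp
#include <cassert>
#include <cstdint>

// Toy PC register with a validity bit (illustrative only).
struct ToyPC {
    uint32_t value;
    bool is_valid;
    ToyPC() : value(0), is_valid(true) {}
};

// The fetch stage may only proceed while the PC is valid.
bool can_fetch(const ToyPC& pc) { return pc.is_valid; }

// Decode stage: jumps and branches invalidate the PC,
// every other instruction advances it by 4.
void on_decode(ToyPC& pc, bool is_jump) {
    if (is_jump)
        pc.is_valid = false;
    else
        pc.value += 4;
}

// Assumed to be called once the jump target is known and the PC is updated.
void on_jump_resolved(ToyPC& pc, uint32_t target) {
    assert(!pc.is_valid);
    pc.value = target;
    pc.is_valid = true;
}
```

While the bit is false, the fetch stage has nothing to do, which produces the bubbles that follow a jump in the pipeline.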
Sometimes it is very useful to see what happens inside the machine. One of the simplest ways is per-stage output: the simulator shows the instruction being processed at each stage.
At each stage, the instruction disassembly (if any) and its result (if any) should be printed to std::cout in a way similar to the functional simulator, but preceded by the stage name and the current clock number, separated by a "\t" character, like this:
fetch cycle 5: 0x43adcb90
decode cycle 5: ori $t2, $t1, 0xAA00
execute cycle 5: add $t1, $t2, $t3
memory cycle 5: bubble
You are free to add IPC/CPI counter output at the end of the simulation.
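As a rough sketch, the per-stage prefix and an IPC counter could be produced by two small helpers. The function names are hypothetical, and the exact placement of the tab characters is an assumption, since the sample output above does not show them explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Builds "<stage>\tcycle <N>:\t<text>" (tab placement is an assumption).
std::string stage_line(const std::string& stage, uint64_t cycle,
                       const std::string& text) {
    std::ostringstream oss;
    oss << stage << "\tcycle " << cycle << ":\t" << text;
    return oss.str();
}

// Instructions per cycle; returns 0 if no cycles have been simulated.
double ipc(uint64_t executed_instrs, uint64_t cycles) {
    if (cycles == 0)
        return 0.0;
    return static_cast<double>(executed_instrs) / static_cast<double>(cycles);
}
```

CPI is simply the reciprocal of IPC, so one pair of counters (executed instructions and elapsed cycles) is enough for both.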
In silent output mode, the output must be identical to FuncSim's output, i.e. it doesn't contain cycle prefixes, IPC counters etc.; only the writeback stage produces output.
As in the functional simulator, run has 2 parameters:
- const std::string& tr with the file system path to the trace to execute
- int instrs_to_run with the number of instructions to be performed
and one extra parameter:
- bool silent (see above)
The code inside must be very simple:
void PerfMIPS::run(...) {
    // .. init
    executed_instrs = 0; // this variable is stored inside the PerfMIPS class
    cycle = 0;
    // stop as soon as exactly instrs_to_run instructions have been written back
    while (executed_instrs < instrs_to_run) {
        clock_fetch(cycle);
        clock_decode(cycle);
        clock_execute(cycle);
        clock_memory(cycle);
        clock_writeback(cycle); // each instruction writeback increases the executed_instrs variable
        ++cycle;
    }
    // ..
}
| Question: Can the calls of clock_fetch and clock_decode be swapped? What about clock_writeback and clock_fetch? |
|---|
The entry point has to be very similar to FuncSim's, but you have to support a -d option that disables "silent output mode".
As you have probably guessed, silent mode is required to quickly compare FuncSim and PerfSim outputs.
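One possible way to handle the -d option is sketched below. The `Options` struct, the `parse_args` function and the assumed command line `perf_sim <trace> <instructions> [-d]` are illustrative choices, not a prescribed interface.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Parsed command line (hypothetical layout).
struct Options {
    std::string trace;
    int instrs_to_run;
    bool silent;   // true unless -d is given
    bool ok;       // false if the arguments are malformed
    Options() : instrs_to_run(0), silent(true), ok(false) {}
};

// Accepts "<trace> <instructions>" with an optional "-d" in any position.
Options parse_args(int argc, const char* const argv[]) {
    Options opts;
    int positional = 0;
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "-d") == 0) {
            opts.silent = false;
        } else if (positional == 0) {
            opts.trace = argv[i];
            ++positional;
        } else if (positional == 1) {
            opts.instrs_to_run = std::atoi(argv[i]);
            ++positional;
        } else {
            return opts;   // too many arguments, ok stays false
        }
    }
    opts.ok = (positional == 2);
    return opts;
}
```

The resulting silent flag can then be passed straight to run as its third parameter.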
The best way to validate the performance simulator is to compare its output to the functional simulator's.
We provide a script, ./run_and_compare.sh, that builds and launches func_sim and perf_sim.
Its syntax is similar to that of the simulators:
./run_and_compare.sh <test name> <instructions amount>
If everything is correct, the script will print "Tests passed" to the screen and finish. Otherwise, vim will be started, showing the differences between the FuncSim and PerfSim traces.
| Note: Please look inside that script and try to understand its behavior. Bash scripts can be very useful in the development process. |
|---|
MIPT-V / MIPT-MIPS — Cycle-accurate pre-silicon simulation.