the two instructions fetched during the decode and compute stages of the first instruction (a branch) have to be discarded
this two-cycle delay is the “branch penalty”
to reduce the penalty, the branch target address must be computed earlier in the pipeline, in the decode stage
this reduces the penalty to one cycle
this needs hardware modification: the PC has to be incremented in every cycle, and a second adder is needed in the decode stage to compute the branch target address for every instruction
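The penalty arithmetic above can be sketched as follows. This is only an illustration under assumptions that are not in the notes: stages are numbered from fetch = 1 (so decode = 2 and compute = 3), the pipeline has five stages, there are no other stalls, and the penalty is paid once per taken branch. The function names and the instruction counts in the usage note are hypothetical.

```python
def branch_penalty(resolve_stage):
    """Cycles lost per taken branch.

    Assumes stages are numbered from fetch = 1; every stage the branch
    passes through before it is resolved lets one more wrong instruction
    be fetched behind it.
    """
    return resolve_stage - 1


def total_cycles(n_instructions, n_taken_branches, resolve_stage, n_stages=5):
    """Cycle count for a simple pipeline (assumed: 5 stages, no other stalls)."""
    fill = n_stages - 1  # cycles needed to fill the pipeline once
    return n_instructions + fill + n_taken_branches * branch_penalty(resolve_stage)
```

For a hypothetical run of 100 instructions with 20 taken branches, resolving in the compute stage gives total_cycles(100, 20, 3) = 144 cycles, while resolving in the decode stage gives total_cycles(100, 20, 2) = 124 cycles.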
the branch condition must be tested as early as possible
the comparator that tests the condition can be moved to the decode stage
it would use the values from register file outputs A and B directly
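A minimal sketch of what the decode stage would compute once the second adder and the comparator are in place. Several details are assumptions made for illustration and do not come from the notes: instructions are 4 bytes wide, the branch offset is relative to the incremented PC, and the condition is an equality test on A and B. The function name is hypothetical.

```python
WORD = 4  # assumed instruction size in bytes


def next_pc_in_decode(pc, offset, a, b, is_branch):
    """Select the next PC during the decode stage.

    pc is the address of the instruction being decoded; a and b are the
    values on register file outputs A and B.
    """
    incremented = pc + WORD        # first adder: PC is incremented every cycle
    target = incremented + offset  # second adder: computed for every instruction
    taken = is_branch and a == b   # comparator (assumed: branch-if-equal)
    return target if taken else incremented
```

The target is computed for every instruction, whether or not it is a branch, and is simply ignored when the instruction is not a taken branch.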
branch delay slot: the location that follows a branch instruction
the compiler tries to find an instruction that is always executed, independent of whether or not the program branches
data dependencies must be preserved
if the compiler can find a useful instruction, there’s no branch penalty
otherwise, it places a NOP in the delay slot and there’s a penalty of one cycle
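The delay-slot filling described above can be sketched as a small pass over one basic block. This is a simplified illustration, not how any particular compiler does it: it assumes register-only instructions (no memory dependencies), a block that ends in a single branch, and it only looks for candidates earlier in the same block. The names Instr, can_move_past and fill_delay_slot are hypothetical.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dest: Optional[str]  # register written, or None
    srcs: tuple          # registers read


NOP = Instr("nop", None, ())


def can_move_past(cand, later):
    """True if moving cand after all of `later` preserves data dependencies."""
    for ins in later:
        if cand.dest is not None and cand.dest in ins.srcs:
            return False  # a later instruction reads cand's result
        if ins.dest is not None and ins.dest in cand.srcs:
            return False  # a later instruction overwrites one of cand's inputs
        if cand.dest is not None and ins.dest == cand.dest:
            return False  # both write the same register
    return True


def fill_delay_slot(block):
    """block: list of Instr whose last element is the branch."""
    *body, branch = block
    # Try the instructions before the branch, nearest first.
    for i in range(len(body) - 1, -1, -1):
        if can_move_past(body[i], body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, body[i]]
    # Nothing useful found: the slot holds a NOP and costs one cycle.
    return body + [branch, NOP]
```

An instruction taken from before the branch is always executed regardless of the branch outcome, which is why it is a safe choice for the slot as long as the dependency checks pass.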
Static:
Dynamic: