Branch Prediction #1308
Replies: 5 comments
Wally is also intended for ASIC implementation, not just FPGA, and using a synchronous RAM makes the read path compatible with SRAM. Given the large disparity in bit density, we decided to optimize for area; the same trade-off exists for both the instruction and data caches. Fortunately, we work around the extra cycle of latency so that the BTB and the direction predictor output their prediction for the matching instruction in the Fetch stage. The next PC (PCNextF) is sent to the I$ and the branch predictor before the rising clock edge, so during the Fetch stage we get the prediction result for the corresponding instruction without any delay. In other words, we aren't sending PCF to the branch predictor.
Hello, thank you for the response. Does that mean this could be seen as a prefetch-style branch prediction mechanism?
It's not so much a prefetch as the address being set up the cycle before. It's functionally equivalent to an asynchronous flip-flop or LUTRAM read with PCF driving the read address port of the branch predictor. You could think of it as if the register between PCNextF and PCF were pushed into the branch predictor's SRAM.
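The equivalence described above can be sketched in a small Python timing model (a hypothetical illustration with made-up names and contents; the actual design is SystemVerilog in the cvw repository). It shows that a synchronous-read RAM whose address (PCNextF) is presented before the clock edge produces, each Fetch cycle, the same value as an asynchronous read addressed by the registered PCF:

```python
# Hypothetical timing model: compare a synchronous-read RAM addressed with
# next_pc (address set up before the clock edge) against an asynchronous
# read addressed with the registered pc (PCF). Same prediction, same cycle.

def run(trace):
    mem = {a: a ^ 0xFF for a in range(16)}  # arbitrary stand-in BTB contents
    sync_out = None   # output register of the synchronous-read RAM
    pc = None         # PCF: next_pc delayed by one clock edge
    results = []
    for next_pc in trace:          # next_pc plays the role of PCNextF
        # Rising clock edge: the sync RAM captures the address and reads;
        # simultaneously next_pc is registered into pc (PCF).
        sync_out = mem[next_pc]
        pc = next_pc
        # During the Fetch stage: an async read would use PCF directly.
        async_out = mem[pc]
        results.append((sync_out, async_out))
    return results

for s, a in run([3, 7, 7, 1, 0]):
    assert s == a   # identical prediction in every Fetch cycle
```

In effect, the address register that would normally sit between PCNextF and PCF is absorbed into the SRAM's synchronous read port, so no pipeline bubble is added.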
Thank you very much for the explanation; you have helped me a lot. Thanks for the great work, you are an awesome team!
You are welcome. Happy to help anytime.
I really like your Core Wally design as a reference for structuring an SoC.
However, I noticed that you are using synchronous RAM for branch prediction, which introduces a one-cycle delay in obtaining the prediction. Why didn't you use LUTRAM (distributed RAM) instead, which would allow for an immediate prediction in the fetch stage?
I haven't simulated your SoC, but I have analyzed the code. Does it really make sense to always lose one cycle for branch prediction in a 5-stage pipelined CPU? Or did I misinterpret the code?
I understand that with BRAM, you get more lines than with LUTRAM. Why did you decide to do it this way, if my assumption is correct?