Wishbone burst access #693
Comments
Hi there @stokdam! Regarding Wishbone registered feedback mode versus pipelined mode: I'd propose we look into pipelined. According to ZipCPU, it seems to be simple but efficient. The only problem is that an error on the bus can no longer (safely) be assigned to an individual read or write, but only to the whole transaction that was in flight. This means that, for example, a burst of 16 bytes that triggers a slave error after 4 bytes may have to be restarted from byte 0. Some more comments/ideas:
@stokdam, do you by any chance have experience with pipelined Wishbone? |
Hi @NikLeberg!
I do not understand this point. Aren't individual accesses independent of each other with the pipelined approach? The slave just assumes the master always wants to perform an incrementing burst (and stalls the pipeline if that is not actually the case), but it can raise an err/ack for each transferred word.
Unfortunately, I only discovered Wishbone through this project; I have always worked with AMBA AHB/APB buses. EDIT:
A non-pipelined slave can interface with a pipelined master by properly asserting the stall signal (Paragraph 5.2 in WB specs) |
I did check the HDL and found that almost all slaves on the internal bus seem to only ever have a latency of 1 from
Masters that could profit from a burst access are:
@stnolting can you validate/check my findings? To support pipelined bus transactions the internal bus does not necessarily need to be implemented as a wishbone bus. But at least two more signals would need to be added. Something similar to the wishbone |
You don't need to insert any extra cycle; the concept of pipelining here means that you can assert a new request before the previous one has been acknowledged. However, the slave is allowed to answer in the clock cycle just after the one in which STB is asserted.
Mmm, I don't like the possibility of pushing an undefined number of transactions before getting any acknowledge. I think WB does not forbid this since it wants to be as flexible as possible. AHB, for example, provides the HREADY signal that functions as an acknowledge as well as a stall request, thus allowing a maximum of one "pending" transaction. I think it would be overkill to support multiple pending transactions on the master side, and also not so useful since it would not guarantee better throughput. This means that, on the master side, the ACK_I signal also acts as a kind of stall request. There are two possibilities:
To implement the latter, the STB signal must be generated combinatorially from ACK_I; however, it would allow easier bridging with AMBA buses |
Hey everyone! Burst accesses would be a great thing to have! My current evaluation setup uses an external DDR RAM, which is horribly slow when using single accesses. Hence, I am using a cache IP block between the core's Wishbone interface and the actual SDRAM controller. Adding burst support to the NEORV32 itself has been on my to-do list for quite some time. But there are some difficulties here. As @NikLeberg has already correctly identified, there are several modules that could benefit from burst accesses:
To make things even more complicated, all the burst-requesting modules would need to be able to handle all kinds of responses, especially:
Adding a dedicated stall signal like the Wishbone protocol suggests is something I would like to avoid, as this can result in very long combinatorial feedback paths, making timing closure more complicated (imagine a large Wishbone network with lots of muxes and switches; the stall signal would have to be passed asynchronously through all elements).
I agree. Actually, I am reworking the internal bus protocol right now to make it more Wishbone-alike (replace neorv32/rtl/core/neorv32_package.vhd Lines 158 to 168 in 8779e85
I like the idea of adding a buffer to the Wishbone interface that converts bursts to single accesses and vice versa, but I am not sure if that would really help to increase performance, as the transfers between the Wishbone port and the caches would still be single accesses only. This is a complex topic and I am happy for any kind of help/feedback/ideas! 👍 Adding bursts to the system will help increase performance, especially when talking to external memory. |
That is true. I think the best option would be to let the bus slave assume a specific burst mode. Meaning a FIFO or register may assume that a burst has a constant address. A good old chunk of memory may assume incrementing access. If that assumption was not correct then the slave could stall the producer and correct its mode. Whatever that would mean for the slave...
I don't really see a particular problem with bursts that are not naturally aligned... Can't the master just issue a burst starting from the address it wants? It's up to the slave whether it can handle that or not. A slave can always just fall back to the single transfer pattern.
How about skid buffers? They provide the functionality of a stall signal but break up the combinatorial path. They basically register the stall signal and temporarily store the in-flight message that would otherwise get dropped due to the stall. Dan's implementation seems to have no added latency when the slave is not stalled. Having a stall-like signal is unavoidable, I think... 🤔
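To make the skid buffer idea concrete, here is a minimal behavioral sketch in Python. It is only an illustration of the general technique, not Dan's (ZipCPU's) HDL and not NEORV32 code; the names `SkidBuffer`, `ready_out`, `clock` and `run` are all made up for this example. The stall seen by the upstream side depends only on a registered flag, and a combinational bypass passes the input straight through while nothing is parked, which is why no latency is added when the downstream side is not stalling.

```python
class SkidBuffer:
    """Behavioral model: one skid register plus a combinational bypass."""

    def __init__(self):
        self.skid_valid = False  # True while a word is parked in the skid register
        self.skid_data = None

    @property
    def ready_out(self):
        # Stall seen by the upstream side. It depends only on registered
        # state, so the downstream stall never propagates combinationally.
        return not self.skid_valid

    def outputs(self, valid_in, data_in):
        # A parked word has priority; otherwise the input passes straight through.
        if self.skid_valid:
            return True, self.skid_data
        return valid_in, data_in

    def clock(self, valid_in, data_in, ready_in):
        """Advance one clock edge; return the word accepted downstream, or None."""
        valid_out, data_out = self.outputs(valid_in, data_in)
        accepted_upstream = valid_in and self.ready_out
        if accepted_upstream and not ready_in:
            # Downstream stalled while a word was in flight: park it.
            self.skid_valid, self.skid_data = True, data_in
        elif ready_in:
            # Downstream took the current output: the skid register is free.
            self.skid_valid = False
        return data_out if (valid_out and ready_in) else None


def run(words, downstream_ready):
    """Feed `words` through the buffer, one entry of `downstream_ready` per cycle."""
    buf, pending, delivered = SkidBuffer(), list(words), []
    for ready_in in downstream_ready:
        valid_in = bool(pending)
        data_in = pending[0] if pending else None
        took = valid_in and buf.ready_out  # sampled before the clock edge
        out = buf.clock(valid_in, data_in, ready_in)
        if took:
            pending.pop(0)
        if out is not None:
            delivered.append(out)
    return delivered
```

For example, `run([1, 2, 3], [True, False, True, True])` stalls the downstream side in the second cycle; word 2 gets parked and is delivered one cycle later, so no word is dropped and the order is preserved.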
I like it! To support bursts we basically just need to add a
I think this option would be more about not having to change the whole system at once. By putting the simpler components that do not profit from burst access that much behind a simple converter we could reduce the changes necessary. Please correct me @stnolting but if I'm not mistaken then the CPU waits for a write to
|
Stalling again... 😅 I really don't feel comfortable with that. All bus controllers would need to support stalling, which would require additional logic. Furthermore, what if a faulty device never releases the stall signal again? Is this something the bus monitor would need to keep track of?
Naturally aligned memory blocks are easier to handle in terms of hardware arbitration logic (just mask out log2(size_of_block) bits in the address). And I also think that a DDR controller would suffer from unaligned bursts, as several banks would need to be opened and closed. 🤔
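The masking trick can be sketched in a few lines of Python. This is a hedged illustration only: `block_base`, `is_naturally_aligned` and the 32-byte block size are assumptions for the example, and the block size is assumed to be a power of two.

```python
def block_base(addr: int, block_size: int) -> int:
    """Clear the low log2(block_size) address bits (block_size must be a power of two)."""
    return addr & ~(block_size - 1)


def is_naturally_aligned(addr: int, block_size: int) -> bool:
    """A block is naturally aligned if its low log2(block_size) address bits are zero."""
    return (addr & (block_size - 1)) == 0


# Assumed example: 32-byte blocks, so the low 5 address bits are masked out.
print(hex(block_base(0x1234, 32)))       # 0x1220
print(is_naturally_aligned(0x1220, 32))  # True
print(is_naturally_aligned(0x1234, 32))  # False
```

In hardware this is just dropping address wires, which is why aligned blocks are cheap to decode and arbitrate.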
I'll have a look at this. But how do they achieve zero delay on unstalled accesses? Is there a latch somewhere? I think a plain FIFO would be much more generic (and maybe even cleaner) if we want to relax the critical path for bus systems that support stalling.
🙈 😅
For the new bus protocol the
The important question is (again): which modules would benefit from burst accesses? The CPU requires (at least) two cycles for a memory access, whether it is an instruction fetch or a data access. If we just access internal memory, then there is no real gain from having bursts; even the caches do not bring any performance boost here. If we ignore the DMA for a moment, then only the XIP and the Wishbone interface (or external memory) would benefit from bursts/caches.
That's correct. By having only one memory request in flight the CPU always knows which access went rogue and can issue precise exceptions.
That's a good point I was not thinking about...
To be honest, I would rather keep the "safety" approach than optimize for maximum bus throughput. This safety thing has become something like a niche application of this core. 😉 |
I'm opening this issue since I have not found any dedicated thread on this topic.
The topic has been discussed here #573 (comment)
I'm going to summarize it in the following lines.
As stated by @stnolting in #573 (comment), bus accesses are currently the main performance bottleneck of the architecture. This is because a request is asserted only after the previous one has been acknowledged. Especially with high-latency memory, this behavior cripples performance. For example, SDRAMs have a certain CAS latency (CL), which is added to each bus transaction. However, most SDRAMs offer burst accesses, which means that you wait for CL only once for the entire burst (e.g. 8 words), and then you receive 1 word per clock cycle. This behavior fits well with caches, in particular with the instruction cache.
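A rough back-of-the-envelope model of this difference, as a Python sketch. It is deliberately simplified: it assumes a fixed latency per request and one data word per cycle, and it ignores row activation, refresh and bus handshake overhead. The CAS latency of 3 cycles is just an assumed example value.

```python
def single_access_cycles(words: int, latency: int) -> int:
    """Every word pays the full latency plus one cycle for its data transfer."""
    return words * (latency + 1)


def burst_access_cycles(words: int, latency: int) -> int:
    """The latency is paid once, then one word arrives per clock cycle."""
    return latency + words


cas_latency = 3  # assumed example value, in clock cycles
words = 8        # burst length from the example above

print(single_access_cycles(words, cas_latency))  # 32
print(burst_access_cycles(words, cas_latency))   # 11
```

Under these assumptions an 8-word cache line fill takes roughly a third of the cycles when fetched as a burst, and the gap widens as the latency grows.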
@NikLeberg, in #573 (reply in thread), proposed two ways in which burst accesses can be implemented:
The advantage of this approach is that wrapping is actually supported by SDRAM bursts. It can be useful to reduce first word access latency after a miss. For example, if address 0x5 is required, you start an 8-word burst access at address 0x5; after receiving address 0x7, the burst wraps around and sends you address 0x0, until 0x4 which is the last one. This allows the icache to always perform bursts that are aligned with cache blocks, while still minimizing miss penalty. The disadvantage is the necessity to implement the above-mentioned control signals.
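The wrapping order described above can be sketched as a small Python function. This is only an illustration: `wrapping_burst` is a hypothetical helper, addresses are word addresses, and the burst length is assumed to be a power of two.

```python
from typing import List


def wrapping_burst(start: int, length: int = 8) -> List[int]:
    """Word addresses of a wrapping burst (length must be a power of two)."""
    base = start & ~(length - 1)  # aligned start of the cache block
    # Only the low log2(length) bits increment; they wrap inside the block.
    return [base | ((start + i) & (length - 1)) for i in range(length)]


print([hex(a) for a in wrapping_burst(0x5)])
# ['0x5', '0x6', '0x7', '0x0', '0x1', '0x2', '0x3', '0x4']
```

The requested word arrives first, and the burst still covers exactly one aligned 8-word block.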
This approach may be easier to implement on the CPU side.