
[RFC][DNM] DMM: Skeleton for a DMA Memory Management subsystem #57602

Closed

Conversation

carlocaione
Collaborator

@carlocaione carlocaione commented May 5, 2023

This is a hopefully-to-be-discussed proposal for an actual implementation of what was discussed in #57220 (please read that before reading further).

TL;DR: Zephyr is currently missing a way to allocate and prepare buffers to be used by DMA engines and DMA-enabled peripherals on platforms with a complex memory organization. This PR is a proposal for a subsystem interface dealing with that.

In memory bus architectures, the CPU running software is one of the masters
accessing the bus, while DMA engines are other masters accessing the same
(or another, connected) bus.

In basic MCUs there is a simple bus architecture, where CPU and DMA
engines can access the whole memory range, no caches are present, and
memory addressing is consistent between all the bus masters.

The memory access in these devices can be summarized with the following
rules:

 * CPU and DMA have access to the whole memory space
 * the memory address used by the CPU to point to any memory location A is
   equal to the memory address used by DMA to point to the same location A
 * any data stored in memory by DMA is immediately visible to the CPU
 * any data stored in memory by the CPU is immediately visible to DMA

In systems with a more complex memory architecture, the interaction
between the CPU and DMA, or between multiple CPUs, is more complicated:

 * DMA can be restricted to access only part of the system memory (or
   only part of the memory can be accessed efficiently by DMA)
 * CPU may need to translate its buffer address (virtual or physical) to
   an address (or a set of addresses if the buffer is not contiguous in
   the bus memory space) usable by DMA before passing it to a DMA engine
   (to take into account cache line alignment, memory region allocation,
   etc...)
 * CPU caches can contain stale data and the CPU may need to invalidate
   them to read the data that DMA or another CPU updated in memory
 * Data intended by a CPU to be stored in memory can get stuck in the
   CPU's write-back cache and be invisible to DMA or other CPUs, so the
   CPU has to flush its caches before informing DMA or other CPUs about
   data availability in memory

All of the discussed challenges must be addressed by the software
running in a system with a complex memory architecture.

In this patch I'm proposing to add a new sub-system called DMM - DMA
Memory Management - with the following responsibilities (a driver-side
usage sketch follows the list):

 * allocating and freeing "bounce buffers" if the buffer provided by a
   driver user cannot be used by the DMA engine (memory accessible by
   DMA, aligned to DMA requirements, if cacheable: aligned and padded to
   cache lines),
 * copying data to and from the bounce buffers
 * translation of CPU buffer addresses (virtual or physical) to
   DMA-usable addresses
 * cache management if buffers are allocated in cacheable memory
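
A minimal, hedged sketch of the intended driver-side flow (the exact
signatures, the flags and the type of the DMA handle are still open
questions in this PR, so everything below - including the
start_dma_transfer()/wait_for_dma_done() placeholders - is an assumption,
not the final API):

static int xyz_drv_tx(const struct device *dev, const uint8_t *user_buf,
		      size_t len)
{
	void *dma_handle;
	int ret;

	/* May hand back the original address, a translated bus address, or a
	 * bounce buffer with the user data already copied in and the cache
	 * written back.
	 */
	ret = dmm_get_buffer_for_dma_out(dev, 0 /* flags, TBD */,
					 &dma_handle, user_buf, len);
	if (ret < 0) {
		return ret;
	}

	start_dma_transfer(dev, dma_handle, len);	/* placeholder, vendor-specific */
	wait_for_dma_done(dev);				/* placeholder */

	/* Must be called once the transfer has finished, otherwise a bounce
	 * buffer (if one was allocated) is leaked.
	 */
	return dmm_release_buffer_for_dma_out(dev, 0, dma_handle, user_buf, len);
}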

Signed-off-by: Carlo Caione <[email protected]>
@carlocaione carlocaione marked this pull request as ready for review May 5, 2023 13:01
@carlocaione carlocaione linked an issue May 5, 2023 that may be closed by this pull request
@carlocaione carlocaione self-assigned this May 5, 2023
@carlocaione carlocaione added RFC Request For Comments: want input from the community DNM This PR should not be merged (Do Not Merge) labels May 5, 2023
* - copy the data from the user-provided buffer to the bounce buffers
* - translate CPU buffer addresses to DMA addresses if needed
* - do the proper cache management if buffers are allocated in cacheable
* memory (for example flushing cache lines associated with the memory
Member

nit: flushing is not necessary

Suggested change
* memory (for example flushing cache lines associated with the memory
* memory (for example writing back cache lines associated with the memory

* memory (for example flushing cache lines associated with the memory
* backing the DMA accessible buffers)
*
* @param[in] dev Device pointer
Member

Which device is this? The device that is to perform the DMA transfer from the buffer pointed to by dma_handle?

Collaborator Author

Correct, that is the idea.

Member

Could you add this information somewhere to the docstring to make it clear?

Collaborator Author

I'll do.

* backing the DMA accessible buffers)
*
* @param[in] dev Device pointer
* @param[in] flags Flags
Member

Are these flags defined somewhere?

Collaborator Author

no, TBD.

Member

Could you clarify here with something like

Suggested change
* @param[in] flags Flags
* @param[in] flags Flags reserved for future use

* This function is used to translate the user-provided buffer into something
* that is usable for a DMA outbound data transfer and return that to the user
* as an opaque DMA handle (usually a physical/bus address).
*
Member

I would add a paragraph stating that the user cannot access the buffer pointed to by buffer or dma_handle from the time this function is called until dmm_release_buffer_for_dma_out() is called. Access includes reading from this buffer.

Also I would clarify that dmm_release_buffer_for_dma_out() must be called after the DMA transfer is finished. Not calling this function can lead to memory leaks.

Collaborator Author

ack

Comment on lines +79 to +80
* @param[in] dev Device pointer
* @param[in] flags Flags
Member

The same ambiguity as in the outbound transfer function

* This function is used to translate the user-provided buffer into something
* that is usable for a DMA inbound data transfer and return that to the user
* as an opaque DMA handle (usually a physical/bus address).
*
Member

I would add a paragraph stating that the user cannot access the buffer pointed to by buffer or dma_handle from the time this function is called until dmm_release_buffer_for_dma_in() is called.

Also I would clarify that dmm_release_buffer_for_dma_in() must be called after the DMA transfer is finished, before data is read by CPU.

Collaborator Author

ack

Collaborator

@teburd teburd left a comment

The inherent problem seems to be that someone wants to use a buffer that isn't aligned or in the right region for DMA. Shouldn't we make that easier to sort out statically rather than defining a new allocation interface for this task?

Can we use device tree to describe the requirements for a buffer with a particular peripheral + dma pair, I wonder? Something like...
DT_SHARED_BUF(dma0, spi0) uint8_t some_spi_buf[128];
Where all the correct compiler attributes are supplied for alignment, placement, and size.

@andyross
Contributor

andyross commented May 5, 2023

Can we use device tree to describe the requirement for a buffer

That idea I like very much. Puts a lot of preprocessor work onto the platform/controller layer, but cleanly extends to all sorts of nonsense like alternate memory spaces (even madness like iommu's or 24 bit ISA DMA!) that are otherwise really hard to abstract. And it falls back to obvious behavior for "it's just memory" systems that tend to be the common case.

@hubertmis
Member

hubertmis commented May 5, 2023

@teburd, do I understand correctly that you suggest moving the responsibility for managing memory buffers from a device driver to its user? If that's the case, let's consider the following application code:

struct spi_buf_set prepare_spi_tx_buffer(void) {
  static const struct spi_buf buf = {
    .buf = "Hello world",
    .len = sizeof("Hello world"),
  };

  struct spi_buf_set set = {
    .buffers = &buf,
    .count = 1,
  };

  return set;
}

If the SPI device driver takes care of copying the buf content to a bounce buffer accessible by the SPI's DMA, as @carlocaione suggested, then this code is portable. However, if the SPI API user were responsible for managing memory buffers, it would need to be modified to something like:

struct spi_buf_set prepare_spi_tx_buffer(void) {
  static struct spi_buf buf = {
    .len = sizeof("Hello world"),
  };

  buf.buf = allocate_spi_buf(sizeof("Hello world"));

  if (!buf.buf) {
    // Handle fault
  }

  memcpy(buf.buf, "Hello world", sizeof("Hello world"));

  cache_write_back(buf.buf, sizeof("Hello world"));

  buf.buf = translate_cpu_address_to_bus_address(buf.buf);

  struct spi_buf_set set = {
    .buffers = &buf,
    .count = 1,
  };

  return set;
}

void release_spi_tx_buffer(struct spi_buf_set set) {
  void *cpu_address = translate_bus_address_to_cpu_address(set.buffers[0].buf); // If such translation is possible
  free_spi_buf(cpu_address);
}

Portability is an important software attribute, especially in hardware agnostic projects like Zephyr. That's why I think the responsibility for managing memory buffers for DMA should be hidden from application code and implemented in device drivers.

Applications should have the possibility to improve performance by using buffer allocation methods that meet the buffers' memory management requirements, preventing device drivers from copying data to bounce buffers. But it should be an option, not a requirement, for an application.

@andyross
Contributor

andyross commented May 5, 2023

@hubertmis the problem with that kind of scheme is that once we decide on a dynamic allocation API we can't go back. This isn't Linux, and we don't get to assume we're running on devices with copious virtual memory available for stuff. Existing Zephyr paradigms work very hard (to be fair, occasionally too hard) to enable static allocation for everything an application might want.

@teburd's idea is very much in that vein: it's an API that allows an app to say "I want a buffer that will work for DMA with this device", and get it, without having to do much more than know that the device exists and call whatever acquire/release API we decide on around code that mutates it.

@hubertmis
Member

@teburd , could you draft your proposal? I would like to understand it better to be able to compare properties of both solutions

@carlocaione carlocaione requested a review from teburd May 11, 2023 07:56
@teburd
Collaborator

teburd commented May 11, 2023

@teburd , could you draft your proposal? I would like to understand it better to be able to compare properties of both solutions

I think looking at the i.MX RT685 is useful here, as it has many bus masters, including two DMAs, USB HS, SDHC, one Arm Cortex-M33 core, and one Xtensa HiFi4 core.

E.g., let's consider configuring dma0 with i2s0 (flexcomm) and now add some buffer requirements to a device tree node...

i2s0: &flexcomm1 {
    status = "okay";
    compatible = "nxp,lpc-i2s";
    #address-cells = <1>;
    #size-cells = <0>;
    dmas = <&dma0 2>;
    dma-names = "rx";
    buf-align = <4>; /* byte alignment */
    buf-size-align = <4>; /* byte size alignment, e.g. must be a multiple of X bytes in length */
    buf-region = <&sram0 4>; /* choose a bank of sram for placement, or maybe this uses a span of addresses in the sram, but usually it's banks in the data sheets */
    buf-cache = <cache-flush cache-invalidate>; /* What to do when shared with the device and from the device */
    pinctrl-0 = <&pinmux_flexcomm1_i2s>;
    pinctrl-names = "default";
};

Then build some macros

/**
 * @brief Shared buffer
 */
#define DT_SHARED_BUF(bus_ctlr, bus_periph) \
	__attribute__((aligned(DT_PROP_OR(bus_periph, buf_align, DT_PROP_OR(bus_ctlr, buf_align, WORD_ALIGN))), \
		       section(DT_SRAM_SECTION_OR(bus_periph, buf_region, DT_SRAM_SECTION_OR(bus_ctlr, buf_region, BUF_DEFAULT_SECTION)))))

/* Later on can be used, without requiring bounce buffers... */
DT_BUF(dma0, i2s0) uint16_t audio_rx_buf[128];

/* Configure receive with above described buffer */
dma_config(dma_dev, ....);

I'm not being very precise here, and I very much doubt any of this works... I hope the idea is clear here though.

Describe what the hardware demands of a buffer in device tree, use that to set up statically allocated buffers or block allocators. That's it. Caching could be a totally separate thing, or maybe it's considered an SRAM region thing (often you can configure uncached/cached regions). Or perhaps the buffer is placed in a struct container with some metadata about caching and state. Would be interesting to look at the SRAM bank properties, if provided in DT, as the source of truth on this.

No allocation of bounce buffers needed, and tests that utilize DMA no longer have to have ifdefs for part specifics about this sort of thing as they do today. Caching is understood by the peripheral/dma/cpu in some manner to flush and invalidate when most appropriate or by a new API like you've proposed here.

@teburd
Collaborator

teburd commented May 11, 2023

@hubertmis
Member

Thanks @teburd for these details. I like the idea of describing memory buffer requirements in the device tree and creating macros to easily create buffers meeting them.

It looks like the main open question in this discussion is about the device driver model. I don't think Zephyr's device driver model currently puts any requirements on the memory buffers passed to the drivers' functions by their users: users can pass any buffers allocated in any memory (writable or not, cacheable or not, aligned or not, dynamically or statically allocated, ...) and they expect the drivers to correctly perform the requested operations using the passed memory buffers.

It seems we are discussing several options on how to incorporate DMA requirements regarding used memory buffers in a unified way for all device drivers in Zephyr:

  1. Keep the device driver model as it is. Create responsibility for drivers of devices with DMA with specific requirements to meet these requirements.
  2. Modify the device driver model to require driver users to pass to the drivers only the memory buffers created with a designated (static or dynamic) allocator.
  3. Keep the device driver model as it is. Create an opportunity for applications to allocate buffers meeting DMA requirements. Create responsibility for device drivers to check if the passed buffer meets the requirements, if not the driver would be responsible for copying the buffer's content to a bounce buffer.

The proposed device tree properties and macros can be used in each of these solutions. The difference would be which layer in the system would be able and/or responsible to use the macros to allocate the buffers correctly.

Is this summary correct? Do you happen to see any other options I didn't list?

Once we agree the list of options is close to complete, we should start listing properties of these options that would help us find out which one is most aligned with Zephyr's needs.

@teburd
Collaborator

teburd commented May 12, 2023

Thanks @teburd for these details. I like the idea of describing memory buffer requirements in the device tree and creating macros to easily create buffers meeting them.

It looks like the main open question in this discussion is about the device driver model. I don't think Zephyr's device driver model currently puts any requirements on the memory buffers passed to the drivers' functions by their users: users can pass any buffers allocated in any memory (writable or not, cacheable or not, aligned or not, dynamically or statically allocated, ...) and they expect the drivers to correctly perform the requested operations using the passed memory buffers.

It seems we are discussing several options on how to incorporate DMA requirements regarding used memory buffers in a unified way for all device drivers in Zephyr:

1. Keep the device driver model as it is. Create responsibility for drivers of devices with DMA with specific requirements to meet these requirements.

I think this is mostly the right approach, the interfaces mostly remain the same, and with asserts enabled it should be considered a programming error to give an invalid buffer pointer.

2. Modify the device driver model to require driver users to pass to the drivers only the memory buffers created with a designated (static or dynamic) allocator.

This could be accomplished by supplying asserts and perhaps helper macros to validate buffer X
is aligned and placed in the right location. We have a great cache API now with the work of @carlocaione that drivers can use to flush/invalidate buffer passed in or out.

3. Keep the device driver model as it is. Create an opportunity for applications to allocate buffers meeting DMA requirements. Create responsibility for device drivers to check if the passed buffer meets the requirements, if not the driver would be responsible for copying the buffer's content to a bounce buffer.

A bounce buffer is the wrong approach in my humble opinion. It should be a programming error if this is needed.

The proposed device tree properties and macros can be used in each of these solutions. The difference would be which layer in the system would be able and/or responsible to use the macros to allocate the buffers correctly.

I would place the responsibility for correctly allocating buffers on the application. With helpers for setting up static buffers and block allocators. Assertions can validate that the buffer pointer is aligned and in the correct region.
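
Purely for illustration, a tiny sketch of the kind of assertion helper that
implies, assuming the alignment and region constraints end up as
devicetree-generated constants (DT_XYZ_BUF_ALIGN, DT_XYZ_MEM_START and
DT_XYZ_MEM_SIZE are made-up names):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants, assumed to be generated from devicetree. */
#define DT_XYZ_BUF_ALIGN  4U
#define DT_XYZ_MEM_START  0x20010000UL
#define DT_XYZ_MEM_SIZE   0x8000UL

static inline bool xyz_dma_buf_ok(const void *buf, size_t len)
{
	uintptr_t addr = (uintptr_t)buf;

	/* Aligned to the DMA requirement and fully inside the DMA-capable region. */
	return (addr % DT_XYZ_BUF_ALIGN) == 0 &&
	       addr >= DT_XYZ_MEM_START &&
	       (addr + len) <= (DT_XYZ_MEM_START + DT_XYZ_MEM_SIZE);
}

A driver could then do __ASSERT(xyz_dma_buf_ok(buf, len), "buffer not usable for DMA"); at the top of its transfer path when CONFIG_ASSERT=y, and the check compiles away otherwise.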

Is this summary correct? Do you happen to see any other options I didn't list?

Once we agree the list of options is close to complete, we should start listing properties of these options that would help us find out which one is most aligned with Zephyr's needs.

I think those options are right. I believe pretty strongly that memory should be managed by the application for Zephyr. Any sort of automatic memory management should be entirely opt-in and not the default. It's unclear why a bounce buffer would ever be needed in Zephyr if the buffers are correctly set up, and that should be a goal.

@hubertmis
Member

hubertmis commented May 15, 2023

@teburd, thanks for your comment.

User code with modified device driver model

I'm looking at selected Zephyr samples to assess changes needed if we decided to modify the device driver model so that the application is responsible for correct buffer allocation.

spi_bitbang

uint16_t buff[5] = { 0x0101, 0x00ff, 0x00a5, 0x0000, 0x0102};
int len = 5 * sizeof(buff[0]);
struct spi_buf tx_buf = { .buf = buff, .len = len };

Currently buff is used directly in the tx_buf structure. I understand with your recommendation this sample (and all other code using SPI driver API) would need to be modified to something like

 DT_BUF(dma0, dev) uint16_t buff[5] = { 0x0101, 0x00ff, 0x00a5, 0x0000, 0x0102}; 
 int len = 5 * sizeof(buff[0]); 
  
 struct spi_buf tx_buf = { .buf = buff, .len = len }; 

  dt_buf_acquire(buff); // To do cache management

I'm not sure if the DT_BUF() macro definition can somehow describe the padding of buff, so it might be necessary to use another macro:

DT_BUF(dma0, dev) uint16_t buff[DT_BUF_SIZE(dma0, dev, 5)] = { ... };

Using sizeof(buff) would not return the intended size of the buffer anymore. This change of the sizeof() return value could result in obscure bugs.

uart/echo_bot

This sample includes both RX and TX paths of UART communication.

RX

It's pretty easy because the following code

static char rx_buf[MSG_SIZE];

would need to be modified to

DT_BUF(dma0, uart_dev) static char rx_buf[DT_BUF_SIZE(dma0, uart_dev, MSG_SIZE)];

assuming these macros would work correctly for global static objects.
And some _acquire() / _release() calls for proper cache management.

TX

We can't use anymore just any char *buf to be printed like in

void print_uart(char *buf)
{
	int msg_len = strlen(buf);

	for (int i = 0; i < msg_len; i++) {
		uart_poll_out(uart_dev, buf[i]);
	}
}

To use this kind of API the sample code would need to allocate a bounce buffer meeting the UART TX requirements and use it here:

DT_BUF(dma1, uart_dev) static char tx_buf[DT_BUF_SIZE(dma1, uart_dev, 1)];

void print_uart(char *buf)
{
	int msg_len = strlen(buf);

	for (int i = 0; i < msg_len; i++) {
		tx_buf[0] = buf[i];
		dt_buf_tx_acquire(tx_buf);
		uart_poll_out(uart_dev, tx_buf);
		dt_buf_tx_release(tx_buf);
	}
}

This bounce buffer would be allocated by the application on any platform, even if the given UART does not have DMA, or includes a DMA capable of printing any application buffer. This potentially unnecessary bounce buffer makes usage of driver APIs harder and is wasteful.

blinky_pwm

Some hardware peripherals like PWM, even though it's a rare case, can also include DMA. To be consistent with the device driver model and potentially handle DMA here, the API user would need to pass period and pulse arguments to the pwm_set_dt() function in a buffer meeting DMA requirements. So we could again allocate a bounce buffer in the application to do that. However, it's not obvious how the period and pulse fields shall be placed in the buffer, because each hardware peripheral might require a different data structure.

uart/echo_bot again

Assume we have a UART peripheral that uses DMA to read data from RAM in the following structure:

Byte 1: the size of the transfer
Byte 2: the 1st byte to be transferred
Byte 3: the 2nd byte to be transferred
Byte 4: ...

The solution discussed above with DT_BUF(dma1, uart_dev) static char tx_buf[DT_BUF_SIZE(dma1, uart_dev, 1)]; would not work with a peripheral like this. The tx_buf buffer would miss byte 1 containing the size of the transfer. The device driver would anyway allocate another bounce buffer and copy data around to add byte 1 at the beginning.

Conclusion

It seems impossible to impose responsibility on an application to allocate correct buffers for each potential DMA usage in any system. It would require an application (or a subsystem) to structure data in a buffer in such a way that is required for its target platform, rendering the application non-portable.

This makes option 2 I described in #57602 (comment) invalid.

Do you agree?

@teburd
Collaborator

teburd commented May 15, 2023

Comments inline, but no, I don't agree, and I don't think we've fully explored this. I realized when re-reading this that the dma node probably isn't needed in the macros, only the device tree node representing the peripheral. That should alleviate portability concerns. That, and providing a define macro rather than macros to generate attributes.

@teburd, thanks for your comment.

User code with modified device driver model

I'm looking at selected Zephyr samples to assess changes needed if we decided to modify the device driver model so that the application is responsible for correct buffer allocation.

spi_bitbang

uint16_t buff[5] = { 0x0101, 0x00ff, 0x00a5, 0x0000, 0x0102};
int len = 5 * sizeof(buff[0]);
struct spi_buf tx_buf = { .buf = buff, .len = len };

Currently buff is used directly in the tx_buf structure. I understand with your recommendation this sample (and all other code using SPI driver API) would need to be modified to something like

 DT_BUF(dma0, dev) uint16_t buff[5] = { 0x0101, 0x00ff, 0x00a5, 0x0000, 0x0102}; 
 int len = 5 * sizeof(buff[0]); 
  
 struct spi_buf tx_buf = { .buf = buff, .len = len }; 

  dt_buf_acquire(buff); // To do cache management

I'm not sure if the DT_BUF() macro definition can somehow describe the padding of buff, so it might be necessary to use another macro:

DT_BUF(dma0, dev) uint16_t buff[DT_BUF_SIZE(dma0, dev, 5)] = { ... };

Using sizeof(buff) would not return the intended size of the buffer anymore. This change of the sizeof() return value could result in obscure bugs.

uart/echo_bot

This sample includes both RX and TX paths of UART communication.

RX

It's pretty easy because the following code

static char rx_buf[MSG_SIZE];

would need to be modified to

DT_BUF(dma0, uart_dev) static char rx_buf[DT_BUF_SIZE(dma0, uart_dev, MSG_SIZE)];

assuming these macros would work correctly for global static objects. And some _acquire() / _release() calls for proper cache management.

TX

We can't use anymore just any char *buf to be printed like in

void print_uart(char *buf)
{
	int msg_len = strlen(buf);

	for (int i = 0; i < msg_len; i++) {
		uart_poll_out(uart_dev, buf[i]);
	}
}

To use this kind of API the sample code would need to allocate a bounce buffer meeting the UART TX requirements and use it here:

DT_BUF(dma1, uart_dev) static char tx_buf[DT_BUF_SIZE(dma1, uart_dev, 1)];

void print_uart(char *buf)
{
	int msg_len = strlen(buf);

	for (int i = 0; i < msg_len; i++) {
		tx_buf[0] = buf[i];
		dt_buf_tx_acquire(tx_buf);
		uart_poll_out(uart_dev, tx_buf);
		dt_buf_tx_release(tx_buf);
	}
}

This bounce buffer would be allocated by the application on any platform, even if the given UART does not have DMA, or includes a DMA capable of printing any application buffer. This potentially unnecessary bounce buffer makes usage of driver APIs harder and is wasteful.

blinky_pwm

Some hardware peripherals like PWM, even though it's a rare case, can also include DMA. To be consistent with the device driver model and potentially handle DMA here, the API user would need to pass period and pulse arguments to the pwm_set_dt() function in a buffer meeting DMA requirements. So we could again allocate a bounce buffer in the application to do that. However, it's not obvious how the period and pulse fields shall be placed in the buffer, because each hardware peripheral might require a different data structure.

uart/echo_bot again

Assume we have a UART peripheral that uses DMA to read data from RAM in the following structure:

Byte 1: the size of the transfer Byte 2: the 1st byte to be transferred Byte 3: the 2nd byte to be transferred Byte 4: ...

The solution discussed above with DT_BUF(dma1, uart_dev) static char tx_buf[DT_BUF_SIZE(dma1, uart_dev, 1)]; would not work with a peripheral like this. The tx_buf buffer would miss byte 1 containing the size of the transfer. The device driver would anyway allocate another bounce buffer and copy data around to add byte 1 at the beginning.

Then change the macro

/**
 * Defines a buffer with a given symbolic name such that it is aligned, placed, sized, and
 * cached appropriately. The symbol buf_name may not be the actual start address of the buffer
 * if the device requires a prefixed space for metadata about the operation.
 */
#define DT_BUF_DEFINE(node, buf_name, buf_len) \
    DT_BUF_ATTRS(node) uint8_t _##buf_name[DT_BUF_LEN(node, buf_len)]; \
    uint8_t *buf_name = &_##buf_name[DT_BUF_OFFSET(node)]

Just like that, I have a macro appropriately created. How could we verify the buffer is correctly set up for that device, you ask? A marker (a well-understood sequence like a stack sentinel, e.g. 0xDEADBEAD) could be prefixed or suffixed and asserted on when CONFIG_ASSERT=y.
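
A rough sketch of that sentinel check (the helper name is invented here; the
real DT_BUF_DEFINE() would have to be the thing that actually places the
marker in front of the payload):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DT_BUF_SENTINEL 0xDEADBEADu

/* Assumes DT_BUF_DEFINE() stored the sentinel immediately before the payload. */
static inline bool dt_buf_sentinel_ok(const uint8_t *buf)
{
	uint32_t marker;

	memcpy(&marker, buf - sizeof(marker), sizeof(marker));
	return marker == DT_BUF_SENTINEL;
}

A driver would assert on dt_buf_sentinel_ok() only when CONFIG_ASSERT=y, so the check costs nothing in production builds.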

Conclusion

It seems impossible to impose responsibility on an application to allocate correct buffers for each potential DMA usage in any system. It would require an application (or a subsystem) to structure data in a buffer in such a way that is required for its target platform, rendering the application non-portable.

That's the whole point of applying macros and device tree to the problem. To make it portable.

This makes option 2 I described in #57602 (comment) invalid.

Do you agree?

No I do not agree, I don't think we've fully explored options to statically allocate buffers correctly.

@hubertmis
Member

I have a macro appropriately created

Thanks, indeed it would cover many more cases.

I'll reiterate some and discuss a few more examples to see if we have a common understanding of how this macro would change the Zephyr codebase.

blinky_pwm

I believe if a PWM peripheral uses DMA then the PWM API user could not create a buffer and fill it with period and pulse because the buffer structure is unknown to the user.

It would mean that the allocation of buffers by the API user would be available only for the device drivers that are streaming data (UART, SPI, I2C, USB, radios, ...), but for other drivers, it would be the driver's responsibility to allocate a bounce buffer and fill necessary data like period and pulse.

UART (again) with a dynamic payload offset in DMA buffer

Some peripherals have a dynamic structure of data consumed by DMA. One example would be UART which requires the following structure if the transfer is shorter than 255:

{length, byte 0, byte 1, byte 2, ...}

and following structure if the transfer is longer:

{0xff, length 0, length 1, length 2, length 3, payload 0, payload 1, ...}

Do you think we could somehow capture it in a macro, or would we need to treat such peripherals as exceptions and let device drivers allocate bounce buffers?

UART transferring constant data

Let's discuss an application that transfers constant byte-strings using UART. Currently, the application code would define these byte-strings as something like:

static const uint8_t str1[] = {0xde, 0xad, 0xbe, 0xef};
static const uint8_t str2[] = {0xca, 0xfe, 0xfa, 0xce};

Potentially placing data in NVM where some DMA have access, but others not necessarily.

With the new macro definition, I understand the code would need to be refactored to something like

DT_BUF_DEFINE(dev, str1_buf, sizeof(str1));
DT_BUF_DEFINE(dev, str2_buf, sizeof(str2));

static void copy_buf_content(void) {
  memcpy(str1_buf, str1, sizeof(str1));
  memcpy(str2_buf, str2, sizeof(str2));
}

which is functionally correct. However, it would cause unnecessary RAM buffer allocation and memcpy overhead for platforms where the UART can access NVM directly.

spi_flash

SPI flash sample uses SPI (potentially with DMA) indirectly: main -> drivers/flash -> drivers/spi.
I'm not sure which module should allocate SPI DMA buffers in this sample.

I don't think it should be done by main, because main is decoupled from the const struct device *spi - it is aware of const struct device *flash only. So it cannot define a buffer like DT_BUF_DEFINE(spi, spi_buf, size);.

An alternative would be to define a bounce buffer in drivers/flash and copy the payload passed by main, but I understand your goal is to avoid bounce buffers. Do you think we can somehow achieve it in a more complex DMA usage like this one?

@teburd
Collaborator

teburd commented May 16, 2023

It seems like there's some special metadata header that needs to exist inline with the data in the cases you've laid out. Fair enough, let's work through it.

I have a macro appropriately created

Thanks, indeed it would cover much more cases.

I'll reiterate some and discuss a few more examples to see if we have a common understanding of how this macro would change the Zephyr codebase.

blinky_pwm

I believe if a PWM peripheral uses DMA then the PWM API user could not create a buffer and fill it with period and pulse because the buffer structure is unknown to the user.

This seems like a pretty special case, and perhaps a specialized PWM set of buffer setup macros would be useful that understand the period and pulse. We have C and while that limits our ability to express some compile time things, I think this could be done with PWM specific macros. Is there some datasheet/ref manual or driver in the tree that does this today I could look at?

It would mean that the allocation of buffers by the API user would be available only for the device drivers that are streaming data (UART, SPI, I2C, USB, radios, ...), but for other drivers, it would be the driver's responsibility to allocate a bounce buffer and fill necessary data like period and pulse.

UART (again) with a dynamic payload offset in DMA buffer

Some peripherals have a dynamic structure of data consumed by DMA. One example would be UART which requires the following structure if the transfer is shorter than 255:

{length, byte 0, byte 1, byte 2, ...}

and following structure if the transfer is longer:

{0xff, length 0, length 1, length 2, length 3, payload 0, payload 1, ...}

Do you think we could somehow capture it in a macro, or would we need to treat such peripherals as exceptions and let device drivers allocate bounce buffers?

Why is a bounce buffer needed here vs multiple length prefixed transfers in a list? Heap allocations and memcpy aren't free. Is there some hardware datasheets or drivers I can look at?

UART transferring constant data

Let's discuss an application that transfers constant byte-strings using UART. Currently, the application code would define these byte-strings as something like:

static const uint8_t str1[] = {0xde, 0xad, 0xbe, 0xef};
static const uint8_t str2[] = {0xca, 0xfe, 0xfa, 0xce};

Potentially placing data in NVM where some DMA have access, but others not necessarily.

With the new macro definition, I understand the code would need to be refactored to something like

DT_BUF_DEFINE(dev, str1_buf, sizeof(str1));
DT_BUF_DEFINE(dev, str2_buf, sizeof(str2));

static void copy_buf_content(void) {
  memcpy(str1_buf, str1, sizeof(str1));
  memcpy(str2_buf, str2, sizeof(str2));
}

which is functionally correct. However, it would cause unnecessary RAM buffer allocation and memcpy overhead for platforms where the UART can access NVM directly.

That's true today too with bounce buffers, isn't it?

spi_flash

SPI flash sample uses SPI (potentially with DMA) indirectly: main -> drivers/flash -> drivers/spi. I'm not sure which module should allocate SPI DMA buffers in this sample.

I don't think it should be done by main, because main is decoupled from the const struct device *spi - it is aware of const struct device *flash only. So it cannot define a buffer like DT_BUF_DEFINE(spi, spi_buf, size);.

An alternative would be to define a bounce buffer in drivers/flash and copy the payload passed by main, but I understand your goal is to avoid bounce buffers. Do you think we can somehow achieve it in a more complex DMA usage like this one?

In this case wouldn't the flash device tree node define things appropriately for applications? I'm a little lost on this example in particular.

@hubertmis
Member

blinky_pwm

This seems like a pretty special case,

I would like to highlight here that every single peripheral in the system could be a special case like this. It is possible that any peripheral, including basic ones like PWM, GPIO, or timers, can use DMA internally to offload a CPU from accessing peripheral registers.

perhaps a specialized PWM set of buffer setup macros would be useful that understand the period and pulse

Do I understand correctly that it would require deprecating pwm_set_dt() and similar functions and replacing them with something like

DT_PWM_DATA(dev, name);

void set_pwm(uint32_t period, uint32_t pulse) {
  DT_PWM_SET_PERIOD(name, period);
  DT_PWM_SET_PULSE(name, pulse);
  pwm_set_apply(dev, name);
}

Solutions like this would cause an overhead I described in "UART transferring constant data" for every single device driver.

Is there some datasheet/ref manual or driver in the tree that does this today I could look at?

An example of PWM using DMA to load pulse values is here: https://infocenter.nordicsemi.com/topic/ps_nrf5340/pwm.html?cp=4_0_0_6_22
However, nRF53 has a basic memory hierarchy so no special handling of memory access is needed there (no cache: no need for alignment and padding and writing-back; common addressing across the system: no need for address translation; PWM's EasyDMA can access all memories in the system). If the memory hierarchy was more complicated, e.g. caches present, then it would be more difficult to handle and currently we are missing a systematic way in Zephyr to do that.

UART (again) with a dynamic payload offset in DMA buffer

Why is a bounce buffer needed here vs multiple length prefixed transfers in a list?

I'm not sure if a bounce buffer is needed. However, if the transfer size is determined at run-time, I cannot imagine a macro that could set the size and the offsets in the static buffer correctly. Maybe it would be possible with more complicated macros.

Is there some hardware datasheets or drivers I can look at?

I'm not aware of any publicly available datasheet or driver. Assuming an 802.15.4 radio's DMA expects PHR immediately followed by PSDU in RAM (like nRF5 radios), for 802.15.4 MSK PHY described in 17.1.1 of IEEE 802.15.4-2020, the length of PHR (encoding of the frame length) is one or two bytes long depending on the payload (PSDU) length. It's similar to the example I drafted in the previous comment.

UART transferring constant data

However, it would cause unnecessary RAM buffers allocation and memcpy overhead for platforms where UART can access NVM directly.

That's true today too with bounce buffers, isn't it?

Some SoCs require special handling of RAM buffers for DMAs, others do not. Currently, Zephyr works perfectly fine with devices that do not require any special handling. DMAs in devices like this can access static const uint8_t str1[] = {0xde, 0xad, 0xbe, 0xef}; buffers directly, and Zephyr's driver model allows them to do that.

The RFC created by @carlocaione and this PR aim to design a systematic way to support SoCs requiring special handling of DMA RAM buffers. To answer your question: Yes, in these more complicated devices the additional buffer overhead and memcpy is most likely required for both bounce buffers and application-managed buffers solutions.

But my point is that we should avoid creating this overhead for SoCs which work fine without it.

I'm curious if we could define a driver API that allows its user to avoid the discussed overhead in simple SoCs, but still statically allocate buffers to avoid bounce buffers in complex SoCs.
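
One way to picture that (purely illustrative, not a concrete API proposal):
the driver keeps its current API, checks the passed buffer against the
constraints generated from devicetree, and only pays for a bounce buffer on
the SoCs/buffers that actually need one. xyz_dma_buf_ok(), xyz_bounce_alloc()
and xyz_start_dma() are made-up names.

static int xyz_write(const struct device *dev, const uint8_t *buf, size_t len)
{
	const uint8_t *dma_src = buf;
	uint8_t *bounce = NULL;

	if (!xyz_dma_buf_ok(buf, len)) {
		/* Slow path: only reached on SoCs where this buffer cannot be
		 * used by DMA as-is. On simple SoCs this branch never runs.
		 */
		bounce = xyz_bounce_alloc(len);
		if (bounce == NULL) {
			return -ENOMEM;
		}
		memcpy(bounce, buf, len);
		dma_src = bounce;
	}

	/* The bounce buffer (if any) is released on transfer completion (not shown). */
	return xyz_start_dma(dev, dma_src, len);
}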

spi_flash

In this case wouldn't the flash device tree node define things appropriately for applications? I'm a little lost on this example in particular.

I'll try to show my concern in a slightly more detailed example.

Currently, the spi_flash sample contains code like:

const uint8_t expected[] = { 0x55, 0xaa, 0x66, 0x99 };
uint8_t buf[sizeof(expected)];
rc = flash_write(flash_dev, SPI_FLASH_TEST_REGION_OFFSET, expected, len);
rc = flash_read(flash_dev, SPI_FLASH_TEST_REGION_OFFSET, buf, len);

The expected and buf buffers are later translated by a chosen flash driver to something like data below (using spi_nor.c as an example):

        uint8_t buf[5] = { 0 };
	struct spi_buf spi_buf[2] = {
		{
			.buf = buf,
			.len = 1,
		},
		{
			.buf = data,
			.len = length
		}
	};

I understand the spi_nor.c code would need to be modified a little to use a static allocator like DT_BUF_DEFINE() for the buf buffer:

        DT_BUF_DEFINE(&driver_cfg->spi, buf, 5);
        memset(buf, 0, 5);

	struct spi_buf spi_buf[2] = {
		{
			.buf = buf,
			.len = 1,
		},
		{
			.buf = data,
			.len = length
		}
	};

However, it's not clear to me how the data buffer could be allocated with DT_BUF_DEFINE().

One option would be to allocate a buffer in the main.c:

const uint8_t expected[] = { 0x55, 0xaa, 0x66, 0x99 };
DT_BUF_DEFINE(/* how to get spi dev, if it's decoupled from flash API user? */, expected_buf, sizeof(expected));
memcpy(expected_buf, expected, sizeof(expected));

rc = flash_write(flash_dev, SPI_FLASH_TEST_REGION_OFFSET, expected_buf, len);

But to make it work we would need to increase coupling between software modules so that flash API user gets access to the bus device to which the flash module is connected. I don't think increasing coupling like this is a good idea.

Another solution would be to define a bounce buffer in spi_nor.c:

        DT_BUF_DEFINE(&driver_cfg->spi, bounce_buf, length);
        memcpy(bounce_buf, data, length);

But this solution relies on a bounce buffer, which is exactly what we are discussing how to avoid.

@carlocaione
Collaborator Author

My 2 cents.

In short, I think that the problem here is that there are cases where the application layer has no idea about the driver/hardware-level requirements for buffer allocation / overhead / metadata / etc...

Now, in many cases this information can indeed be obtained from DT, but there are cases where (for example the overhead, the internal buffer organization, etc...) the information is only known at run-time and not compile-time. For all these cases a static buffer allocation is probably not optimal (because we would end up introducing a different macro for each exception) or not achievable at all, so we must resort to a dynamic allocation of sorts (aka bounce buffers).

Now, I understand @andyross's and @teburd's concerns but I don't see an easy way out. If the buffer cannot be statically defined at the app layer, then it must be allocated by drivers and filled from a user buffer provided by the app (when needed, plus all the other needed DMA bits and pieces).

@teburd
Collaborator

teburd commented May 17, 2023

My 2 cents.

In short, I think that the problem here is that there are cases where the application layer has no idea about the driver/hardware-level requirements for buffer allocation / overhead / metadata / etc...

That's what should be solved, making that available to the application.

Now, in many cases this information can indeed be obtained from DT, but there are cases where (for example the overhead, the internal buffer organization, etc...) the information is only known at run-time and not compile-time. For all these cases a static buffer allocation is probably not optimal (because we would end up introducing a different macro for each exception) or not achievable at all, so we must resort to a dynamic allocation of sorts (aka bounce buffers).

All the cases in Zephyr I've seen thus far can be solved with a macro that does just that. Please help me understand where this is coming from.

Now, I understand @andyross's and @teburd's concerns but I don't see an easy way out. If the buffer cannot be statically defined at the app layer, then it must be allocated by drivers and filled from a user buffer provided by the app (when needed, plus all the other needed DMA bits and pieces).

To repeat what @andyross said, this isn't Linux. To have a setup where the default would be to malloc+memcpy silently on a peripheral API call just because the application author didn't have the tools to set up a buffer right is massively wasteful and I'll -1 it all day. If you have special hardware with special needs because it has some unique logic as described here, well, that to me seems like a vendor-specific problem, not a Zephyr problem. That should be solved in a vendor-specific way, not in a broad way that implies this is how Zephyr wishes to handle DMA buffers generally by default.

I'm happy to be wrong if examples can be found and given or some docs can be linked so I can better understand what problem is trying to be solved here. I haven't seen any good examples in the tree after some grepping around.

UART:

  • nrf has a memcpy from the source to a ram buffer, seemingly to avoid doing DMA from flash? Solvable with DT_BUF seemingly.

SPI:

  • esp32_spim mallocs and copies if the buffer isn't aligned. Solvable with DT_BUF, should not be done this way.
  • nrfx_spim copies if not in RAM. Solvable with DT_BUF

I2C:

  • nrfx_twim has a memcpy to seemingly work with easydma (which I believe does not have a traditional transfer list). Unsolvable in a general way. Unclear exactly to me the reasoning.

FLASH: (harder to grep for memcpy as many peripherals allow for mapping the flash into a bus address space...)

  • nrf_qspi does a memcpy to seemingly add a prefix. Solvable with DT_BUF potentially.
  • xmc4xxx does a memcpy to align. DT_BUF solvable
  • nios2 qspi does a memcpy to add a length prefix from a quick glance. Potentially DT_BUF solvable with a flash wrapper (as would nrf_qspi).

Happy to be shown something differently but I just don't see where this problem actually is. I do see the problem with setting up aligned and placed buffers. That is a well trod problem I think anyone who's written a test case with DMA involved can attest to. But a macro could solve it. I've linked to ones I know of first hand already.

@hubertmis
Member

malloc+memcpy silently on a peripheral API call just because the application author didn't have the tools to setup a buffer right is massively wasteful and I'll -1 it all day.

Agreed. But from the analysis I presented in previous comments, it seems forcing the application authors to use DT_BUF on the platforms in which such allocation is not necessary is also massively wasteful, so I could -1 it all day :)

I'm curious if we can modify any of these solutions or find a new one that is not wasteful in systems where DMA can use any buffers, and in systems where DMAs have special needs.

@teburd teburd requested review from hakehuang and erwango May 17, 2023 14:39
@andyross
Contributor

Driving by again: I don't have the DMA depth of expertise, but I'm still with @tomb here. This seems like a needlessly heavyweight solution for a problem that's well-constrained and solvable statically for almost everything we're talking about.

Note also that there's a natural abstraction layering here that's being violated: it's possible (if not trivially) to take an API designed for static allocation and indirect it at runtime with a dynamic buffer. It's not possible to magically make a runtime allocation into a static one in the general case (c.f. the undecidability of the Halting Problem, Marvin Minsky et. al.). This is what I meant upthread about "not being able to go back".

My general worry is that we'll continue to see use on devices and applications where static buffer allocation is a natural fit (SOF is exactly such an app, even though right now it has its own heap to allocate them), and they'll continue to use some kind of backdoor anyway to avoid the dynamic overhead. Let's not tempt them?

@hubertmis
Member

Maybe it will be easier to find common ground if we decide on the basic expected properties of the system we would like to get.

I believe for simplicity we can discuss an application that sends through UART two strings:

  1. "Hello world": static const uint8_t hello_world[] = "Hello world";
  2. a generated random number: static uint8_t rand_num[RN_LEN];

Both buffers are configured statically by an application.

Now, let's consider two types of SoCs for which we compile this application

  1. STM32 where UART driver sends data byte by byte using USART's TDR (Transmit data register)
  2. nRF52 which contains UARTE requiring data to be placed in RAM
  3. I would love to discuss a device with a data cache, but could not find one supported by Zephyr yet.

For STM32 we would like to:

  1. Have hello_world buffer stored in ro-data in flash. The UART device driver copies data byte by byte from flash to USART's TDR.
  2. Have rand_num buffer stored in bss in RAM. The UART device driver copies data byte by byte from this buffer to USART's TDR.

For nRF52 we would like to:

  1. Copy hello_world from flash to RAM and configure UARTE's EasyDMA engine to point to the copy.
  2. Have rand_num buffer stored in bss in RAM. The UARTE's EasyDMA engine points directly to rand_num buffer without copying.

Do you agree these are the system properties we would like to get in a single portable application?
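
For reference, the portable application code I have in mind stays as simple
as this (using the asynchronous UART API; waiting for UART_TX_DONE and error
handling are omitted for brevity):

static const uint8_t hello_world[] = "Hello world";
static uint8_t rand_num[RN_LEN];

void send_both(const struct device *uart_dev)
{
	/* Property 1: hello_world may live in flash; the driver reads it
	 * byte by byte from flash (STM32 case) or copies it to RAM first
	 * (nRF52 UARTE case).
	 */
	uart_tx(uart_dev, hello_world, sizeof(hello_world), SYS_FOREVER_US);

	/* Property 2: rand_num is already in RAM and usable directly. */
	uart_tx(uart_dev, rand_num, sizeof(rand_num), SYS_FOREVER_US);
}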

* @retval 0 on success
* @retval other errno codes depending on the implementation
*/
int dmm_release_buffer_for_dma_out(const struct device *dev, uint32_t flags,
Collaborator

If I am releasing a buffer, why do I need to know the flags?

* @retval 0 on success
* @retval other errno codes depending on the implementation
*/
int dmm_get_buffer_for_dma_out(const struct device *dev, uint32_t flags,
Collaborator

Would it be possible to define where the buffer comes from? Say a memory pool pointer, or just a fixed physical address?

* @param[out] dma_handle DMA handle.
* @param[in] buffer User-provided buffer.
* @param[in] buffer_size Size of the user-provided buffer.
*
Collaborator

Same as for out: can we add a parameter for where the buffer is coming from?

@hubertmis
Member

I was thinking a little about the discussion we had during the last arch meeting.

We would like to achieve multiple quality attributes of the system:

  • low memory consumption
  • stability and accessibility
  • performance
  • developer-friendly API
  • and more

Changing priorities of these attributes leads to different solutions.

I'm wondering if we could merge (make available) all proposed solutions to make it possible for the application developer to choose which of the attributes is the highest priority for the application.

Here is what came to my mind looking at the example we discussed: an audio application getting audio data from SPI and sending it to an I2S output.

Proposal 1: drivers responsible for managing buffers

The application code would be something like this:

uint8_t audio_frame[AUDIO_FRAME_SIZE];

spi_read(spi_dev, audio_frame, AUDIO_FRAME_SIZE);
i2s_send(i2s_dev, audio_frame, AUDIO_FRAME_SIZE);

User-friendliness is perfect - code is easy to write, understand, and maintain. Performance varies: in some SoCs spi_read() and/or i2s_send() would need to allocate bounce buffers and memcpy() data around; in other SoCs the audio_frame buffer could be used directly by DMAs. Memory consumption and stability vary, depending on whether bounce buffer allocators are static or dynamic.

Proposal 2: application responsible for managing buffers

Code example:

DMM_BUFFER(spi_dev, spi_frame, AUDIO_FRAME_SIZE);
DMM_BUFFER(i2s_dev, i2s_frame, AUDIO_FRAME_SIZE);

spi_read(spi_dev, spi_frame, AUDIO_FRAME_SIZE);
memcpy(i2s_frame, spi_frame, AUDIO_FRAME_SIZE);
i2s_send(i2s_dev, i2s_frame, AUDIO_FRAME_SIZE);

User-friendliness is a little worse - using macros to allocate buffers is not intuitive. Performance is worse: we always have at least one memcpy() even if it's not needed in a target SoC. Stability is perfect: with static memory allocation we cannot run out of buffers. Memory consumption may be better or worse than in proposal 1 - it depends on whether the target SoC would require bounce buffers.

Merged proposal: the application may manage buffers using macros, but the responsibility is in the drivers.

With this approach, the application could be implemented in multiple ways. One example would be exactly like "Proposal 1", another exactly like "Proposal 2", sharing all of its attributes.

One more which I think is the most interesting would be:

// Allocate statically buffer for SPI, because buffers written by DMA are usually more restricted than those read by DMA
DMM_BUFFER(spi_dev, audio_frame, AUDIO_FRAME_SIZE);

// Application ensures that the buffer sent to read SPI data can be used by SPI's DMA
spi_read(spi_dev, audio_frame, AUDIO_FRAME_SIZE);
i2s_send(i2s_dev, audio_frame, AUDIO_FRAME_SIZE);
// The buffer passed to I2S is not guaranteed to be usable by I2S's DMA

User-friendliness is similar to "Proposal 2" - using unintuitive macros to allocate buffers. Performance is perfect - the SoC-specific code in the drivers will ensure that we avoid memcpy() if possible. Stability and memory consumption vary between platforms depending on the buffer allocation strategy for I2S.

But what I think is most important in this solution is that we do not enforce "stability over performance" or "performance over user-friendliness". It's the application developer who decides which property is the most important for their application.

Other benefits of the merged proposal are:

  • no changes in the device driver model and existing device driver APIs
  • no changes in existing device drivers
  • no changes needed in existing samples or subsystems

Required changes:

  • new macros in device driver APIs to allow applications to allocate DMA-compliant buffers

@github-actions

github-actions bot commented Sep 3, 2023

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Sep 3, 2023
@github-actions github-actions bot closed this Sep 18, 2023

Successfully merging this pull request may close these issues.

[RFC] Proposing a new DMA Memory Management (DMM) subsystem