
Continuous Serialization and Image Data Streaming #676

Open
Jake-Carter opened this issue Nov 14, 2023 · 12 comments

@Jake-Carter

Issue template

  • Hardware description: MAX78000 / Custom board
  • RTOS: FreeRTOS
  • Installation type: micro_ros_setup / custom static library
  • Version or commit hash: Humble

Hello - I'm an engineer with Analog Devices, and I've been working on micro-ROS support for our microcontrollers, starting with our embedded AI micros. I've completed the port and custom transports following the excellent tutorials, so first off, thank you for the great documentation and project. I would eventually like to open a PR with official part support for our MSDK microcontrollers, and I'm building up a cool demo using an OpenManipulator-X running some custom object detection on our MAX78000.

Let me know if there's a better channel/repo to go through for questions. I couldn't get the Slack channel invite to work.

My current challenge is related to the topic of continuous serialization mentioned at the bottom of this tutorial page.

I'm currently publishing a sensor_msgs/Image message successfully, but the transmissions are very slow since the message is broken up into many packets. I would like some way to continuously stream the image data instead, while still complying with the expected message framing protocol. So...

  1. Is "continuous serialization" what I'm looking for?
  2. The tutorial says the ping_uros_agent example shows an example for continuous serialization of image data, but I don't see it. Are there any examples for this?
  3. I can sort of guess at what the APIs do based on the tutorial, but the API documentation for the microcdr and continuous_serialization modules is somewhat limited. I'm confused about what the ucdr_alignment functions do, and also whether it's possible to stream image data row by row from the serialization callback. In the given example, does writing into the ucdr buffer push the data out into the transport layer?
// Implementation example:
void serialization_cb(ucdrBuffer * ucdr)
{
    size_t len = 0;
    micro_ros_fragment_t fragment;

    // Serialize array size
    ucdr_serialize_uint32_t(ucdr, IMAGE_BYTES);

    while (len < IMAGE_BYTES) {
        // Wait for new image "fragment"
        ...

        // Serialize data fragment
        ucdr_serialize_array_uint8_t(ucdr, fragment.data, fragment.len); // <-- (JC): Does this go out to the transport layer?...
        len += fragment.len;
    }
}  // ... or is the data finally sent here, when the callback returns?

I also have some more general suggestions/questions related to some challenges I had in developing the custom transports, and would love to contribute back to the project in any way I can.

Thank you,
Jake

@pablogs9
Member

Hello @Jake-Carter,

Hello - I'm an engineer with Analog Devices, and I've been working on micro-ROS support for our microcontrollers, starting with our embedded AI micros. I've completed the port and custom transports following the excellent tutorials, so first off, thank you for the great documentation and project. I would eventually like to open a PR with official part support for our MSDK microcontrollers, and I'm building up a cool demo using an OpenManipulator-X running some custom object detection on our MAX78000.

Nice to hear that. I'm going to share this internally so we can be in touch.

Is "continuous serialization" what I'm looking for?

Continuous serialization is an advanced feature of the middleware that lets the user control a multi-stage serialization, and it imposes some restrictions on the user type. The main one is that you need to "remove" the buffer part of your type. This implies modifications to your sensor_msgs/Image type.
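
For reference, the mechanism exposes two callbacks: one that reports the total serialized size up front (so the middleware can write the XRCE submessage header) and one that fills the buffer on the fly. Below is a minimal sketch; the names are from the rmw_microxrcedds API as I remember it (the "continous" spelling included), so double-check the exact signatures against your rmw version:

// Sketch only: verify these names/signatures in rmw_microxrcedds.
void size_cb(uint32_t * size)
{
    // Report the full payload up front: uint32 length prefix + raw bytes.
    *size += 4 + IMAGE_BYTES;
}

void serialization_cb(ucdrBuffer * ucdr)
{
    ucdr_serialize_uint32_t(ucdr, IMAGE_BYTES);
    // ... serialize the image in pieces with ucdr_serialize_array_uint8_t() ...
}

// Attached to a publisher before publishing, roughly:
// rmw_uros_set_continous_serialization_callbacks(size_cb, serialization_cb);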

I'm not sure this is the most straightforward way; you are probably looking to increase your transport MTU and/or the middleware stream history. How big is your payload?
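
For example, with micro_ros_setup these are usually tuned in your colcon.meta (a sketch; the flag names assume a custom transport build and the standard micro-ROS packages):

{
    "names": {
        "microxrcedds_client": {
            "cmake-args": [
                "-DUCLIENT_CUSTOM_TRANSPORT_MTU=2048"
            ]
        },
        "rmw_microxrcedds": {
            "cmake-args": [
                "-DRMW_UXRCE_STREAM_HISTORY=4"
            ]
        }
    }
}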

I also have some more general suggestions/questions related to some challenges I had in developing the custom transports, and would love to contribute back to the project in any way I can.

I've just accepted you in the micro-ROS Slack, do not hesitate to contact me, open new issues or contribute via pull requests.

@Jake-Carter
Author

Thanks @pablogs9,

The ability to increase the transport MTU is very helpful. My current payload is a 160x120 RGB888 image (57600 bytes). My microcontroller only has 128KB of SRAM, so it's fairly constrained.

FreeRTOS, the micro-ROS library, and my setup code seem to take up about 62KB of SRAM, so I was able to increase the MTU size to 2048 before I started to run out of space. It does speed things up by the expected factor of 4x over the default, though.

There are some tricks I can do with the CNN accelerator, so this will let me proceed in the short term. In general, are there any disadvantages to extremely large MTU sizes?

I have a couple other questions as well, so I will reach out via Slack.

Thanks for your support.

@pablogs9
Member

Well, sending a payload that is almost 45% of your available memory is always kind of hard. In my experience sending single images over micro-ROS/XRCE is possible, but sending video will require some more resources.

In any case, even using continuous serialization will force you to use best-effort streams, which implies that losing a single fragment will cause the whole frame to be lost.

Before going into continuous serialization, do you have any possibilities of compressing into JPEG and sending it via CompressedImage?
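
If compression is feasible on your side, the publishing end is simple. A minimal sketch, assuming rclc and hypothetical jpeg_buf/jpeg_len coming from your encoder:

#include <stdint.h>
#include <string.h>
#include <rclc/rclc.h>
#include <sensor_msgs/msg/compressed_image.h>

// Sketch: publish an already-encoded JPEG buffer as a CompressedImage.
// `node` is an initialized rcl_node_t; jpeg_buf/jpeg_len are hypothetical.
void publish_jpeg(rcl_node_t * node, uint8_t * jpeg_buf, size_t jpeg_len)
{
    static rcl_publisher_t pub;
    static bool initialized = false;
    if (!initialized) {
        rclc_publisher_init_default(
            &pub, node,
            ROSIDL_GET_MSG_TYPE_SUPPORT(sensor_msgs, msg, CompressedImage),
            "image/compressed");
        initialized = true;
    }

    sensor_msgs__msg__CompressedImage msg;
    memset(&msg, 0, sizeof(msg));
    msg.header.frame_id.data = (char *) "camera";
    msg.header.frame_id.size = strlen("camera");
    msg.header.frame_id.capacity = msg.header.frame_id.size + 1;
    msg.format.data = (char *) "jpeg";
    msg.format.size = strlen("jpeg");
    msg.format.capacity = msg.format.size + 1;
    msg.data.data = jpeg_buf;   // borrow the encoder's buffer, no copy
    msg.data.size = jpeg_len;
    msg.data.capacity = jpeg_len;

    rcl_publish(&pub, &msg, NULL);
}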

@Jake-Carter
Author

Jake-Carter commented Nov 23, 2023

Thanks @pablogs9, there are a couple of challenges I found this week:

  • For camera sensors without a built-in JPEG encoder we are reliant on our floating-point engine and the CMSIS-DSP library to implement the compression. This is pretty slow, and it's difficult to run it fast enough to keep up with the camera's data rate.
  • In the past, we have run DCTs inside our CNN accelerator at roughly 60x the speed of CMSIS-DSP, but only for 1D audio. We think it may be possible to implement it for 2D images, but it will take some more research. So at the moment, we don't have a good hardware solution for image compression.
  • All of our CNN models have been trained with uncompressed image data. In general, we haven't tried using compressed inputs but there is some interesting research that shows it could be promising. In the meantime, though, obtaining the uncompressed images is important for us to be able to test the exact data input and collect datasets.

If continuous serialization could offer additional speed improvements I would be interested in exploring it. From what I've seen so far, there are two sources of delay:

Delays between each MTU

Your guidance on increasing the MTU size helped a lot with this, and I've achieved good results reducing this delay as much as my memory allows. I'm not sure where the associated overhead is coming from, but maybe as I get more familiar with the library I can test it further. It could also be related to my FreeRTOS port and implementation, or just unavoidable small delays from the complexity of the library.

Delay gaps inside each MTU

I'm seeing almost a 1ms delay inside each MTU, and this was more unexpected.

It's happening because the library splits up the frame bytes and data bytes into two separate transport calls, but the time it takes between them is longer than I expected. This was one of the main challenges I had developing the custom transports since my UART FIFOs are very shallow (only 8 bytes). I ended up implementing a queue to extend my FIFO so I wouldn't miss bytes inside each MTU.

For example, I captured the logic trace below on the RX side while I was developing my transport functions.

  • When the "indicator" signal is high, I am inside my serial transports.
  • When the "indicator" signal is low, my transport functions have exited, and the micro-ROS library has control.
  • Ignore the small blip towards the end of the gap - I used that to measure my DMA setup time.

You can see it's actively waiting for the frame data first. It gets enough bytes and returns (B). The micro-ROS library takes about 800 µs to jump back into the transport read for the rest of the data (A).

[logic analyzer capture: RX transport activity showing the ~800 µs gap between the frame and data reads]

size_t vMXC_Serial_Read (
        struct uxrCustomTransport* transport,
        uint8_t* buffer,
        size_t length,
        int timeout,
        uint8_t* error_code)
{
    TickType_t elapsed = 0;
    const TickType_t xMaxBlockTime = pdMS_TO_TICKS(timeout);

    MXC_GPIO_OutSet(indicator.port, indicator.mask); // <-- A (transition low to high, we have entered the transport)

    unsigned int num_received = 0;
    while(num_received < length && elapsed < xMaxBlockTime) {
        if (uxQueueMessagesWaiting(rx_queue) > 0) {
            if(xQueueReceive(rx_queue, &buffer[num_received], 1)) {
                num_received++;
            }
        }
        elapsed++;
    }

    MXC_GPIO_OutClr(indicator.port, indicator.mask); // <-- B (transition high to low, we are exiting the transport)

    return num_received;
}

So, since there is ~1ms of delay inside each MTU and I need thousands of MTUs to transmit the large image data, I was hoping that continuous serialization would give me the hooks I need to transmit my frame data manually. That way I could simultaneously eliminate the 1ms delay inside each MTU and the delay between MTUs.

Sorry for the novel :) - just wanted to provide some more context into the challenges I've seen so far with the transmission speed and extremely large messages.

@pablogs9
Member

How are you ensuring that this is an active wait when there are no messages in the queue:

while (num_received < length && elapsed < xMaxBlockTime) {
    if (uxQueueMessagesWaiting(rx_queue) > 0) {
        if (xQueueReceive(rx_queue, &buffer[num_received], 1)) {
            num_received++;
        }
    }
    elapsed++;
}

I mean, if uxQueueMessagesWaiting(...) == 0, this will loop for less than xMaxBlockTime ticks, right?
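
For reference, a non-spinning version (a quick sketch, untested on your port) would block on the queue for the remaining timeout instead of counting loop iterations:

size_t vMXC_Serial_Read(struct uxrCustomTransport * transport,
                        uint8_t * buffer, size_t length,
                        int timeout, uint8_t * error_code)
{
    (void) transport;
    (void) error_code;

    const TickType_t deadline = xTaskGetTickCount() + pdMS_TO_TICKS(timeout);
    size_t num_received = 0;

    while (num_received < length) {
        TickType_t now = xTaskGetTickCount();
        if (now >= deadline) {
            break;
        }
        // Sleep until a byte arrives or the remaining timeout expires.
        if (xQueueReceive(rx_queue, &buffer[num_received], deadline - now) == pdTRUE) {
            num_received++;
        }
    }

    return num_received;
}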

I also wonder why you are struggling with the reception of packets and serial read operations if your objective is to send an image.

Could you clarify these two points?

@Jake-Carter
Author

How are you ensuring that this is an active wait when there are no messages in the queue:

I have my DMA controller constantly unloading the RX FIFO behind the scenes. On every byte, it triggers an ISR that places the received byte in rx_queue.

My full transport implementation can be found here.
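
In short, the ISR side follows this pattern (a simplified sketch of the linked code; the ISR and register-read names here are placeholders):

// Each received byte is pushed into rx_queue from the ISR; the transport
// read function then drains the queue.
static QueueHandle_t rx_queue;  // created with xQueueCreate(..., sizeof(uint8_t))

void UART_RX_IRQHandler(void)   // placeholder ISR name
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    uint8_t byte = read_uart_rx_byte();  // placeholder register read

    xQueueSendFromISR(rx_queue, &byte, &xHigherPriorityTaskWoken);
    portYIELD_FROM_ISR(xHigherPriorityTaskWoken);
}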

I also wonder why you are struggling with the reception of packets and serial read operations if your objective is to send an image.

I have everything working now, but this was something I struggled with a few weeks ago.

I wanted to show the read side because the timing issues caused more critical failures to connect with the micro-ROS agent. The gap above can cause incoming bytes to be missed, whereas any delays on the TX side will just slow down communication. Also, I only saved a logic capture for the read side.

Today we started a short Thanksgiving break so I will capture a trace during an image transmission as soon as I can next week.

@Jake-Carter
Author

Hi @pablogs9, I have some updated captures that show the two types of delay more clearly. The trace can be opened with Saleae Logic. The zip also includes a -v6 agent log file. The baud rate used is 115200.

adi_micro-ros_tx_image_captures.zip

Delays between each Transport Unit

Here is an image that highlights the delays between each image data packet (I hope "Transport Unit" is the right term here?).

On average it's 200-300ms per TU.

[logic capture: ~200-300ms delays between successive image-data transport units]

When the red "Indicator" line is high, the code is inside my custom serial write function. Here is a closer look between two TUs.

[logic capture: zoomed view of the gap between two TUs]

Delays inside each Transport Unit

This image shows the delay between the frame and data portions of the transport unit. It's actually worse than the 1ms delay I captured on the RX side, since it looks like the publisher is waiting on a response from the agent for the frame.

The delay originally varied between 1-16ms.

[logic capture: 1-16ms gap between the frame and data portions of a TU]

After I decreased my USB latency timer with echo 1 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer, the variability improved to about 2-3ms.
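
(As an aside: to make that setting survive replugs, I believe a udev rule along these lines works, though treat the exact rule as an assumption to verify on your distro:)

# e.g. /etc/udev/rules.d/99-ftdi-low-latency.rules (assumed path)
ACTION=="add", SUBSYSTEM=="usb-serial", DRIVER=="ftdi_sio", ATTR{latency_timer}="1"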

[logic capture: the same gap reduced to ~2-3ms after lowering the USB latency timer]

Continuous Serialization?

So - basically I would like to know if continuous serialization would let me bypass most of the framing/packetization requirements for the image data. Ideally I'd like to send just one frame and then manually serialize the data as I receive it.

Thanks for your support,
Jake

@pablogs9
Member

Hello @Jake-Carter,

So - basically I would like to know if continuous serialization would let me bypass most of the framing/packetization requirements for the image data. Ideally I'd like to send just one frame and then manually serialize the data as I receive it.

Continuous serialization will behave the same way: in this mode you provide the serialization data on the fly, but the transport and framing layers are unchanged.

After I decreased my USB latency timer with echo 1 > /sys/bus/usb-serial/devices/ttyUSB1/latency_timer, the variability improved to about 2-3ms.

This detail led me to think that those delays are related to your underlying hardware. Did you perform any test without micro-ROS?

@Jake-Carter
Author

I see, thanks @pablogs9. Could you provide any guidance on the colcon options for building the micro-ROS library with stream framing disabled?

Is

"microxrcedds_client": {
    "cmake-args": [
        // ...,
        "-DUCLIENT_PROFILE_STREAM_FRAMING=OFF",
        // ...
    ]
},

and

rmw_uros_set_custom_transport(
        // MICROROS_TRANSPORTS_FRAMING_MODE,
        MICROROS_TRANSPORTS_PACKET_MODE, // <-- Use "packet" mode instead of framing when setting custom transports
        (void *)&transport_config,
        vMXC_Serial_Open,
        vMXC_Serial_Close,
        vMXC_Serial_Write,
        vMXC_Serial_Read
    );

sufficient?

This detail led me to think that those delays are related to your underlying hardware. Did you perform any test without micro-ROS?

I see the same general ~1ms USB latency even without micro-ROS. We're going through an FTDI USB-serial converter, so I think that's unavoidable. However, the framing protocol itself introduces an additional ~1-2ms per packet just waiting on the header response, and the 200-300ms delay between each packet is definitely from the micro-ROS library.

@pablogs9
Member

I see, thanks @pablogs9. Could you provide any guidance on the colcon options for building the micro-ROS library with stream framing disabled?

You cannot run on top of a serial port without framing, because the agent needs to "isolate" each XRCE packet. Non-framing mode is used for transports that ensure packetization themselves; UDP is an example.

I'm not sure about the implications of this, but maybe it would help to increase the buffer sizes of the framing module; check rb and wb here:

https://github.com/eProsima/Micro-XRCE-DDS-Client/blob/0c6743ffa358f26ca9433e951c534ec2f96be37a/include/uxr/client/profile/transport/stream_framing/stream_framing_protocol.h#L41

I'm not sure if this will have implications on the behavior of the transport.

I see the same general ~1ms USB latency even without micro-ROS. We're going through an FTDI USB-serial converter, so I think that's unavoidable. However, the framing protocol itself introduces an additional ~1-2ms per packet just waiting on the header response, and the 200-300ms delay between each packet is definitely from the micro-ROS library.

Is your application code available so I can take a look or try to replicate it in another board to check those delay values?

@Jake-Carter
Author

Hi @pablogs9, hope you've been well and had a good start to the new year.

I've been working on an internal beta release for micro-ROS integration into the MSDK, and have staged things on the dev/micro-ros branch of our repo. I've written an install.py script that installs ROS + micro-ROS and builds the micro-ROS Agent using the micro_ros_setup scripts. (Documentation here). Maybe it will be useful as a contribution back to the micro-ROS repos in the future.

On this ticket - most of my troubles were coming from a lack of knowledge of the concepts, especially the QoS models. "Best effort" streams match my applications much better. All the delays and jitter seem to come from the Linux side, so eliminating as many message frames as possible works great. For your reference, my app code is available here and library support files here.

...

However, I saw failures when I tried to publish an image with best effort and traced it to the stream implementation here. I noticed stream->size is set to UCLIENT_CUSTOM_TRANSPORT_MTU, and that best effort does not implement any message fragmentation, so for larger messages it returns an error.

In your experience, would it be possible to implement the same message fragmentation as the reliable streams here, but without the extra XRCE frame headers/confirmations added? In my case I would be willing to accept data loss in favor of reduced transmission latency.

@pablogs9
Member

Hello @Jake-Carter, nice to hear about your progress. We are certainly interested in having this integrated into the micro-ROS repos.

WRT your question: in XRCE, best-effort streams do not allow fragmentation, so if your payload is an image you need to use reliable streams or configure a buffer big enough that an image fits.
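
If a buffer that large is not an option on your side, a common workaround is application-level fragmentation (not a micro-ROS feature, just plain chunking): publish pieces that each fit the best-effort stream and reassemble on the ROS 2 side, accepting that a lost chunk means a lost frame. A rough sketch, using UInt8MultiArray purely for illustration:

#include <stddef.h>
#include <stdint.h>
#include <rcl/rcl.h>
#include <std_msgs/msg/u_int8_multi_array.h>

#define CHUNK_BYTES 1024  // must fit the best-effort stream / transport MTU

// Rough sketch: publish `image` as fixed-size best-effort chunks. A real
// implementation would prepend a frame id/offset so the receiver can detect
// lost chunks and reassemble.
void publish_image_chunks(rcl_publisher_t * pub, uint8_t * image, size_t len)
{
    std_msgs__msg__UInt8MultiArray msg;
    std_msgs__msg__UInt8MultiArray__init(&msg);

    for (size_t off = 0; off < len; off += CHUNK_BYTES) {
        size_t n = (len - off) < CHUNK_BYTES ? (len - off) : CHUNK_BYTES;
        msg.data.data = &image[off];  // borrow the image buffer, no copy
        msg.data.size = n;
        msg.data.capacity = n;
        rcl_publish(pub, &msg, NULL);
    }
    // Do not call __fini while msg.data borrows the image buffer.
}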
