Skip to content
This repository has been archived by the owner on Sep 6, 2023. It is now read-only.

Write to nonblocking sockets fails #166

Open
peterhinch opened this issue Aug 22, 2017 · 10 comments
Open

Write to nonblocking sockets fails #166

peterhinch opened this issue Aug 22, 2017 · 10 comments

Comments

@peterhinch
Copy link
Contributor

peterhinch commented Aug 22, 2017

In the course of updating this asynchronous MQTT driver I want to make it compatible with ESP32. The code uses a nonblocking socket. I get OSError 119 when writing to the socket. The following test runs on ESP8266 but fails on ESP32.

import network
import socket
from time import sleep
import uerrno
SERVER = '192.168.0.9'  # can use 'iot.eclipse.org'
s = network.WLAN(network.STA_IF)
s.active(True)

def connect():
    s.disconnect()
    sleep(1)
    s.connect('SSID', 'PASSWORD')  # Redacted EDIT this
    print('Awaiting conection')
    while not s.isconnected():
        sleep(1)
    print('Got conection, pausing')  # Just to make sure
    sleep(10)
    print('conection done')

def foo(n):
    print('Attempt:', n)
    connect()
    addr = socket.getaddrinfo(SERVER, 1883)[0][-1]  # MQTT
    s = socket.socket()
    s.setblocking(False)
    try:
        s.connect(addr)
    except OSError as e:
        if e.args[0] not in [uerrno.EINPROGRESS, uerrno.ETIMEDOUT]:
            print('Error connecting', e)
    try:
        s.write(b"\xc0\0") # MQTT ping
        print('Write succesful.')
    except OSError as e:
        print('**** ERROR writing ****', e)
    s.close()

foo(1)

Note that I issue disconnect() before connecting. This is because my driver reconnects after a network outage. It also ensures consistent behaviour if an application is started with a network connection already active.

@MrSurly
Copy link
Contributor

MrSurly commented Aug 22, 2017

Looking into this ...

@MrSurly
Copy link
Contributor

MrSurly commented Aug 22, 2017

Why not check for EINPROGRESS when attempting to write? Code below works for me

import network
import socket
from time import sleep
import uerrno
SERVER = '192.168.0.9'  # can use 'iot.eclipse.org'
SERVER = 'iot.eclipse.org'
s = network.WLAN(network.STA_IF)
s.active(True)

def connect():
    s.disconnect()
    sleep(1)
    s.connect('SSID', 'PASSWORD')  # Redacted EDIT this
    print('Awaiting conection')
    while not s.isconnected():
        sleep(1)
    print('Got conection, pausing')  # Just to make sure
    sleep(3)
    print('conection done')

def foo(n):
    print('Attempt:', n)
    connect()
    addr = socket.getaddrinfo(SERVER, 1883)[0][-1]  # MQTT
    s = socket.socket()
    s.setblocking(False)
    try:
        s.connect(addr)
    except OSError as e:
        if e.args[0] not in [uerrno.EINPROGRESS, uerrno.ETIMEDOUT]:
            print('Error connecting', e)
            return

    attempts = 3
    while attempts:
        attempts -= 1
        try:
            s.write(b"\xc0\0") # MQTT ping
            print('Write succesful.')
            break
        except OSError as e:
            if e.args[0] not in [uerrno.EINPROGRESS, uerrno.ETIMEDOUT]:
                print('**** ERROR writing ****', e)
        sleep(.5)
    s.close()

foo(1)

@peterhinch
Copy link
Contributor Author

Here it works only by ignoring the 119 error and trying again. I've never seen the ESP8266 raise 119 and I'm not sure of its significance - perhaps you could advise. Is it always legitimate to trap it and re-send the entire buffer (which in the general case might be long)?

Running your code with only the WiFi credentials changed:

>>> import rats1
I (179751) wifi: wifi firmware version: f092575
I (179751) wifi: config NVS flash: enabled
I (179751) wifi: config nano formating: disabled
I (179751) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
I (179761) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
I (179781) wifi: Init dynamic tx buffer num: 32
I (179781) wifi: Init dynamic rx buffer num: 64
I (179781) wifi: wifi driver task: 3ffd926c, prio:23, stack:4096
I (179781) wifi: Init static rx buffer num: 10
I (179791) wifi: Init dynamic rx buffer num: 0
I (179791) wifi: Init rx ampdu len mblock:7
I (179801) wifi: Init lldesc rx ampdu entry mblock:4
I (179801) wifi: wifi power manager task: 0x3ffde558 prio: 21 stack: 2560
I (179811) wifi: wifi timer task: 3ffdf5d4, prio:22, stack:3584
I (179831) phy: phy_version: 355.1, 59464c5, Jun 14 2017, 20:25:06, 0, 0
I (179831) wifi: mode : null
I (179841) modsocket: Initializing
I (179851) wifi: mode : sta (24:0a:c4:05:e5:58)
I (179851) wifi: STA_START
Attempt: 1
Awaiting conection
I (181211) wifi: n:3 0, o:1 0, ap:255 255, sta:3 0, prof:1
I (181771) wifi: state: init -> auth (b0)
I (181781) wifi: state: auth -> assoc (0)
I (181781) wifi: state: assoc -> run (10)
I (182681) wifi: connected with misspiggy, channel 3
I (182681) wifi: event 4
I (183261) event: ip: 192.168.0.33, mask: 255.255.255.0, gw: 192.168.0.1
I (183261) wifi: GOT_IP
Got conection, pausing
conection done
**** ERROR writing **** 119
Write succesful.
>>> I (191781) wifi: pm start, type:0


>>>  

@MrSurly
Copy link
Contributor

MrSurly commented Aug 23, 2017

The ESP8266 and ESP32 use two different versions of LWIP (ESP8266 version is from 2012, ESP32 from 2016).

The ESP8266 version should throw an EINPROGRESS if it doesn't connect immediately when timeout is set to 0. The fact that it doesn't suggests the underlying LWIP implementation is different. The ESP32 version doesn't seem to implement timeouts for connect at all, unless one of the underlying LWIP timeout options also applies to connect. But it fails on write -- it's almost as if the write triggers something in LWIP to complete the connection process.

@peterhinch
Copy link
Contributor Author

Which I'd understand if it threw EINPROGRESS or ETIMEDOUT on the first call to write(). The remedy would be to delay and retry. But error 119 is ENAVAIL "No XENIX semaphores available" which seems unexpected.

Throwing an exception on connect wouldn't help the user unless there were a way subsequently to poll the connection status before using the socket. In the absence of such a method the only way seems to be to use the socket, trap the exception, and retry until at least one character is written; I guess this conforms to modern Python coding practices.

I performed a quick test where I trapped the 119, including tests of writing a longer string. It seems that it only fails on the first pass. The second pass correctly writes the string.

So perhaps this boils down to the LWIP raising a "wrong" exception number in the event of a pending connected status. If you think we're on the right track I'll proceed on that basis.

@nickzoic
Copy link
Collaborator

I turned some debugging on ... lwip_send is getting error -5 back from netconn_write_partly and then translating that to EINPROGRESS. The trouble is that EINPROGRESS isn't consistently defined:

esp-idf/components$ egrep -r '#define\s+EINPROGRESS\s'
lwip/include/lwip/lwip/arch.h:#define  EINPROGRESS    115  /* Operation now in progress */
newlib/include/sys/errno.h:#define EINPROGRESS 119		/* Connection already in progress */

LWIP seems to be picking up on 119 definition, whereas lwip would normally define it as 115 and that's what py/mperrno.h things too ...

@dpgeorge
Copy link
Member

LWIP seems to be picking up on 119 definition, whereas lwip would normally define it as 115 and that's what py/mperrno.h things too ...

Nice digging! There's already code in esp32/modsocket.c to convert EINPROGRESS to MP_EINPROGRESS, so the fix should be to make sure that the EINPROGRESS in that file is the same one that lwip uses. It's strange that the newlib errno.h is being used though.... how is that happening, is it on purpose?

@peterhinch
Copy link
Contributor Author

I should add that I've also seen 118 in my testing. My code treats EINPROGRESS, ETIMEDOUT, 118 and 119 as indicating that the connection is not yet established. So far it seems reliable.

@nickzoic
Copy link
Collaborator

Yeah, I didn't quite get to the bottom of it but there's a LWIP_PROVIDE_ERRNO flag there which isn't being set, instead it is using newlib's one. Not sure why we define our own MP_E* as well ...

@dpgeorge
Copy link
Member

Not sure why we define our own MP_E* as well ...

Because some toolchains use non-standard errno codes and in the past was confusing that some ports would report certain error numbers, while others would report other numbers for the same thing. It was decided to standardise around one set of errno codes so that it was easier to work out what the code corresponded to (for those codes that don't have a human-readable string associated with them in the uerrno module).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants