
Multiple copies of messages received when on a 3G connection #1106

Closed
ralienpp opened this issue Jul 22, 2016 · 17 comments
Labels: Bug (Things to squish; generally used for issues), Duplicate

Comments

@ralienpp

This behaviour consistently replicates when the bot is on a system that is connected to the Internet over a 3G modem connection. The same system behaves correctly when the connection is over Wi-Fi.

At first glance it seems that the bot sends the same message multiple times, this is what it looks like:

<botmaster> .mycomand purpusefully_wrong_input param2
<bot> botmaster: Incorrect syntax
<bot> botmaster: Incorrect syntax
<bot> botmaster: Incorrect syntax
<bot> botmaster: Incorrect syntax
<bot> botmaster: Incorrect syntax

However, I am certain this happens because sopel "thinks" it received multiple messages, and each of its redundant responses is actually mapped to a redundantly received message. I know that because one of my commands spawns a new process, and the PID of the new process is in the bot's answer.

<botmaster> .reversessh example.com:2222
<bot> botmaster: Establishing SSH connection
<bot> botmaster: Establishing SSH connection
<bot> botmaster: Establishing SSH connection
<bot> botmaster: Establishing SSH connection
<bot> botmaster: Establishing SSH connection
<bot> ...
<bot> botmaster: SSH connection in progress, PID=25667
<bot> botmaster: SSH connection in progress, PID=25684
<bot> botmaster: SSH connection in progress, PID=25685
<bot> botmaster: SSH connection in progress, PID=25690
<bot> botmaster: SSH connection in progress, PID=25702
<bot> botmaster: SSH connection in progress, PID=25703

If those were dupes, the PIDs would be the same. But since the PIDs are different, it means that Sopel actually handled the command N times.
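The PID reasoning can be checked independently of Sopel: every call to Popen creates a new OS process with its own PID, so identical replies carrying different PIDs can only come from separate handler invocations. A minimal stdlib sketch:

```python
from subprocess import Popen

# Two independent invocations of the same command: the OS assigns each
# child its own PID, so distinct PIDs imply distinct spawns.
p1 = Popen(['sleep', '0.1'])
p2 = Popen(['sleep', '0.1'])
print(p1.pid, p2.pid)
p1.wait()
p2.wait()
```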

@maxpowa maxpowa added the Bug label Jul 22, 2016
@maxpowa
Contributor

maxpowa commented Jul 22, 2016

This started happening when I merged a10cf3e, I think it's because Sopel isn't properly handling module unloading/loading when it reconnects. Probably will end up tying into #1056

@ralienpp
Author

Is there a way to rule out this hypothesis? I've noticed this behaviour after a connection was established and maintained for hours on end, so I have reason to doubt that it is related to reconnection logic.

@maxpowa
Contributor

maxpowa commented Jul 25, 2016

Is the connection truly maintained for hours on end? I can't imagine 3G sustaining a connection for that long, it's probable that Sopel experienced a couple ping timeouts (and gracefully reconnected) over the course of that time.

@ralienpp
Author

You're right, it is most likely not the same connection. Is there some kind of logging mechanism I could use to see if this is indeed the case?

What would be the most reasonable troubleshooting strategy for this issue?

@maxpowa
Contributor

maxpowa commented Jul 26, 2016

Yeah, check your stdio.log - it should print something along the lines of "Warning: Disconnected. Reconnecting in 20 seconds..." whenever it loses connection. If you need timestamped logs, check raw.log but you'll have to look for message code 001 or similar.

@ralienpp
Author

Having examined the logs, I found a recurring pattern like the one below, which is repeated as many times as there are duplicate reactions to the incoming message.

Signature: error: [Errno 9] Bad file descriptor (file "/usr/lib/python2.7/socket.py", line 170, in _dummy)                           
Traceback (most recent call last):                                                                                                   
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/bot.py", line 441, in call                                
    exit_code = func(sopel, trigger)                                                                                                 
  File "/home/pi/dktest/venv-sopel/lib/python2.7/site-packages/sopel_modules/mymodule/mymodule.py", line 57, in get_sysinfo                  
    bot.reply(answer)                                                                                                                
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/bot.py", line 419, in reply                               
    self._bot.reply(message, destination, reply_to, notice)                                                                          
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/bot.py", line 378, in reply                               
    self.say(text, dest)                                                                                                             
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/bot.py", line 330, in say                                 
    self.write(('PRIVMSG', recipient), text)                                                                                         
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/bot.py", line 163, in write                               
    irc.Bot.write(self, args, text=text)                                                                                             
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/irc.py", line 148, in write                               
    self.send(temp.encode('utf-8'))                                                                                                  
  File "/home/pi/dktest/venv-sopel/local/lib/python2.7/site-packages/sopel/irc.py", line 260, in _ssl_send                           
    result = self.socket.send(data)                                                                                                  
  File "/usr/lib/python2.7/socket.py", line 170, in _dummy                                                                           
    raise error(EBADF, 'Bad file descriptor')                                                                                        
error: [Errno 9] Bad file descriptor                                                                                                 
Signature: error: [Errno 9] Bad file descriptor (file "/usr/lib/python2.7/socket.py", line 170, in _dummy)  

It thinks the file descriptor is bad (maybe assuming the connection is gone). However, all the responses are eventually sent a bit later - which means that the file descriptor shouldn't have been bad.
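The exception in that traceback is easy to reproduce outside Sopel: sending on a socket whose file descriptor has already been closed raises exactly this errno. A minimal sketch (plain stdlib, not Sopel's code):

```python
import errno
import socket

# Writing to a closed socket raises OSError with errno EBADF
# ([Errno 9] Bad file descriptor), matching the exception in the log.
s = socket.socket()
s.close()
try:
    s.send(b'PRIVMSG #bots :hello\r\n')
except OSError as exc:
    print(exc.errno == errno.EBADF)
```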

@ralienpp
Author

ralienpp commented Jul 6, 2017

The plot thickens - I found a new clue for this puzzle. The problem occurs when I run Sopel on a 3G connection via systemd, but it never occurs when I run it manually as a process.

So, in summary, the systemd + 3G combination yields this behaviour; remove either ingredient and it works as expected.

This is very puzzling, so if you have any suggestions as to what might be causing this - please let me know.

@dgw dgw added the Needs Info label Mar 29, 2018
@dgw
Member

dgw commented Mar 29, 2018

Is there any further info on this? At some point it would be nice to close this as resolved (or out of Sopel's control), and the only way that can happen without just declaring the issue abandoned is with details. :)

@ralienpp
Author

ralienpp commented Apr 3, 2018

The problem persists in the latest versions of Sopel as well, I was unable to troubleshoot it myself, unfortunately. However, I am willing to commit to running tests, trying out betas, or investigating it on my own if someone can provide new educated guesses about the root cause.

@dgw
Member

dgw commented Apr 3, 2018

If it occurs only when running under systemd, that sounds like a possible bug in the systemd environment. But I'm getting ahead of myself—never assume a bug is someone else's fault unless you've proved that it's definitely not yours. 😀

I don't see that you've checked the raw log, so it could help pinpoint this if you set raw_log = true in your Sopel configuration file and reproduce the problem again, then include a relevant snippet here as you did with that exception a year ago (!) and a link to the full log saved somewhere (on a pastebin or something). The raw log should establish definitively whether your Sopel instance is receiving the same message multiple times, or handling the same message multiple times, and knowing which will determine how to approach further troubleshooting / a solution.
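For reference, Sopel's configuration file is INI-style, so the setting dgw mentions would be a key/value pair in it. A sketch, assuming the key name as quoted above and the default config path (check the docs for your Sopel version before relying on either):

```ini
# ~/.sopel/default.cfg (path and section are assumptions)
[core]
raw_log = true
```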

@ralienpp
Author

ralienpp commented Apr 4, 2018

I did check the raw log back then, but did not post it here. At this point I am preparing an environment where I can replicate this. Let me share some initial key facts and observations I've made so far:

  • This runs on a Raspberry Pi with Raspbian 8
  • Sopel starts as a systemd service
  • Inside a virtualenv with Python 2.7
  • At the moment I have a fleet of deployed systems with Sopel 6.3.1. These are buses in a public transport system. Sopel is used to invoke OTA updates, and occasionally to establish reverse SSH tunnels (as the systems don't have a static IP address, I cannot SSH into them directly).

Given that this is a live system, I am reluctant to update to the latest version of Sopel. One system was selected for a "deeper" experiment, but there are some complications: the new Sopel depends on IPython 6.x, which in turn demands Python 3.x, so at the moment my hands are relatively tied, because I don't want to yank out the turtle that stands at the bottom of the "turtles all the way down" stack :-)

Another complication is that this is a headless system running on a physically moving target, in an environment where I have no direct access to it, so I cannot run it outside systemd without losing the guarantee that Sopel will start and show up on my "radar". Having said that:

  1. I cleared the logs on 2 machines
  2. Restarted Sopel
  3. After a clean restart of the service the problem does not manifest itself
  4. Over the course of an hour, the commands are still handled adequately, so perhaps there is some cumulative effect in play

Another observation worth mentioning: I never noticed this behaviour with Sopel's default commands, e.g. .uptime; it only happens with the commands I implemented myself. This would be my top suspicion. Here is an example of such a function:

from subprocess import PIPE, Popen

from sopel.module import commands

ARGUMENT = 2  # trigger.group(2) is everything after the command name


def run(command):
    '''Run the given command and return its output
    and exit_code'''
    process = Popen(command.split(), stdout=PIPE)
    (output, _status) = process.communicate()
    exit_code = process.wait()
    return output, exit_code


@commands('run')
def run_command(bot, trigger):
    '''Run an arbitrary command in the OS and return
    its status code and output'''
    command = trigger.group(ARGUMENT)
    output, status = run(command)
    bot.reply('Status: %i, Stdout: ' % status)
    for line in output.splitlines():
        bot.say(line)
    bot.say('----End of stdout')

It works as expected when behaviour is normal. Otherwise it still works, but the function is invoked multiple times.

@dgw
Member

dgw commented Apr 4, 2018

At the moment I have a fleet of deployed systems with Sopel 6.3.1.

I'll review what's changed since then, for my own information, since I've only been maintaining since the run-up to 6.5.2 and didn't keep track of changes back then. But technically only the newest version of Sopel is supported, and the status of open issues should be confirmed in the last release (or on master)…

Given that this is a live system, I am reluctant to make an update to the latest version of Sopel.

But I understand where you're coming from on updates. 😸

One system was selected for a "deeper" experiment, but there are some complications - the new Sopel depends on Ipython 6.x, which in turn demands Python 3.x

I don't know where this came from. Version 6.5.1 (long the newest release until less than a week ago) didn't require IPython at all—incorrectly, because the ipython core module does require it to load. But when 6.5.2 moved the dependency to the main requirements file, it didn't specify a version. Your installer (presumably pip?) should just automatically fetch a version of IPython that's compatible with the virtualenv you're running it on. Sopel doesn't care what version you give it, as long as it loads.

Another observation that is worth mentioning - I never noticed this behaviour with Sopel's default commands

That's an interesting data point. You shared a pair of function defs from one of your custom modules, but are these modules that manifest the duplicating behavior coded as standalone files (like the core modules) or based on sopel-cookiecutter—i.e. packages? Because if they are, this is essentially a duplicate/extreme case of #1056 (which was mentioned here two years ago, I know) and possibly related to a10cf3e.


Should your newly cleared logs turn up any further information, I'm sure I'll hear about it soon, here.

As an aside, I think it's pretty neat that Sopel has found a use case in a public transit system. I'm glad you elected to share that detail. :-)

@ralienpp
Author

ralienpp commented Apr 5, 2018

Some updates so far: the two systems where the test is being conducted are still not exhibiting the symptoms.

  • S1, Sopel's uptime is 11h, the OS uptime is around 3 days
  • S2, Sopel's uptime is 6h, same as the uptime of its OS

Also:

  • I updated Sopel to 6.5.1, as you mentioned it is the last one that doesn't explicitly require IPython.
  • I installed IPython 5.0.0, as you said it actually is required.

After the update, Sopel was restarted, and I'm keeping an eye on it, by sending commands every now and then.

As for your last question: I did not use the cookiecutter template, but I didn't write it as a standalone file like the standard modules either. I read the documentation, checked some examples, and implemented the plugin. Here's what the tree looks like:

.
├── doc
│   ├── sopel-config-template.cfg
│   ├── systemd-template-raspi.service
│   └── systemd-template.service
├── readme.rst
├── requirements.txt
├── setup.py
└── sopel_modules
    ├── __init__.py
    └── simplex
        ├── __init__.py
        ├── simplex.py
        ├── core.py
        └── standalone.py

Let's say the module is called simplex:

  • core.py contains the actual logic, i.e. a pure function that has nothing to do with Sopel and can be used in any other Python program, it has no dependencies, except the standard Python library. The function run(command) in my previous message is in this file.
  • simplex.py is where run_command(bot, trigger) resides, this file wraps the core functions, using Sopel's decorators and primitives.

Having examined the cookie-cutter template, I am inclined to say that my code replicates that model, as the tree structures are identical.

@dgw
Member

dgw commented Apr 5, 2018

I see. So, you have Sopel load the sopel_modules.simplex package. This is almost certainly related to #1056, because we know that this kind of module (which was newly supported in 6.0) has some quirks related to reloading/reconnecting. It's all but confirmed, and I hope it can be confirmed by your logs once one of your systems manifests the issue.

On the one hand, it's frustrating to see package-module reloading cause issues. But I'd also rather it be the same as an already-known issue than something new. :D
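The failure mode suspected here is easy to picture with a toy dispatcher (an illustration only, not Sopel's actual implementation): if a "reload" registers the freshly imported callable without unregistering the stale one, every incoming message is handled once per registered copy, producing exactly the duplicated replies seen above.

```python
# Toy dispatcher illustrating the suspected failure mode: a reload that
# appends a handler without removing the stale one.
handlers = []

def register(func):
    handlers.append(func)

def dispatch(message):
    # Each registered copy handles the message once.
    return [func(message) for func in handlers]

def on_run(message):
    return 'Status: 0, Stdout: '

register(on_run)   # initial module load
register(on_run)   # naive "reload": old handler never removed
print(dispatch('.run uptime'))
```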

@ralienpp
Author

ralienpp commented Apr 5, 2018

Alright, I have some more clues, here's the raw log and some explanations about what is going on.

  • Note that there are two chat bots, both are running on the same system: r4_systemd and r4_console
  • One is running as a systemd service, the other was started from the console and left running in a byobu session (it is a tool like screen or tmux, but easier to use)
  1. You see me do .run ls /home and I get a nice response (note that it is in the systemd instance).
  2. Then I run .help to remember the command for forcing the reload of a module, inspired by your previous comments :-)
  3. The .help command crashes, but we've already established in the other issue that this has been fixed
  4. I run .run uptime (note that it is in the console instance)
  5. At this stage, both r4_systemd and r4_console react correctly
  6. My muscle memory reminds me that the command I wanted is .reload <module name>, so I do that in the console session, but not in the systemd session.
  7. I run .run uptime in the console session and I get duplicated output
  8. I run .reversessh (my other custom command) in the systemd session and the behaviour is as expected

So based on this, I can say it is not about systemd, not about 3G and not about disconnect events, at least not at the surface :-) I was able to replicate the issue by forcing a .reload simplex.

I then did the same in the systemd instance of Sopel and yes, the problem came back. (This part is not in the log, because I did it after retrieving the log.)

...
A long sequence of PING/PONG ... 
...
>>1522921858.23 PING irc.mothership.hz
<<1522921858.31 :fe7b828616de.irc.mothership.hz PONG fe7b828616de.irc.mothership.hz :irc.mothership.hz
>>1522921941.86 PING irc.mothership.hz
<<1522921941.91 :fe7b828616de.irc.mothership.hz PONG fe7b828616de.irc.mothership.hz :irc.mothership.hz
<<1522921944.26 :[email protected] JOIN :#bots
<<1522921944.26 :[email protected] JOIN :#bots
<<1522921959.67 :[email protected] PRIVMSG r4_systemd :.run ls /home
>>1522921959.7  PRIVMSG ralienpp :ralienpp: Status: 0, Stdout:
>>1522921960.5  PRIVMSG ralienpp :pi
>>1522921961.3  PRIVMSG ralienpp :----End of stdout
<<1522921964.99 :[email protected] PRIVMSG r4_systemd :.help
>>1522921964.99 PRIVMSG ralienpp :Hang on, I'm creating a list.
>>1522921966.05 PRIVMSG ralienpp :Sorry! Something went wrong.
<<1522921972.64 :[email protected] PRIVMSG r4_systemd :.help
>>1522921972.64 PRIVMSG ralienpp :Hang on, I'm creating a list.
>>1522921973.61 PRIVMSG ralienpp :Sorry! Something went wrong.
<<1522921981.08 :[email protected] PRIVMSG r4_console :.help
>>1522921981.09 PRIVMSG ralienpp :Hang on, I'm creating a list.
>>1522921982.39 PRIVMSG ralienpp :Sorry! Something went wrong.
<<1522921986.26 :[email protected] PRIVMSG r4_console :.run uptime
>>1522921986.28 PRIVMSG ralienpp :ralienpp: Status: 0, Stdout:
>>1522921987.37 PRIVMSG ralienpp : 09:53:06 up  8:31,  1 user,  load average: 0.07, 0.02, 0.00
>>1522921988.17 PRIVMSG ralienpp :----End of stdout
<<1522921990.56 :[email protected] PRIVMSG r4_console :.uptime
>>1522921990.57 PRIVMSG ralienpp :I've been sitting here for 2:16:57 and I keep going!
<<1522921994.47 :[email protected] PRIVMSG r4_console :.reload simplex
>>1522921994.55 PRIVMSG ralienpp :ralienpp: <module 'simplex' from '/home/pi/tracker/venv-sopel/lib/python2.7/site-packages/sopel_modules/simplex/__init__.pyc'> (version: 2017-07-03 23:45:33)
<<1522922014.3  :[email protected] PRIVMSG r4_console :.run uptime
>>1522922014.33 PRIVMSG ralienpp :ralienpp: Status: 0, Stdout:
>>1522922015.42 PRIVMSG ralienpp : 09:53:34 up  8:31,  1 user,  load average: 0.04, 0.01, 0.00
>>1522922016.22 PRIVMSG ralienpp :ralienpp: Status: 0, Stdout:
>>1522922017.02 PRIVMSG ralienpp :----End of stdout
>>1522922018.11 PRIVMSG ralienpp : 09:53:34 up  8:31,  1 user,  load average: 0.04, 0.01, 0.00
>>1522922018.91 PRIVMSG ralienpp :----End of stdout
<<1522922038.31 :[email protected] PRIVMSG r4_systemd :.reversessh
>>1522922038.31 PRIVMSG ralienpp :ralienpp: Establishing SSH connection
>>1522922039.2  PRIVMSG ralienpp :ralienpp: SSH connection in progress, PID=14264
>>1522922122.05 PING irc.mothership.hz
<<1522922122.09 :fe7b828616de.irc.mothership.hz PONG fe7b828616de.irc.mothership.hz :irc.mothership.hz
>>1522922158.48 PING irc.mothership.hz
<<1522922158.55 :fe7b828616de.irc.mothership.hz PONG fe7b828616de.irc.mothership.hz :irc.mothership.hz

Final point: note that I have two separate processes writing to the same log files, so some events are duplicated, e.g. <<1522921944.26 :[email protected] JOIN :#bots - that's not an anomaly.
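The raw-log timestamps are Unix epoch seconds, which is awkward to read by eye. A small helper (format inferred from the log above; not part of Sopel) to split a line into direction, UTC time, and payload:

```python
from datetime import datetime, timezone

def parse_raw_line(line):
    """Split a Sopel raw-log line such as
    '>>1522921858.23 PING irc.mothership.hz'
    into (direction, UTC datetime, payload)."""
    direction = line[:2]                      # '>>' sent, '<<' received
    ts_text, _, payload = line[2:].partition(' ')
    when = datetime.fromtimestamp(float(ts_text), tz=timezone.utc)
    return direction, when, payload.strip()

d, when, payload = parse_raw_line('>>1522921858.23 PING irc.mothership.hz')
print(d, when.isoformat(), payload)
```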

@ralienpp
Author

ralienpp commented Apr 5, 2018

So yeah, in retrospect, it is obviously #1056; the whole 3G/systemd saga was a false trail.

@dgw
Member

dgw commented Apr 5, 2018

Based on that raw log, yes I think we definitely have enough information to call this a duplicate of #1056! Believe it or not, I consider this a victory. It means there's still only one (known) bug related to modules, commands, and reconnection/reloading, instead of two separate ones to find and fix.

I don't have a timeline on fixing #1056, but I'm not going to release Sopel 7 without squishing that nasty bug, for reasons brought up in other issues (like #1291, which result in an awful user experience without fixing this reloading problem).

Thank you for taking the time to dump and sanitize your logs for me, @ralienpp! At least I can take great pleasure in closing an old issue. :-)
