This repository has been archived by the owner on May 9, 2020. It is now read-only.

OSX Mavericks Process Killed on Fork #29

Open
andrewgross opened this issue Jan 29, 2014 · 15 comments

Comments

@andrewgross

Hey,

Discovered a nasty bug that only seems to manifest on the latest version of OSX, Mavericks, due to some additional security features they enabled.

The issue arises when you try to spawn a subprocess, or use os calls like execvp which replace your running process. It seems that when there are open remote connections, the python process has a file descriptor pointing to a Mac kernel queue so that it can receive events from the OS. This shows up in lsof -p $PID as a KQUEUE.

This behavior is normal, but before running any of the fun OS process commands, the documentation warns us to flush STDOUT and close any open file handles as these would normally be inherited by the child process.
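As a minimal sketch of that advice (the host and the list of open files are placeholders for illustration, not PyChef code), the pre-exec cleanup looks like:

```python
import os
import sys

def exec_ssh(host, open_files=()):
    # Flush buffered output and close handles we opened ourselves,
    # as the docs advise, before replacing this process image.
    sys.stdout.flush()
    for f in open_files:
        f.close()
    os.execvp('ssh', ['ssh', host])
```

The kqueue descriptor described above is the catch: it is not a handle your own code opened, so there is nothing obvious to close here.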

Unfortunately, it seems that something about the way PyChef builds its connections prevents them from being cleaned up properly in this use case. In the code where I discovered the bug I also used boto extensively, but I was unable to replicate the issue with boto the way I can with PyChef below.

When you trigger the bug, your process will exit, only stating Killed: 9 in the terminal. In the Mac Console App you can see that the system log has ... ssh: guarded fd exception: ... and a link to a kernel dump.

If you have a Mac with a fresh install of OSX Mavericks, you can replicate the bug with the following code:

import os
from chef import ChefAPI, Search

# Fill in your own Chef server details here
server_url = 'https://chef.example.com'
key_path = '/path/to/client.pem'
username = 'client-name'

ChefAPI(server_url, key_path, username)   # becomes the default API
Search('node', 'role:*')[0]               # any remote call opens the kqueue fd
os.execvp('ssh', ['ssh', '[email protected]'])

I haven't had a chance to dive into the PyChef code to figure out where it is leaking connections (or perhaps some other event that it is subscribing to from the queue).

@andrewgross
Author

Pretty sure the issue is coming from this line https://github.com/coderanger/pychef/blob/master/chef/api.py#L195

We are not closing the connection at the end. For an example use case check out http://stackoverflow.com/questions/3880750/closing-files-properly-opened-with-urllib2-urlopen
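The pattern from that answer, sketched here in modern Python for reference (urllib2 became urllib.request in Python 3; contextlib.closing guarantees close() runs even if the read fails):

```python
import contextlib
import urllib.request

def fetch(url):
    # Ensure the response object is closed even if read() raises,
    # so no socket outlives the call.
    with contextlib.closing(urllib.request.urlopen(url)) as response:
        return response.read()
```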

I will try to work out a PR soon.

EDIT: Still working on this, did not seem to solve the problem by adding .close().

@andrewgross
Author

Digging deeper, this seems to be a bug in urllib2. A new minimal test case:

import os
import urllib2
response = urllib2.urlopen('http://www.google.com')
data = response.read()
response.close()
os.execvp('ssh', ['ssh', '[email protected]'])
Killed: 9

@andrewgross
Author

I am unable to produce the same error when using python-requests. Any chance of switching over to that?

import os
import requests
response = requests.get('http://www.google.com')
os.execvp('ssh', ['ssh', '[email protected]'])
# Doesn't crash

@coderanger
Owner

@andrewgross I wouldn't want to introduce a new dependency just for this given the current nebulous and unknown nature of things, but switching to both requests and cryptography will probably happen when I do a security overhaul on the SSL and signature code at some point. That's a much bigger change though.

@andrewgross
Author

That is unfortunate as I currently need the ability to execv/fork from some code using PyChef. I will explore other options.

Thanks

@coderanger
Owner

@andrewgross For now you should be able to trivially subclass ChefAPI in your code and replace the usage of urllib2 :-) Just override that one hook method and use a different transport (such as requests).
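A sketch of that suggestion follows. The hook name `_request` is an assumption based on PyChef's api.py, so verify it against your installed version; a minimal stand-in base class is used here so the sketch is self-contained rather than importing chef.

```python
class ChefAPI:
    """Stand-in for chef.ChefAPI; only the assumed transport hook is shown."""
    def _request(self, method, url, data, headers):
        raise NotImplementedError  # the real PyChef calls urllib2 here

class RequestsChefAPI(ChefAPI):
    """Override the one transport hook to use requests instead of urllib2."""
    def _request(self, method, url, data, headers):
        import requests  # third-party: pip install requests
        response = requests.request(method, url, data=data, headers=headers)
        response.raise_for_status()
        return response.content
```

With a subclass like this, the rest of PyChef (signing, search, node objects) is unchanged; only the HTTP layer is swapped.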

@andrewgross
Author

Just tried swapping out urllib2 for requests. It still seems to keep the KQUEUE open. I can fetch a URL with requests outside of PyChef with no issues; only when using requests inside PyChef do I see the KQUEUE bug.

Possibly there are additional libraries causing the bug while constructing / signing the request?

@coderanger
Owner

Hmm, the background API stack in a thread local is certainly possible. If you manually clear that list out before forking, does it help? (On a train so I can't get the code ref for you; the thread local is at the top of api.py though.)
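What clearing that out before the fork might look like, as a hedged sketch: the thread-local below is a stand-in for the one at the top of chef/api.py, and its attribute layout here is an assumption for illustration, not PyChef's exact code.

```python
import threading

# Stand-in for PyChef's module-level thread local (top of chef/api.py).
api_stack = threading.local()
api_stack.value = ['<ChefAPI #1>', '<ChefAPI #2>']  # pretend APIs were pushed

def clear_api_stack(stack):
    # Drop stacked ChefAPI references before fork/exec so their
    # connections can be garbage-collected and closed.
    if hasattr(stack, 'value'):
        del stack.value

clear_api_stack(api_stack)
```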

@andrewgross
Author

If I take the headers and URL from PyChef and run them with requests in another terminal, I do not see the issue. Additionally, if I put the debugger just before the remote call is made inside PyChef, I do not see the KQUEUE file descriptor open. At this point I cannot figure out why it has issues when using requests inside PyChef but is fine performing the same call outside of PyChef.

@coderanger
Owner

@andrewgross It's probably either something in how I am using OpenSSL or how I keep resources in weakrefs on the thread-local stack.

@andrewgross
Author

Here is a gist of the changes I made to have requests working for .get() requests: https://gist.github.com/andrewgross/8922636

Definitely some weird action at a distance stuff. The issue doesn't appear until I actually make the remote call. I can show the url and request_headers without it cropping up.

@coderanger
Owner

Can you post the diff too? It's hard to see what's changed otherwise.

@andrewgross
Author

I added the original as an edit to the first revision, so the ADD and DELETE are reversed, but it should still be helpful:

https://gist.github.com/andrewgross/8922636/revisions

Just to be clear, I only made this work for .get() requests for testing; it doesn't handle .post().

@andrewgross
Author

The workaround to the python bug I submitted fixes the issue for me:

http://bugs.python.org/issue20585

I may still keep working on an alternate python chef implementation on the side. Mostly due to curiosity, but also because I got the RSA signing working with M2Crypto, which I feel was the harder part.

Thanks for all the help.

Edit: For anyone finding this if the link doesn't work, the fix is to run export no_proxy='*' in your shell to change your environment so that urllib2 cleans up after itself.
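The same workaround can be applied from inside Python rather than the shell; it needs to take effect before the first urllib2 request, since the point is to skip the OS X proxy lookup that registers the kqueue:

```python
import os

# Equivalent of `export no_proxy='*'`: bypass proxy detection for all
# hosts so urllib2 never consults the OS X system proxy configuration.
os.environ['no_proxy'] = '*'
```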

@flipdazed

I have had this problem whilst using requests to systematically page through the free Twitter search API. It simply hangs:

searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra'
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:543205767046635519
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:542981583989268479
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:542746105017274367
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:542497788248866815
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:542432302848942080
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:542394308188311552
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:542076491237187583
searching for: 'NZ Dairy since:2014-12-08 -from:Fonterra before id:541888788373323775
Killed: 9

I've just used export no_proxy='*' in the terminal shell but it didn't work.

Fix:
I realised I had a misplaced line of code which should have been after my search loop: I was iterating through the list of every tweet after each search rather than doing it just once at the end. I think this was causing the error, because it no longer occurs after the fix.
