
Add option to disable AutoAddPolicy for SSH backend and use known_hosts and ssh config file #1683

Closed
mxmlnkn opened this issue Sep 19, 2024 · 11 comments


@mxmlnkn (Contributor) commented Sep 19, 2024

AutoAddPolicy makes the user vulnerable to man-in-the-middle attacks. I think it should not be enabled by default, and/or there should be an option to disable/enable it. If the user-wide ~/.ssh/known_hosts file is correctly loaded, then AutoAddPolicy should likely not be necessary in the first place. With OpenSSH, this weakened host key checking has to be requested explicitly with ssh -o StrictHostKeyChecking=no ... or via the SSH config file.
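For comparison, OpenSSH's closest analogue of AutoAddPolicy is strictly opt-in; a made-up per-host example:

```
# ~/.ssh/config ("trusted-lan-box" is an example host name)
Host trusted-lan-box
    # "accept-new" adds unknown host keys automatically but still rejects
    # changed keys; plain "no" disables the check entirely.
    StrictHostKeyChecking accept-new
```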

Loading the SSH config file is yet another feature request of mine. For example, ssh cluster currently works because I have configured an alias for it in my ~/.ssh/config, but fsspec.open("ssh://cluster") does not, because paramiko does not automatically load the usual OpenSSH files even though it has helper APIs to load them manually.

@martindurant (Member)

I am no expert at SSH, so I would be happy to see reasonable changes like this in a PR. Also, be aware that sshfs might be more fully featured.

@mxmlnkn (Contributor, Author) commented Sep 19, 2024

I have opened a PR. Thanks for the pointer to the Python sshfs package based on asyncssh. I didn't see it mentioned in the API reference's list of known implementations. Having so many alternatives is confusing. Should the built-in one be deprecated then (if and only if sshfs turns out to be much more feature-complete)?

@martindurant (Member)

> I didn't see this mentioned

This is definitely an omission!

> Should the built-in one be deprecated then

I don't directly maintain sshfs; I would be happy to have it supersede the paramiko implementation if they think there would be no lost functionality.

@mxmlnkn (Contributor, Author) commented Sep 19, 2024

> Should the built-in one be deprecated then

> I don't directly maintain sshfs; I would be happy to have it supersede the paramiko implementation if they think there would be no lost functionality.

Any opinions about this @shcheklein @efiop ? Is it feature complete? What was the motivation for that SSHFS implementation? Is it better than the one shipped with fsspec? Is it still being worked on?

I was also wondering about the maintainers with access to the repository, but that seems to have been sorted out in fsspec/sshfs#45.

As for the issues I have:

@shcheklein

> Any opinions about this @shcheklein @efiop ?

I'm fine with that. I don't remember all the reasons behind our decision to implement it, but from what I can recollect it was because paramiko had some edge cases that were hard to fix (e.g. configs), plus performance.

Some background: iterative/dvc#6064
iterative/dvc#5395

and a bunch of other related discussions. @isidentical and @efiop should remember the details better.

> Is it feature complete?

it should be, yes. And it should be more or less straightforward to add other components if needed.

> Is it still being worked on?

Yes, but not full time. It is driven by bugs / issues in DVC. But yes, we are trying to keep an eye on it. I can't guarantee any response time, etc., though.

> This issue with the AutoAddPolicy also affects fsspec/sshfs because known_hosts is set to None.

can we do git blame to see if it was there from the very beginning?

Partially; it probably was driven by DVC needs (and overall the fact that it's a Python lib). There is no interactive mode (like in the SSH CLI client) when you use it. wdyt?

@mxmlnkn (Contributor, Author) commented Sep 22, 2024

Thank you for the feedback. I can obviously confirm the config issues, and it is sad to see that the indirectly linked issue has been open for 8 years by now even though pull requests do exist for this feature. This issue from 6 years ago, complaining about the number of open issues and, worse, the hundreds of open pull requests, mirrors the opinion I formed over the past few days trying to use paramiko.

I can also confirm the performance issues, and I should have done some simple benchmarks of fsspec vs. sshfs before trying to integrate paramiko into my project. I was simply put off by the async usage of asyncssh, while paramiko is 8x more popular on GitHub and has a really nice SFTP API, which can be used almost directly to implement the FUSE callbacks. That's why I tried paramiko first. After some initial slight trouble with its non-standard-conforming file object, I noticed the poor performance, and I don't see an easy way to work around it.

Here is my workaround for the file object bug, because it might also affect users of fsspec.sftp (overrides and MountSource are helpers from my own project):

import io
from typing import IO, Optional

class SFTPFile(io.RawIOBase):
    # For a reference implementation based on RawIOBase, see "class SocketIO(io.RawIOBase)" in:
    #   https://github.com/python/cpython/blob/00ffdf27367fb9aef247644a96f1a9ffb5be1efe/Lib/socket.py#L683
    # or other implementations inside cpython:
    #   https://github.com/python/cpython/blob/00ffdf27367fb9aef247644a96f1a9ffb5be1efe/Lib/_compression.py#L33
    # For the internals of RawIOBase and others, see:
    #   https://github.com/python/cpython/tree/main/Lib/_pyio.py#L619
    # The reasons for this wrapper are:
    #  - paramiko.sftp_file.SFTPFile is incompatible with io.IOBase: https://github.com/paramiko/paramiko/issues/2452
    #  - I need the file object to inherit from io.RawIOBase or else the isinstance(fileobj, io.IOBase) check in
    #    compression.detectCompression will fail!
    #  - Add option to clean up resource MountSource when this file is closed.

    def __init__(
        self, fileObject: IO[bytes], mountSource: Optional[MountSource] = None, closeMountSource: bool = False
    ) -> None:
        io.RawIOBase.__init__(self)
        self.fileObject = fileObject
        self.mountSource = mountSource
        self.closeMountSource = closeMountSource

        self.seekable = self.fileObject.seekable
        self.readable = self.fileObject.readable
        self.writable = self.fileObject.writable
        self.readinto = self.fileObject.readinto
        self.read = self.fileObject.read
        self.tell = self.fileObject.tell

    def __enter__(self):
        return self

    def __exit__(self, exception_type, exception_value, exception_traceback):
        self.close()

    @overrides(io.RawIOBase)
    def close(self) -> None:
        self.fileObject.close()  # also close the wrapped paramiko file object
        if self.closeMountSource and self.mountSource:
            self.mountSource.close()
        io.RawIOBase.close(self)

    @overrides(io.RawIOBase)
    def fileno(self) -> int:
        # This is a virtual Python level file object and therefore does not have a valid OS file descriptor!
        raise io.UnsupportedOperation()

    @overrides(io.RawIOBase)
    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # https://github.com/paramiko/paramiko/issues/2452
        self.fileObject.seek(offset, whence)
        return self.tell()

    @overrides(io.RawIOBase)
    def readall(self) -> bytes:
        # It is necessary to implement this, or else the io.RawIOBase.readall implementation would use
        # io.DEFAULT_BUFFER_SIZE (8 KiB). Notably, this would ignore the block size configured in BufferedReader,
        # when calling read(-1) on it because it thinks that raw.readall is a fast-path, but in this case is ~100x
        # slower than 4 MiB reads equal to the Lustre-advertised block size.
        # https://github.com/python/cpython/issues/85624
        chunks = []
        while True:
            result = self.read()
            if not result:
                break
            chunks.append(result)
        return b"".join(chunks)
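The readall override matters because of the fast path mentioned in the comment: io.BufferedReader.read(-1) delegates to raw.readall(), which ignores the configured buffer_size. This can be demonstrated without any SSH connection using a stand-in raw file that records the size of every read request (TracingRaw is made up for this demonstration):

```python
import io

class TracingRaw(io.RawIOBase):
    """Stand-in for a slow remote file that records each readinto request size."""

    def __init__(self, size: int) -> None:
        self.remaining = size
        self.request_sizes = []

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        self.request_sizes.append(len(buffer))
        n = min(len(buffer), self.remaining)
        buffer[:n] = b"x" * n
        self.remaining -= n
        return n

raw = TracingRaw(1_000_000)
buffered = io.BufferedReader(raw, buffer_size=4 * 1024 * 1024)
data = buffered.read(-1)  # delegates to io.RawIOBase.readall

assert len(data) == 1_000_000
# The underlying requests stay at io.DEFAULT_BUFFER_SIZE (8 KiB),
# far below the configured 4 MiB buffer.
print(max(raw.request_sizes))
```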

Here is a very simple performance comparison that might already be better than nothing for fsspec/sshfs#2:

import time

import fsspec.implementations.sftp

fs = fsspec.implementations.sftp.SFTPFileSystem("127.0.0.1")
for i in range(5):
    t0 = time.time()
    size = len(fs.open('silesia.tar.gz').read())
    t1 = time.time()
    print(f"Read {size} in {t1-t0:.2f} s -> {size/(t1-t0)/1e6:.2f} MB/s")
# Read 68238807 in 16.93 s -> 4.03 MB/s
# Read 68238807 in 16.74 s -> 4.08 MB/s
# Read 68238807 in 16.74 s -> 4.08 MB/s
# Read 68238807 in 16.75 s -> 4.07 MB/s
# Read 68238807 in 16.70 s -> 4.09 MB/s

import sshfs

fs = sshfs.SSHFileSystem("127.0.0.1")
for i in range(5):
    t0 = time.time()
    size = len(fs.open('silesia.tar.gz').read())
    t1 = time.time()
    print(f"Read {size} in {t1-t0:.2f} s -> {size/(t1-t0)/1e6:.2f} MB/s")
# Read 68238807 in 2.06 s -> 33.18 MB/s
# Read 68238807 in 2.07 s -> 32.99 MB/s
# Read 68238807 in 2.04 s -> 33.43 MB/s
# Read 68238807 in 2.01 s -> 33.93 MB/s
# Read 68238807 in 2.04 s -> 33.48 MB/s

So yeah, paramiko is ~8x slower than asyncssh even for this simplest of read-only usages. This seems to be a problem with the paramiko.sftp_file.SFTPFile abstraction, because a simple paramiko.sftp_client.SFTPClient.get reaches speeds of ~100 MB/s, similar to calling scp in a subprocess. Compared to that, reading via the file object is 25x slower, which basically makes it unusable except for toy projects or maybe small text files. There are lots of similar performance issues open for paramiko, which are helpfully collected here.

> Yes, not full time. Driven by bugs / issues in DVC. But yes, we are trying to keep an eye on it. I can't guarantee any response time, etc though.

That sounds good enough. Thank you for your timely response to my impertinent questions.

> This issue with the AutoAddPolicy also affects fsspec/sshfs because known_hosts is set to None.

> can we do git blame to see if it was there from the very beginning?

GitHub can do git blame, and it shows this refactor commit from 3 years ago, which is basically at the project's birth. So yeah, it seems to have been there from the beginning.

> partially, probably was driven by DVC needs (and overall the fact that it's a python lib). There is not interactive mode (like in the SSH CLI client) when you use it. wdyt?

The non-interactivity is indeed a problem. For my use case, the known hosts do not change every day, so I would add the host as a known host by logging in once via (interactive) ssh before using any non-interactive tool. I can see that there are other use cases and that disabling this verification is simply the easiest way to make it work for all of them, but I still think that it should at least be configurable. And if there is no interactivity, the connection can only succeed or fail (ideally with a helpful error message suggesting to add the host as a known host).

@martindurant (Member)

I'm not sure what I ought to do on our end here. @mxmlnkn, would you recommend raising a DeprecationWarning when instantiating fsspec's paramiko backend, with a view to removing it in the future?

@mxmlnkn (Contributor, Author) commented Sep 24, 2024

Hm, I guess there are also arguments against deprecating it, because paramiko and asyncssh are two very different backends. I guess I'm fine with nothing happening at all. I'll probably simply switch to fsspec/sshfs, set a fixed known_hosts, and do the URI parsing and config loading myself.

@mxmlnkn closed this as completed Sep 24, 2024
@martindurant (Member)

OK. Can I ask:

> I was simply put off by the async usage of asyncssh

Is there any downside to this? The sync API should hide the implementation detail from you unless you need async.

@mxmlnkn (Contributor, Author) commented Sep 24, 2024

> OK. Can I ask:
>
> > I was simply put off by the async usage of asyncssh
>
> Is there any downside to this? The sync API should hide the implementation detail from you unless you need async.

No, you are completely right that the sync API of sshfs perfectly hides the async API of asyncssh. But I was trying to use and benchmark asyncssh directly, to have a fairer comparison with paramiko, which I was already using directly, and I simply have not worked with async at all yet, so I failed to get asyncssh to work. I am not aware of any intrinsic downsides of async, e.g., in terms of performance. It's just that it is hard to use and has a "viral" effect. Then again, how does sshfs hide the async API in the first place? I saw some solutions that had to spin up a separate thread. That could indeed have performance or thread-safety implications...
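For what it's worth, the usual sync-over-async pattern (and, as far as I can tell, what fsspec's async machinery does as well) is exactly that: run one event loop in a dedicated daemon thread and submit coroutines to it from synchronous code. A minimal sketch, with fetch() standing in for an asyncssh coroutine:

```python
import asyncio
import threading

# One background event loop serving all synchronous callers.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def fetch() -> bytes:
    # Stand-in for an asyncssh coroutine, e.g. reading a remote file.
    await asyncio.sleep(0.01)
    return b"data"

# Synchronous code blocks on a concurrent.futures.Future; no async/await needed.
future = asyncio.run_coroutine_threadsafe(fetch(), loop)
print(future.result(timeout=5))  # b'data'
```

The thread-safety caveat is real: objects owned by the loop must only be touched from the loop thread, which is why such wrappers funnel every call through run_coroutine_threadsafe.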

@mxmlnkn (Contributor, Author) commented Oct 5, 2024

> Is there any downside to this? The sync API should hide the implementation detail from you unless you need async.

fsspec/sshfs#53
