As a developer, I need a proxy server, so that I can handle streaming data #78

Open
4 tasks
jefffohl opened this issue Jan 14, 2016 · 28 comments

@jefffohl
Member

Motivation: a proxy server can be used to stream remote files from servers that do not support streaming themselves, and to implement seek (skip N bytes) in files or streams from JS.

This will be used in: #17

Tasks:

  • Handle local files
  • Handle remote files
  • Implement server-side CSV parsing?
  • When the app is hosted on a remote server, how do we handle file upload?

jefffohl self-assigned this Jan 14, 2016
jefffohl added this to the v1.0 milestone Jan 14, 2016
@breznak
Member

breznak commented Jan 14, 2016

@jefffohl before you begin implementing this feature with a proxy server from npm, would you consider a more difficult but more far-reaching change: improving Papa itself? The author said he won't implement it himself, but he is not against the solution: mholt/PapaParse#49 (comment)

@jefffohl
Member Author

@breznak - I would, but I don't think it is possible within the browser. The FileReader only takes a snapshot of the file at the time of the request, and it will not allow the browser to read the file again without a user interaction - most likely for security reasons.

For some reason, Chrome's handle on the file does allow you to read the file size as the file is updated, but the data in the file cannot be retrieved (this is what confused me for a day). See the HTML5 spec: https://www.w3.org/TR/FileAPI/#file

I believe this is the reason that the PapaParse developer is not willing to work on it - it is simply not possible.

@jefffohl
Member Author

Actually, it appears that it might be possible to reload the file if we see that it has changed, but we would have to reload the entire file - which defeats our original purpose, because then we would have to reload all of the data, and on large files that would be very inefficient. What we need is the ability to read just the portion of the file that has changed.

@breznak
Member

breznak commented Jan 14, 2016

Thank you for the spec!
I still don't see how your workaround should work (or why Papa's shouldn't):

  • f is a reference to File object:
    • byte content is a snapshot of the file at creation of the reference
    • may or may not provide the last-modified and size attributes (are these frozen in the snapshot too, or do they change dynamically?)
  • if we can get these attributes, poll them periodically (otherwise raise an exception; partial support is still better than nothing)
  • if the file changed, efficiently re-read only the new data
    • in the spec I don't see a seek (= skip to the Nth byte) or read-with-offset method; I don't know how you plan to work around this?
    • if such a method exists, update and loop.
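
For comparison, on the Node side the "re-read only the new data" step is straightforward, because `fs.readSync` accepts a file position. (In the browser, the File API does define `Blob.slice(start, end)`, which `File` inherits, for reading a byte range, though as noted above browsers may refuse to read a file that changed after it was selected.) Below is a minimal sketch, assuming the file only grows by appending; `readAppended` and its return shape are hypothetical names for illustration, not part of Papa Parse or this project:

```javascript
const fs = require('fs');

// Read only the bytes appended to `filePath` since `offset`.
// Returns the new text and the offset to pass on the next call.
// Assumes the file is append-only; a shrink is treated as a restart.
function readAppended(filePath, offset) {
  const size = fs.statSync(filePath).size;
  if (size < offset) offset = 0;              // file was truncated or replaced
  if (size === offset) return { text: '', offset };

  const length = size - offset;
  const buffer = Buffer.alloc(length);
  const fd = fs.openSync(filePath, 'r');
  try {
    // The last argument is the position in the file to start reading from.
    const bytesRead = fs.readSync(fd, buffer, 0, length, offset);
    return {
      text: buffer.toString('utf8', 0, bytesRead),
      offset: offset + bytesRead,
    };
  } finally {
    fs.closeSync(fd);
  }
}
```

A caller would keep the returned `offset` between polls, and would probably want to hold back any trailing partial line before handing the text to the CSV parser.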

@jefffohl
Member Author

The workaround - of reloading the entire file whenever the file size changes - can be seen here: http://stackoverflow.com/questions/22548683/reloading-a-file-using-html-input

@breznak
Member

breznak commented Jan 14, 2016

I'm not sure I understood the SO solutions correctly, but these ideas might work:

  1. If Papa is that fast ( https://jsperf.com/javascript-csv-parsers/4 ), just re-read the whole file. This could be implemented in Papa's chunk call (with a "monitor=true" argument) so that it returns only the diff (to the graph for rendering). It is suboptimal, but would still be a huge simplification for many problems.
  2. Assuming we may destroy the source file, we can read it, delete the part already read, wait, and loop, which saves re-reading the known bytes. This could also be implemented in Papa.
  3. How about Baby? https://github.com/Rich-Harris/BabyParse Does Node.js provide more "local app" privileges, allowing us to access a file more directly?
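
Idea 1 (re-read everything, pass on only the diff) could look roughly like the sketch below. It is independent of any parser: `newLinesSince` is a hypothetical helper, and it assumes rows are only ever appended and that no CSV field contains an embedded newline.

```javascript
// Given the full text of a re-read CSV file and the number of lines already
// delivered (the header counts as a line), return only the lines not seen
// before, plus the updated count for the next poll.
function newLinesSince(fullText, linesSeen) {
  const lines = fullText.split(/\r?\n/);
  // A trailing newline yields a final empty string; a missing one means the
  // last line may still be being written. Hold the last element back either way.
  lines.pop();
  return { rows: lines.slice(linesSeen), linesSeen: lines.length };
}
```

The cost is still a full read on every poll; only the parsing and rendering of already known rows is avoided.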

@jefffohl
Member Author

Node runs on the server, so it has all the permissions that you want to give it. So, yes, it can access any file on the system.

@jefffohl
Member Author

So - yes, we could use Baby Parse and parse files on the server side instead of in the browser.

@breznak
Member

breznak commented Jan 14, 2016

4. (We should follow @rhyolight's advice and...) delegate this to another project, with a different (lower) level of integration, that can easily take care of the file updates and provide us with only a diff file, which we would re-read quickly and append to our data.
E.g. a script (some multiplatform code?) like

while true; do cp myFile myFile.old; sleep 5; diff myFile.old myFile > update; done

@jefffohl
Member Author

delegate to what other project?

@breznak
Member

breznak commented Jan 14, 2016

Well, could we just require a diffs-only file (not the whole updated file) as input for monitoring?
Or provide a simple utility (in Java, ...) to do the diffs at intervals for us, as above.
Or truncate the file (not sure browser JS can do that?)

@breznak
Member

breznak commented Jan 14, 2016

Can we make the (client, browser) app a "server-like app" that has REST API?
https://stackoverflow.com/questions/921942/javascript-rest-client-library
So we could have an update(data) method callable through REST PUT? This was the idea in #42

Sorry, this was just a brainstorm/shitload 😛 of ideas, not sure which are doable or suitable for us..?

@jefffohl
Member Author

I am imagining the server will have some REST-like features, but it will probably be just GET.

Why would you need PUT, if we are just reading CSV files?

@jefffohl
Member Author

The server I am imagining will be pretty basic. It will handle the following functions:

  • Serving static files (index.html, JS, CSS, etc.)
  • Handling files on the local file system, and allowing us to stream them.
  • Acting as a proxy server for retrieving remote files from other servers.
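
One way to picture these three responsibilities is as a small routing function in front of the handlers. This is only a sketch: the `/local` and `/proxy` paths and the `path` and `url` query parameters are assumptions made for illustration, since the thread does not define the server's URL scheme.

```javascript
// Decide which of the three jobs a request belongs to.
// The route names and query parameters here are hypothetical.
function route(requestUrl) {
  // The base is only needed so that relative request URLs can be parsed.
  const url = new URL(requestUrl, 'http://localhost');
  if (url.pathname === '/local') {
    return { kind: 'local', target: url.searchParams.get('path') };
  }
  if (url.pathname === '/proxy') {
    return { kind: 'proxy', target: url.searchParams.get('url') };
  }
  // Everything else is treated as a static asset (index.html, JS, CSS, ...).
  const file = url.pathname === '/' ? '/index.html' : url.pathname;
  return { kind: 'static', target: file };
}
```

With only GET in play, each `kind` would map to one read-only handler: serve a file from the app directory, stream a local file, or fetch and relay a remote one.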

@jefffohl
Member Author

All that said, if we can define an abstract purpose for the server outside of the needs of this particular app, we could make it a separate project/repo.

@breznak
Member

breznak commented Jan 14, 2016

Yes, I think that's good functionality for the server. Let me double-check that I understand the advantages:
it allows streaming remote files (even if the other server does not support that feature), plus streaming local files (does it solve the problem discussed here of avoiding re-reading the whole file in monitoring mode?)

My REST idea is probably a separate feature, allowing updates to be "sent" via REST calls (which allows integration with the many web services that are RESTful, like RiverView), in addition to updates by writing to a file.

@jefffohl
Member Author

Yes, your understanding is correct. And yes, it will solve the problem of avoiding the need to re-read the entire file each time it is updated. We will be able to have a server-side file handle that will allow us to read the file.

@breznak
Member

breznak commented Jan 15, 2016

👍 🆒

@jefffohl
Member Author

One thing that this brings up is that the experience will differ depending on whether the app is hosted locally or remotely (e.g. on a public web server). If the app is hosted locally, we can access and stream local files with continuous updates. If the app is hosted remotely, the only interface we will have to uploaded files (from the user's computer) will be through the FileReader interface, which, as we know, has the limitation of not allowing us to update the data continuously.

I am hoping that I can make a somewhat elegant user experience that will automatically detect if the file is available on the same file system that the server is running on, and accept continuous updating. If the file is not available, the server will assume that the file is being sent remotely, and simply read a snapshot of the file.

@jefffohl
Member Author

Something that I forgot about is that the File API exposes only basic metadata about the file, such as its name and size. It won't tell us the local path to the file, so the server won't be able to find the file.

The alternative is for the user to know the relative path to the local file, and enter that in as a string (the same way that they might enter a URL). The server could see that it is a local path, and then retrieve the file. Again, in this situation, if the app were hosted remotely, a local path would not work. And, now that I think about it, this would be a big security hole if the app were hosted on a public web server, because it would allow users to type in any local path, which would then tell the server to retrieve that file from the local (server's) file system, which is of course a very bad idea.
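
If local paths were ever accepted, one common mitigation is to confine them to a single data directory and reject anything that resolves outside it. The sketch below is illustrative only: `resolveInside` and the idea of a configured root directory are assumptions, not something the thread specifies, and the check does not follow symlinks.

```javascript
const path = require('path');

// Resolve a user-supplied path against `rootDir` and refuse anything that
// ends up outside it (for example via "../" segments or an absolute path).
function resolveInside(rootDir, userPath) {
  const root = path.resolve(rootDir);
  const resolved = path.resolve(root, userPath);
  const relative = path.relative(root, resolved);
  const escapes =
    relative === '..' ||
    relative.startsWith('..' + path.sep) ||
    path.isAbsolute(relative);        // e.g. a different drive on Windows
  if (escapes) {
    throw new Error('Path is outside the allowed directory: ' + userPath);
  }
  return resolved;
}
```

This narrows the hole rather than closing it; on a public server, disabling local paths entirely remains the safer choice.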

So, now I need to re-think all of this. Sorry, I should have gone through this logic earlier.

@jefffohl
Member Author

So, I've been thinking this over, and I don't see a solution that would involve using the "Browse..." button to allow the user to locate a local file and load it into the app for online streaming. JavaScript in the browser is, by design, sandboxed for security reasons.

We could remove the "Browse..." button and require that all users supply a path to the file they would like to stream - either a local filepath, or a URL to a file hosted on a remote web server. If anyone ever wanted to host this app on a public server, they would need to disable the ability to supply a local filepath, and make the app only accept full URLs. This could be set in the server config.
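
A rough sketch of how the server might tell the two kinds of input apart and honor such a config setting follows. The `allowLocalPaths` option name and the `classifySource` helper are hypothetical, and treating anything that does not start with `http://` or `https://` as a local path is a simplifying assumption.

```javascript
// Classify what the user typed into the source field.
// `config.allowLocalPaths` stands in for the server config setting
// described above; its name is an assumption.
function classifySource(input, config) {
  const value = input.trim();
  if (/^https?:\/\//i.test(value)) {
    return { kind: 'url', value };
  }
  if (!config.allowLocalPaths) {
    throw new Error('Local file paths are disabled on this server; supply a full URL.');
  }
  return { kind: 'local', value };
}
```

A locally run instance would start with `allowLocalPaths` set to true, while a public deployment would leave it false so that only full URLs are accepted.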

@breznak what are your thoughts?

@breznak
Member

breznak commented Jan 16, 2016

@jefffohl ..true about the sandboxed limitation of JS, so does this mean the publicly hosted app will not be usable? And users will have to provide a path to the file, rather than selecting it with the file browser?

And we are doing it for the "monitoring" support, right?

If so, I'd suggest: A) just require that the provided input file contains only a diff with the new updates, rather than all the points with the updates appended, so we can re-read the whole file each time (no proxy server). Or B) keep the current functionality and have a config that allows the proxy server and imposes the limitations you mention.

@jefffohl
Member Author

@breznak - Yes, this is all being done for the monitoring support. What we do depends on what the typical use case is. If this app is most typically run locally on the same machine that is producing the file to read, then using a proxy server will probably be the best option, as it will allow for monitoring both remote and local files.

We can set up the server so that in order for it to monitor local files, the server needs to be started with a special flag. This way the user has to explicitly decide to allow that option, and will therefore hopefully understand that it should not be done on a public web server.

So - we can offer three ways of accessing a file:

  • A URL to a publicly available file. Can be monitored for updates by using a proxy server. Available whether the app is hosted locally or on a public web server.
  • A path to a local file. Can be monitored for updates by using the proxy server. This feature should not be enabled on public web servers.
  • A "Browse..." button that will allow the user to upload a file from their local file system. This file will be handled using the JavaScript FileReader API, which means that we can monitor the file (in Chrome at least), and do a kind of updating that is non-optimal - meaning that we periodically re-upload the entire file, and then slice off the diff and append that to our data in the chart.

For all of these, I still need to test these methods out to make sure I am not missing something important that would prevent our success.

@breznak
Member

breznak commented Jan 16, 2016

How about this setting:

  • without a proxy
    • functionality as now.
    • monitoring works only if the provided file contains only diffs, so we read it
      periodically and append to the plotted data
  • with proxy (can this be detected automatically?)
    • works like you described (maybe we could disable monitoring when
      FileReader is used)
    • (should security be a concern? running the server with privileges to access
      any file on the local FS...)

@jefffohl
Member Author

I am not sure what you mean - you want to make the server optional?

@breznak
Member

breznak commented Jan 17, 2016

..I was thinking that. Would it be too much work? We could provide basic monitoring functionality by default, detect the server, and if the proxy is present, use its features for file streaming.

@jefffohl
Member Author

It is more work. We have to have a server to serve the static resources anyway, so I don't see a benefit at this time. If, in the future, there appears to be a need for decoupling the app from the server then we can work on that feature at that time. As always, I would like development to be driven by real-world needs.

@breznak
Member

breznak commented Jan 19, 2016

Especially here, I think we have clear real-world use cases: NuPIC live monitoring of a running model, and RiverView...
