Discuss: LazyFrame.plot? #13339
I think this is not worth the effort. If someone wants to create a plot, they can just materialize and then plot. It seems like it would be a lot of work for a probably uncommon case.
Remember that the plots are not static. They are interactive if you select the Bokeh or Plotly backend. They might even contain widgets to filter the data shown. This enables browsing through data that is too large to be shown in a browser, or even to be held in memory. When you use .plot, the plots are not materialised; they are just configurations/definitions. As a user, you might continue changing the color or the kind of plot, or filter to a subsection of the data. The plot is not materialized until it's the output of a cell in a notebook. I don't understand why it's risky not to materialize explicitly before or after.
Thanks for explaining. What I'm concerned about is the risk that someone does

    df = (
        pl.scan_parquet(...)
        .group_by(...)
        # more expensive operations
    )

and then, in the next cell of their notebook,
The plot is shown, and then they realise that, actually, they want to colour by plant species, and so they do
in the next cell. If within Does this make sense / seem like a valid concern?
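A toy model may make this concern concrete. The sketch below uses a hypothetical `ToyLazyFrame` class and `plot_points` helper (neither is a Polars or hvPlot API) to count how often the expensive part of a lazy query runs when each notebook cell plots straight from the lazy object, compared with collecting once and reusing the result.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


class ToyLazyFrame:
    """Hypothetical stand-in for a lazy query: it stores a recipe, not data."""

    def __init__(self, recipe: Callable[[], List[Row]]) -> None:
        self._recipe = recipe
        self.executions = 0  # number of times the expensive recipe has run

    def collect(self) -> List[Row]:
        # Every call re-runs the whole recipe; nothing is cached.
        self.executions += 1
        return self._recipe()


def expensive_query() -> List[Row]:
    # Stands in for scanning a file, grouping, and other costly steps.
    return [
        {"species": "setosa", "length": 1.4, "width": 0.2},
        {"species": "virginica", "length": 6.0, "width": 2.5},
    ]


def plot_points(rows: List[Row], x: str, y: str,
                color: Optional[str] = None) -> List[Tuple]:
    """Hypothetical 'plot': returns the tuples that would be drawn."""
    return [(r[x], r[y], r[color] if color else None) for r in rows]


# Pattern A: each notebook cell plots directly from the lazy object.
lazy_a = ToyLazyFrame(expensive_query)
plot_points(lazy_a.collect(), "length", "width")
plot_points(lazy_a.collect(), "length", "width", color="species")
runs_when_plotting_lazily = lazy_a.executions

# Pattern B: collect once, then tinker with the plot.
lazy_b = ToyLazyFrame(expensive_query)
rows = lazy_b.collect()
plot_points(rows, "length", "width")
coloured = plot_points(rows, "length", "width", color="species")
runs_after_single_collect = lazy_b.executions

print(runs_when_plotting_lazily, runs_after_single_collect)
```

In this model the first pattern runs the expensive step once per plot, while the second runs it once in total. How a real plotting backend behaves would depend on how it integrates with the lazy engine.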
Just to reply to:
This is a common situation, and I think we should trust users. Any given user may make this mistake once or twice, but if the computation truly is slow, they will soon figure out that they can first collect the frame and then tinker with their plot. That said, I have no idea how the plotting frameworks could be enabled to use Polars' processing capabilities. This seems like a strong entangling of the libraries that would require updates with every API change, but I remain open-minded.
Thanks for your input! I'd like to think that there's a solution somewhere in between
but don't yet know exactly what it would be. Happy to discuss more and have a think; maybe we could have a group call. It's good to take our time with this one. What impresses me about hvplot is that they currently do self._data.select(columns).collect() for their
In the long run I'd like to push things even further down, i.e. HoloViews should implement a full Polars interface that keeps things lazy until the last possible moment. This would allow things like histogram calculations, bar plot aggregations, and ideally datashader support to never materialize the full data in memory. As for what to do about people accidentally generating plots that are infeasible, either because they trigger huge amounts of computation or a silly number of categories, that's a problem we have to find a more general solution for at the hvPlot level. The unfortunate problem that we've always had is that if the data is huge then even a
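As a rough illustration of why an aggregation such as a histogram does not need the full data in memory, the sketch below accumulates bin counts batch by batch. It is plain Python under stated assumptions: `generate_batches` and `streaming_histogram` are hypothetical helpers, not part of HoloViews, hvPlot, or Polars, and the batches stand in for chunks that a lazy engine might read from disk.

```python
from typing import Iterable, Iterator, List


def generate_batches(n_rows: int, batch_size: int) -> Iterator[List[float]]:
    """Yield small batches of values, standing in for lazily read chunks."""
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        yield [float(i % 10) for i in range(start, stop)]


def streaming_histogram(batches: Iterable[List[float]],
                        lo: float, hi: float, n_bins: int) -> List[int]:
    """Count values per equal-width bin while holding only one batch at a time."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for batch in batches:
        for value in batch:
            if lo <= value <= hi:
                # Values equal to `hi` are placed in the last bin.
                index = min(int((value - lo) / width), n_bins - 1)
                counts[index] += 1
    return counts


# Only the five bin counts, not the 100 input values, need to reach the plot.
counts = streaming_histogram(generate_batches(100, 8), lo=0.0, hi=10.0, n_bins=5)
print(counts)
```

Only the small `counts` list would have to be handed to a plotting library here; the design choice is to push the reduction to where the data is read.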
I've thought about this a bit, and it's similar to #13928. Materializing might be unnecessarily expensive if we're just trying to show some aggregations in a plot. With that in mind, adding
I think holoviz/holoviews#5939 should maybe be considered a necessary precursor. Because with that, the
I'm just breaking this off from #13238
Here are the comments made so far:
@MarcoGorelli:
@MarcSkovMadsen
@MarcoGorelli
@ritchie46
@jbednar
@MarcoGorelli
@hoxbro
I did discuss an API kind of like
with Ritchie, so that predicate and projection pushdown can happen and you only read from the parquet file the columns you need to make the plot.
There is a risk though that people will repeatedly make cells like this and so re-trigger the scan_parquet part multiple times, and plots in particular tend to incentivize interactive development. But for running jobs which create reports, or for plotting larger-than-memory datasets, this could unlock value.
One to think about. This would require some precursor work in hvPlot before it could be added to Polars.
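The idea of reading only the columns a plot needs can be sketched with a small stand-in. The `ToyColumnarFile` class and `scatter_data` function below are hypothetical and written in plain Python; they are not the API that was discussed, only a model of a plot request that names its columns up front so the reader can skip everything else.

```python
from typing import Dict, List, Optional


class ToyColumnarFile:
    """Hypothetical stand-in for a columnar file such as Parquet."""

    def __init__(self, columns: Dict[str, List]) -> None:
        self._columns = columns
        self.columns_read: List[str] = []  # log of columns actually read

    def read(self, names: List[str]) -> Dict[str, List]:
        # Only the requested columns are touched.
        self.columns_read.extend(names)
        return {name: self._columns[name] for name in names}


def scatter_data(file: ToyColumnarFile, x: str, y: str,
                 by: Optional[str] = None) -> Dict[str, List]:
    """Work out which columns the plot needs, then read just those."""
    needed = [x, y] + ([by] if by is not None else [])
    return file.read(needed)


file = ToyColumnarFile({
    "length": [1.4, 6.0],
    "width": [0.2, 2.5],
    "species": ["setosa", "virginica"],
    "notes": ["unused", "unused"],
})

data = scatter_data(file, x="length", y="width")
print(file.columns_read)
```

In this model the `species` and `notes` columns are never read for an uncoloured scatter plot; asking for `by="species"` would add one more column to the read.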