add better print() method #255

Open
hannes opened this issue Aug 23, 2024 · 8 comments

Comments

@hannes
Contributor

hannes commented Aug 23, 2024

Currently, print()-ing a lazy df triggers materialization. Let's add a custom print method that avoids this, so we don't trigger materialization accidentally.

@hannes
Contributor Author

hannes commented Aug 23, 2024

We could, for example, just show the rel tree and the result schema.

@krlmlr
Member

krlmlr commented Aug 23, 2024

I'm advertising a drop-in replacement. If we do this, it would have to be an option in my book:

  • default behavior: full materialization
  • fast option: show first 10 rows
  • super-fast option: behavior as in arrow

Would that work for you?

@joakimlinde
Contributor

joakimlinde commented Aug 25, 2024

Advanced R (Hadley), 3.6.3 Printing:

One of the most obvious differences between tibbles and data frames is how they print. I assume that you’re already familiar with how data frames are printed, so here I’ll highlight some of the biggest differences using an example dataset included in the dplyr package:

  • Tibbles only show the first 10 rows and all the columns that will fit on screen. Additional columns are shown at the bottom.
  • Each column is labelled with its type, abbreviated to three or four letters.
  • Wide columns are truncated to avoid having a single long string occupy an entire row. (This is still a work in progress: it’s a tricky tradeoff between showing as many columns as possible and showing columns in their entirety.)
  • When used in console environments that support it, colour is used judiciously to highlight important information, and de-emphasise supplemental details.

Maybe mimicking the existing behavior solves the issue. For data frames you get a full materialization and for tibbles you just get the first rows and the first columns.

@hannes
Contributor Author

hannes commented Sep 2, 2024

Printing 10 rows can already trigger massive compute, e.g. with joins or aggregations. I would just show the schema, arrow style. And I would argue a drop-in replacement does not have to replicate print() behavior.

@hadley
Member

hadley commented Sep 25, 2024

IMO we need to stick with the current print method or it's going to be too foreign for existing dplyr users.

OTOH we might want to consider an explicit "lazy" mode that you could opt-in. In lazy mode, you'd require an explicit collect() to do computation. (Inspired somewhat by polars' lazy and eager modes)

@DavisVaughan
Member

It's possible that opt-in "lazy" mode could just be a global option that changes the print method, like options(duckplyr.lazy_print = TRUE). If the only two ways to eagerly evaluate the query are:

  • Printing
  • Calling collect()

then the idea would be that turning this option on limits it to just collect() as the way to force evaluation.

Typically I'm against any global option that changes the way computations are done, but this one feels ok? i.e. if I hand a script off to my colleague and I had lazy printing turned on and he did not, then that probably won't break the script in any meaningful way.
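
To make the option idea concrete, here is a minimal sketch in base R. It is not duckplyr code: the `lazy_df` class, `new_lazy_df()` and `collect_lazy()` are hypothetical stand-ins, and `duckplyr.lazy_print` is only the option name proposed above. The print method checks the option and, when it is on, shows the schema without running the computation.

```r
# Hypothetical class: a schema plus a function that computes the data.
# None of these names are part of duckplyr's API.
new_lazy_df <- function(schema, compute) {
  structure(list(schema = schema, compute = compute), class = "lazy_df")
}

# Stand-in for collect(): the only explicit way to force evaluation here.
collect_lazy <- function(x) x$compute()

print.lazy_df <- function(x, ...) {
  if (isTRUE(getOption("duckplyr.lazy_print", FALSE))) {
    # Lazy printing: schema only, the computation is never run.
    cat("<lazy_df> (not materialized)\n")
    for (nm in names(x$schema)) {
      cat("  ", nm, ": ", x$schema[[nm]], "\n", sep = "")
    }
  } else {
    # Default: materialize and print the result.
    print(collect_lazy(x), ...)
  }
  invisible(x)
}

x <- new_lazy_df(c(a = "integer"), function() data.frame(a = 1:3))
old <- options(duckplyr.lazy_print = TRUE)
print(x)      # schema only
options(old)
print(x)      # materializes, then prints the data frame
```

Because the option only affects what print() shows, a script that calls the explicit collect step gives the same result with the option on or off.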

@krlmlr
Member

krlmlr commented Sep 25, 2024

Hm... One of the selling points for duckplyr was that touching the data frame (accessing a column or querying the number of rows) also materializes. I don't follow how collect() is the only way to materialize then.

I think we agreed that:

  • Printing should never materialize
  • Printing should push a LIMIT 21 to the computation tree, like dbplyr does
  • We want to double-check if printing is cancellable
  • Lazy printing (opt-in) would only show the schema, arrow-style
  • We could have a class that never materializes on print (Joakim suggests that "tibble" could be that class), no matter what the option says; although this doesn't seem necessary because the underlying behavior is the same in all cases
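
A rough sketch of the LIMIT 21 idea in base R, assuming the 21 is one more than a 20-row print maximum (tibble's defaults print 10 rows once a table exceeds 20). The `preview()` helper and the `fetch` argument are hypothetical; `fetch(n)` stands in for running the query with LIMIT n pushed down.

```r
# Hypothetical helper: `fetch(n)` must return a data frame with at most n rows.
preview <- function(fetch, print_max = 20, print_min = 10) {
  rows <- fetch(print_max + 1)          # 21 rows with the defaults
  truncated <- nrow(rows) > print_max   # the extra row tells us there is more
  shown <- if (truncated) head(rows, print_min) else rows
  list(shown = shown, truncated = truncated)
}

# Two toy sources: one with many rows, one that fits on screen.
big   <- function(n) head(data.frame(x = 1:1000), n)
small <- function(n) head(data.frame(x = 1:15), n)

nrow(preview(big)$shown)    # 10, with truncated = TRUE
nrow(preview(small)$shown)  # 15, with truncated = FALSE
```

Fetching one row beyond the maximum is enough to decide between "print everything" and "print the short form plus a 'more rows' footer" without computing the full result.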

It's worth noting that pushing the LIMIT 21 may change the output order: head(collect(x)) isn't necessarily the same as collect(head(x)) unless we pin the output order (which costs performance again). We can give a hint to that effect when printing.
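
A toy illustration of the ordering caveat, in base R. The two-chunk "engine" is made up for the example and says nothing about how DuckDB actually schedules work; it only shows that a pushed-down limit may see rows in a different order than a full scan, and that pinning the order removes the difference.

```r
# Hypothetical engine: a full scan returns chunk 1 then chunk 2, but a scan
# with a pushed-down limit happens to read chunk 2 first.
chunks <- list(data.frame(id = 1:3), data.frame(id = 4:6))

scan_all   <- function() do.call(rbind, chunks)
scan_limit <- function(n) head(do.call(rbind, rev(chunks)), n)

limit_after  <- head(scan_all(), 2)  # analogue of head(collect(x))
limit_pushed <- scan_limit(2)        # analogue of collect(head(x))

limit_after$id   # 1 2
limit_pushed$id  # 4 5

# Pinning the output order makes both agree, at the cost of sorting all rows.
scan_limit_ordered <- function(n) {
  all_rows <- do.call(rbind, rev(chunks))
  head(all_rows[order(all_rows$id), , drop = FALSE], n)
}
scan_limit_ordered(2)$id  # 1 2
```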

I also want to make sure that the header is printed while the computation is still running so that the user can abort if they're not satisfied with the structure of the output. This will require a tibble/pillar update, but dplyr will also benefit. I wonder if we can already print a footer (perhaps combined with a progress bar) that we erase and overwrite with the printed output, in interactive mode.

@DavisVaughan
Member

One of the selling points for duckplyr was that touching the data frame (accessing a column or querying the number of rows) also materializes

Oh right, duh


5 participants