add better print() method #255

hannes · 2024-08-23T11:04:14Z

Currently, print()-ing a lazy df triggers materialization. Let's have a custom print method that does not, so we don't trigger materialization accidentally.


hannes · 2024-08-23T11:08:35Z

We could, for example, just show the rel tree and the result schema.
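A minimal sketch of the schema-only idea in base R, to make it concrete. Everything here is hypothetical: the `lazy_df_sketch` class, `new_lazy_df()`, and `format_schema()` are illustration names, not duckplyr API, and the rel tree is left out. The point is only that the print method reads a zero-row prototype and never calls the function that would compute the rows.

```r
# Hypothetical stand-in for a lazy data frame: it stores a zero-row prototype
# (the result schema) and a function that would compute the rows on demand.
new_lazy_df <- function(ptype, compute) {
  structure(list(ptype = ptype, compute = compute), class = "lazy_df_sketch")
}

# Build the schema lines from the prototype only; `compute` is never called.
format_schema <- function(x) {
  types <- vapply(x$ptype, function(col) class(col)[[1]], character(1))
  paste0("$ ", names(types), " <", types, ">")
}

# Print method that shows the schema and nothing else.
print.lazy_df_sketch <- function(x, ...) {
  cat("# A lazy data frame (not materialized)\n")
  cat(format_schema(x), sep = "\n")
  invisible(x)
}

# Flag that records whether the expensive computation ever ran.
computed <- FALSE

lazy <- new_lazy_df(
  ptype = data.frame(
    id = integer(), name = character(), score = numeric(),
    stringsAsFactors = FALSE
  ),
  compute = function() {
    computed <<- TRUE
    data.frame(id = 1:3, name = c("a", "b", "c"), score = c(0.5, 1, 1.5))
  }
)

print(lazy)
```

After `print(lazy)`, `computed` is still `FALSE`, which is the property we are after.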

krlmlr · 2024-08-23T18:58:32Z

I'm advertising a drop-in replacement. If we do this, it would have to be an option in my book:

default behavior: full materialization
fast option: show first 10 rows
super-fast option: behavior as in arrow

Would that work for you?

joakimlinde · 2024-08-25T07:46:45Z

Advanced R (Hadley), 3.6.3 Printing:

One of the most obvious differences between tibbles and data frames is how they print. I assume that you’re already familiar with how data frames are printed, so here I’ll highlight some of the biggest differences using an example dataset included in the dplyr package:

Tibbles only show the first 10 rows and all the columns that will fit on screen. Additional columns are shown at the bottom.

Each column is labelled with its type, abbreviated to three or four letters.

Wide columns are truncated to avoid having a single long string occupy an entire row. (This is still a work in progress: it’s a tricky tradeoff between showing as many columns as possible and showing columns in their entirety.)

When used in console environments that support it, colour is used judiciously to highlight important information, and de-emphasise supplemental details.

Maybe mimicking the existing behavior solves the issue. For data frames you get a full materialization and for tibbles you just get the first rows and the first columns.

hannes · 2024-09-02T10:51:59Z

Printing 10 rows can already lead to massive compute, e.g. with joins or aggregations. I would just show the schema, arrow style. And I would argue a drop-in replacement does not have to replicate print() behavior.

hadley · 2024-09-25T12:48:20Z

IMO we need to stick with the current print method or it's going to be too foreign for existing dplyr users.

OTOH we might want to consider an explicit "lazy" mode that you could opt in to. In lazy mode, you'd require an explicit collect() to do computation. (Inspired somewhat by polars' lazy and eager modes.)

DavisVaughan · 2024-09-25T13:44:55Z

It's possible that opt-in "lazy" mode could just be a global option that changes the print method, like options(duckplyr.lazy_print = TRUE). If the only two ways to eagerly evaluate the query are:

Printing
Calling collect()

then turning this option on would leave collect() as the only way to force evaluation.

Typically I'm against any global option that changes the way computations are done, but this one feels ok? i.e. if I hand a script off to my colleague and I had lazy printing turned on and he did not, then that probably won't break the script in any meaningful way.
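A rough sketch of how such an option could gate the print behavior, assuming the print method consults it on every call. `duckplyr.lazy_print` is only the name floated above, not a released option, and `print_mode()` is a hypothetical helper.

```r
# Decide how a print method should behave, based on a global option.
# `duckplyr.lazy_print` is the name proposed in this thread (an assumption).
print_mode <- function() {
  if (isTRUE(getOption("duckplyr.lazy_print", default = FALSE))) {
    "schema_only"  # never compute; an explicit collect() is required
  } else {
    "eager"        # current behavior: printing computes the result
  }
}

print_mode()                                 # option unset: eager printing
old <- options(duckplyr.lazy_print = TRUE)   # opt in, keep the old value
print_mode()                                 # lazy printing
options(old)                                 # restore the previous setting
```

Because the option only affects what print() shows, a script run with or without it should produce the same results, which is why this global option feels acceptable.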

krlmlr · 2024-09-25T13:56:27Z

Hm... One of the selling points for duckplyr was that touching the data frame (accessing a column or querying the number of rows) also materializes. I don't follow how collect() is the only way to materialize then.

I think we agreed that:

Printing should never materialize
Printing should push a LIMIT 21 to the computation tree, like dbplyr does
We want to double-check if printing is cancellable
Lazy printing (opt-in) would only show the schema, arrow-style
We could have a class that never materializes on print (Joakim suggests that "tibble" could be that class), no matter what the option says; although this doesn't seem necessary because the underlying behavior is the same in all cases

It's worth noting that pushing the LIMIT 21 may change the output order: head(collect(x)) isn't necessarily the same as collect(head(x)) unless we pin the output order (which costs performance again). We can give a hint to that effect when printing.
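A sketch of what a limit-based preview with such a hint could look like, in base R. The names are hypothetical: `fetch` stands in for whatever runs the query with the limit pushed down, and `preview()` is not an existing function. I am also assuming that the extra row is only used to detect that more rows exist.

```r
# Sketch of a limit-based preview. `fetch` is a hypothetical function that
# returns at most `n` rows of the result (the "LIMIT n" pushed into the query).
preview <- function(fetch, limit = 21L) {
  rows <- fetch(limit)
  has_more <- nrow(rows) == limit           # a full batch means more rows may exist
  shown <- utils::head(rows, limit - 1L)    # never show the sentinel row
  footer <- if (has_more) {
    "# More rows available; row order may differ from the collected result"
  } else {
    NULL
  }
  list(shown = shown, has_more = has_more, footer = footer)
}

# Stand-in for the engine: apply the limit to an in-memory data frame.
fetch_from <- function(df) function(n) utils::head(df, n)

big <- preview(fetch_from(data.frame(x = 1:1000)))
small <- preview(fetch_from(data.frame(x = 1:15)))

nrow(big$shown)    # 20 rows shown, footer with the ordering hint
nrow(small$shown)  # all 15 rows shown, no footer
```

The in-memory stand-in always returns rows in the same order, so it cannot show the reordering itself; the footer is where the hint would go.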

I also want to make sure that the header is printed while the computation is still running so that the user can abort if they're not satisfied with the structure of the output. This will require a tibble/pillar update, but dplyr will also benefit. I wonder if we can already print a footer (perhaps combined with a progress bar) that we erase and overwrite with the printed output, in interactive mode.

DavisVaughan · 2024-09-25T14:08:49Z

"One of the selling points for duckplyr was that touching the data frame (accessing a column or querying the number of rows) also materializes"

Oh right, duh
