Allow qsv diff to show only fields that differ #2000

Open
mfripp opened this issue Jul 26, 2024 · 5 comments

mfripp commented Jul 26, 2024

Is your feature request related to a problem? Please describe.
In csv files with many columns, it can be difficult and unreliable to find the particular fields that differ between dropped and added rows. This requires carefully scanning across the output, using a grid-oriented csv viewer.

Describe the solution you'd like
One possible solution would be to add a --drop-identical-fields flag (or something similar), which will cause identical fields between a "-" and "+" row to be replaced with either empty values or a flag like "(same)". Then, before outputting the results, any columns that don't have any changes (i.e., the column is entirely full of empty fields or "(same)" markers) will be dropped. So the output file will only contain the key columns and any data columns that actually have differences, and even in those, it will only show values when there are differences. This will make it easy to see exactly what data is different between the two files.
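The two steps described above (blank out equal fields, then drop columns that never change) could be sketched roughly as follows. This is only an illustration on in-memory string rows, not qsv code: `ModifiedPair` and `drop_identical_fields` are hypothetical names, both rows of a pair are assumed to have as many fields as the header, and the empty string is used as the "same" marker.

```rust
/// Hypothetical container for one "-"/"+" pair of a modified row.
/// Assumption: `left` and `right` have the same length as the header.
pub struct ModifiedPair {
    pub left: Vec<String>,
    pub right: Vec<String>,
}

/// Blank out equal non-key fields, then drop columns that never differ.
/// Returns the output header and the output rows ("-" row, then "+" row).
pub fn drop_identical_fields(
    header: &[String],
    key_cols: &[usize],
    pairs: &[ModifiedPair],
) -> (Vec<String>, Vec<Vec<String>>) {
    let ncols = header.len();

    // Pass 1: a column survives if it is a key column or differs in any pair.
    let mut keep = vec![false; ncols];
    for &k in key_cols {
        keep[k] = true;
    }
    for pair in pairs {
        for i in 0..ncols {
            if pair.left[i] != pair.right[i] {
                keep[i] = true;
            }
        }
    }

    let mut out_header = vec!["diffresult".to_string()];
    out_header.extend((0..ncols).filter(|&i| keep[i]).map(|i| header[i].clone()));

    // Pass 2: emit the rows, replacing equal non-key fields with "".
    let mut rows = Vec::new();
    for pair in pairs {
        for (marker, side) in [("-", &pair.left), ("+", &pair.right)] {
            let mut row = vec![marker.to_string()];
            for i in (0..ncols).filter(|&i| keep[i]) {
                let equal = pair.left[i] == pair.right[i];
                if equal && !key_cols.contains(&i) {
                    row.push(String::new());
                } else {
                    row.push(side[i].clone());
                }
            }
            rows.push(row);
        }
    }
    (out_header, rows)
}
```

Note that this sketch has to see every pair before it can write the header, so it cannot stream its output.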

Describe alternatives you've considered
One alternative is to open the result in a spreadsheet and add flags to indicate where differences occur, but this is cumbersome. Currently I just scan visually across pairs of rows, but this is also cumbersome and error-prone.

Another option might be to output a sort of "patch" format, with one row per different field. This could be a table where the first n fields are the index values, the next field is called "column" and gets the name of the field that differed, the next field is called "left_value" and has the value of this field from the left file, and the final field is called "right_value" and has the value from the right file. That might be clearer (no risk of conflict with existing empty fields or fields that already say "(same)"), but I'm not sure it's better.
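To make the "patch" layout concrete, here is a minimal sketch that turns one left/right row pair into one output row per differing field. The function name `patch_rows` is hypothetical, and the sketch assumes both rows have the same length as the header and that key values are taken from the left row.

```rust
/// Build "patch" rows for one modified record: the key values, then
/// `column`, `left_value`, `right_value` for every non-key field that differs.
/// Assumption: `left`, `right`, and `header` all have the same length.
pub fn patch_rows(
    header: &[&str],
    key_cols: &[usize],
    left: &[&str],
    right: &[&str],
) -> Vec<Vec<String>> {
    // The key values are identical on both sides, so read them from `left`.
    let keys: Vec<String> = key_cols.iter().map(|&k| left[k].to_string()).collect();

    let mut out = Vec::new();
    for (i, name) in header.iter().enumerate() {
        // Skip key columns and fields that did not change.
        if key_cols.contains(&i) || left[i] == right[i] {
            continue;
        }
        let mut row = keys.clone();
        row.push(name.to_string());
        row.push(left[i].to_string());
        row.push(right[i].to_string());
        out.push(row);
    }
    out
}
```

Because unchanged fields never appear in the output, this layout cannot collide with fields that are legitimately empty or that already contain "(same)".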

Another option that might be better would be to use color to highlight the columns that are actually different, at least when output is sent to a TTY. This would be similar to the display in the GNU version of the diff command, VS Code's diff view, Apple's FileMerge viewer or vim -d file1 file2.
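A rough sketch of the color idea, using only the Rust standard library: differing fields are wrapped in ANSI color codes, and color is meant to be enabled only when stdout is a terminal. The names `render_pair`, `paint`, and `stdout_wants_color` are hypothetical, the red/green choice is an assumption, and the sketch joins fields with commas without any CSV quoting.

```rust
use std::io::IsTerminal;

const RED: &str = "\x1b[31m";
const GREEN: &str = "\x1b[32m";
const RESET: &str = "\x1b[0m";

/// Wrap `value` in an ANSI color code when `on` is true.
fn paint(value: &str, code: &str, on: bool) -> String {
    if on {
        format!("{}{}{}", code, value, RESET)
    } else {
        value.to_string()
    }
}

/// Render a "-" and a "+" line for one modified row, highlighting only
/// the fields that differ. No CSV quoting is done in this sketch.
pub fn render_pair(left: &[&str], right: &[&str], color: bool) -> (String, String) {
    let mut minus = vec!["-".to_string()];
    let mut plus = vec!["+".to_string()];
    for (l, r) in left.iter().zip(right.iter()) {
        let differs = l != r;
        minus.push(paint(l, RED, color && differs));
        plus.push(paint(r, GREEN, color && differs));
    }
    (minus.join(","), plus.join(","))
}

/// Color is only useful when the output goes to a terminal.
pub fn stdout_wants_color() -> bool {
    std::io::stdout().is_terminal()
}
```

A caller would pass `stdout_wants_color()` as the `color` argument, so piped or redirected output stays free of escape codes.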

Additional context
(none)

jqnatividad added the enhancement (New feature or request) label on Jul 27, 2024
@jqnatividad
Owner

Thanks for the well thought-out feature request @mfripp !

Copying in @janriemer - csv-diff's maintainer...

@janriemer

Thank you, @mfripp, for the detailed description and thoughts on this feature (and @jqnatividad for making me aware of it)!
I really like the solution you've described, and I feel it should have the highest priority among the next features for the diff command.

@jqnatividad Can you please assign this issue to me? Thank you.

The possibility of getting the fields that are different is actually already in the implementation of diff - it is just not used yet (waiting on a feature request like yours 😉):

qsv/src/cmd/diff.rs

Lines 245 to 251 in 08cfda6

DiffByteRecord::Modify {
    delete,
    add,
    // TODO: this should be used in the future to highlight the column where differences
    // occur
    field_indices: _field_indices,
} => {

So it shouldn't be too difficult to implement your idea (famous last words?). 🙂
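As a rough illustration of how the currently unused `field_indices` could drive this, the sketch below blanks every field whose index is neither a key column nor listed in `field_indices`. It deliberately does not use csv-diff's real types: `Modified` is a hypothetical stand-in with plain byte vectors, and it assumes `field_indices` holds the positions of the fields that differ between `delete` and `add`.

```rust
/// Hypothetical stand-in for a "modified" diff record (NOT the csv-diff type).
/// Assumption: `field_indices` lists the positions of the differing fields.
pub struct Modified {
    pub delete: Vec<Vec<u8>>,
    pub add: Vec<Vec<u8>>,
    pub field_indices: Vec<usize>,
}

/// Replace every field that is not marked in `keep` with an empty byte vector.
fn blank_equal(row: &[Vec<u8>], keep: &[bool]) -> Vec<Vec<u8>> {
    row.iter()
        .enumerate()
        .map(|(i, field)| {
            if keep.get(i).copied().unwrap_or(false) {
                field.clone()
            } else {
                Vec::new()
            }
        })
        .collect()
}

/// Keep key fields and differing fields; blank everything else on both sides.
pub fn drop_equal_fields(rec: &Modified, key_cols: &[usize]) -> (Vec<Vec<u8>>, Vec<Vec<u8>>) {
    let mut keep = vec![false; rec.delete.len().max(rec.add.len())];
    for &i in key_cols.iter().chain(rec.field_indices.iter()) {
        // Ignore out-of-range indices instead of panicking.
        if let Some(slot) = keep.get_mut(i) {
            *slot = true;
        }
    }
    (blank_equal(&rec.delete, &keep), blank_equal(&rec.add, &keep))
}
```

Since the differing positions are already known, no field comparison is needed at output time; the cost is one boolean mask per modified row.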

Unfortunately, I've been a bit busy lately, so I haven't had the time yet. 😢
However, I should have more time from mid to late August, so I can start implementing a prototype then. 🤞

With regard to your alternative solutions

janriemer pushed a commit to janriemer/qsv that referenced this issue Sep 8, 2024
This implements a new flag for the command `diff`. When activated, it
drops the values of fields that are equal within a row of type
`Modified` and replaces them with the empty string (an empty byte slice
to be precise). For now, the value for replacing equal values is not
configurable, but should be trivial to add in the future.

Note that key field values are _not_ dropped and always appear in the
output.

Example:
csv_left.csv    col1,col2,col3
                1,foo,bar

csv_right.csv   col1,col2,col3
                1,foo,baz

qsv diff --drop-equal-fields csv_left.csv csv_right.csv

Output:         diffresult,col1,col2,col3
                -,1,,bar
                +,1,,baz

See jqnatividad#2000

janriemer commented Sep 8, 2024

Hey @jqnatividad @mfripp 👋

here is the current status of the feature requests in this issue

  • 🎉 Add a flag for dropping equal values (diff: add flag --drop-equal-fields #2114)
  • ⏳ Do not output columns, which don't have different field values
    • this will require a change in csv-diff itself, because it is too costly (performance-wise) to implement it directly in diff command

For the other feature requests, it is probably best to create separate issues so that we don't lose track of them.

mfripp (Author) commented Sep 8, 2024

Thanks, this is great to see!

@jqnatividad
Owner

Just merged #2114 ... just in time for qsv 0.134.0! Thanks @janriemer !
