Performant way of knowing in which column(s) a value (string or number) appears #1214

athalhammer · 2023-10-04T21:35:55Z

This issue has already some track record documented in #1209.

The idea is the following. Let's say I have a long and wide table where I have scattered user feedback and other things mixed into and I'd like to identify columns that mention "service". Then the output of the command should look as follows:

$ command xy "service"
7 567
102 89
3 256

To be read as: In column 567, the word "service" appeared seven times, in column 89 it appeared 102 times, and in column 256 it appeared 3 times. Now I can investigate each of these columns further.

In #1209 I suggested to add some sort of column numbering directly to all fields to be able to use other csvkit and shell commands to get this output. @jpmckinney suggested that this can be done with the following command, added it to the documentation (see c4e5612), and closed the PR.

csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 222

Momentarily I was happy with the conclusion and the re-use of existing tools.

However, I quickly came to realize that this is not performing well. Here is an example:

Prep:

cp examples/realdata/FY09_EDU_Recipients_by_State.csv test.csv
for i in {1..200}; do tail -n+2 examples/realdata/FY09_EDU_Recipients_by_State.csv; done >> test.csv

This results in little more than 10k lines.

Now I do a search for 679 (our "service" in this case) as per suggested method and time it:

time csvgrep -m 679 -a -c 1- FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 679 | sort | uniq -c
    201   4: 679
    201   6: 679

real	1m4.553s
user	0m57.484s
sys	0m7.391s

Compare this to the suggested method in #1209:

# explain first what -N does:
csvcut -h | grep "\-N"
              [-C NOT_COLUMNS] [-x] [-N]
  -N, --column-numbers  Add column numbering everywhere, in header and cells.
  
# actual timing:
time csvcut -N ../csvkit/FY09_EDU_Recipients_by_State.csv | sed "s/.*\([0-9]\+~679\).*/\1/gp" -n | sort | uniq -c
    201 4~679
    201 6~679

real	0m0.236s
user	0m0.237s
sys	0m0.020s

The times are roughly a minute apart and this is not an extremely big file yet (few columns).

Questions to the wider audience:

Is this a relevant problem for someone else?
Does anyone have a good and performing solution with existing tools?
If we were to "add" to csvkit - which tool should/could it be?

The text was updated successfully, but these errors were encountered:

jpmckinney · 2023-10-05T01:53:04Z

Hmm, yes, the shell version is going to be slow, mainly because csvcut needs to start up for every match.

The need, however, seems a bit narrow, for which custom code (like the code you wrote) is the right approach.

If there's sufficient demand and a clear place to implement this feature, it can be done.

athalhammer · 2023-10-07T18:37:36Z

I'm not sure what is the perfect way of skimming through long and wide CSV file in command line and visually make sense out of what is displayed. The max-column-width feature of csvlook only works up to 10-20 columns. As soon as multiple line breaks appear due to wrapping it is basically impossible to relate the displayed value of a cell to the respective column. The initial idea of #1209 was to address this in the first place

jpmckinney added the feature label Oct 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performant way of knowing in which column(s) a value (string or number) appears #1214

Performant way of knowing in which column(s) a value (string or number) appears #1214

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 5, 2023

athalhammer commented Oct 7, 2023

Performant way of knowing in which column(s) a value (string or number) appears #1214

Performant way of knowing in which column(s) a value (string or number) appears #1214

Comments

athalhammer commented Oct 4, 2023 • edited Loading

jpmckinney commented Oct 5, 2023

athalhammer commented Oct 7, 2023

athalhammer commented Oct 4, 2023 •

edited

Loading