This issue already has some track record, documented in #1209.
The idea is the following. Let's say I have a long and wide table with scattered user feedback and other things mixed in, and I'd like to identify the columns that mention "service". The output of the command should then look as follows:
$ command xy "service"
7 567
102 89
3 256
To be read as: In column 567, the word "service" appeared seven times, in column 89 it appeared 102 times, and in column 256 it appeared 3 times. Now I can investigate each of these columns further.
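For illustration, here is a rough single-pass sketch of the behaviour I have in mind. It is plain awk, so it splits naively on commas and ignores CSV quoting, embedded commas, and embedded newlines - handling those correctly is exactly what a proper csvkit tool would add. The pattern and the file name data.csv are placeholders:

awk -F, -v pat=service '
  NR > 1 {                        # skip the header row
    for (i = 1; i <= NF; i++)
      if ($i ~ pat) count[i]++    # field i contains the pattern
  }
  END { for (i in count) print count[i], i }
' data.csv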
In #1209 I suggested adding some sort of column numbering directly to all fields, so that other csvkit and shell commands could be used to get this output. @jpmckinney suggested that this can be done with the following command, added it to the documentation (see c4e5612), and closed the PR.
csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 222
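My reading of the pipeline, stage by stage:

# keep only the rows in which any column contains "222"
csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv |
# end each record with the RS control character (0x1e) instead of a newline,
# so records survive splitting even when a field contains embedded newlines
csvformat -M $'\x1e' |
# for every single record, spawn a shell plus a csvcut process that prints
# each field of that one-row CSV as "column_number: value"
xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh |
# keep only the fields whose value contains the pattern
grep 222

The third stage matters for what follows: it starts a new sh and a new csvcut (a full Python interpreter) for every record, which is presumably where most of the time goes.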
At first I was happy with the conclusion and the reuse of existing tools.
However, I quickly came to realize that it does not perform well. Here is an example:
Prep:
cp examples/realdata/FY09_EDU_Recipients_by_State.csv test.csv
for i in {1..200}; do tail -n+2 examples/realdata/FY09_EDU_Recipients_by_State.csv; done >> test.csv
This results in a little more than 10k lines.
Now I search for 679 (our "service" in this case) using the suggested method and time it:
time csvgrep -m 679 -a -c 1- test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 679 | sort | uniq -c
201 4: 679
201 6: 679
real 1m4.553s
user 0m57.484s
sys 0m7.391s
The suggested method takes roughly a minute, and this is not an extremely big file yet (it has only a few columns).
I'm not sure what the ideal way is to skim through a long and wide CSV file on the command line and visually make sense of what is displayed. The --max-column-width feature of csvlook only works for up to 10-20 columns; as soon as multiple line breaks appear due to wrapping, it becomes basically impossible to relate the displayed value of a cell to its column. The initial idea of #1209 was to address this in the first place.
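For reference, this is the kind of invocation I mean (truncating every cell to at most 10 characters; the width is arbitrary):

csvlook --max-column-width 10 examples/realdata/FY09_EDU_Recipients_by_State.csv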
(our "service" in this case) as per suggested method and time it:Compare this to the suggested method in #1209:
The times are roughly a minute apart and this is not an extremely big file yet (few columns).
Questions to the wider audience: Should such functionality be added to csvkit - and if so, which tool should/could it be?