added 'column numbers everywhere' to csvcut #1209

athalhammer · 2023-09-26T20:34:15Z

I've run frequently into the problem that I'm facing wide CSV files where there is some interesting string/text somewhere in a long line of empty fields or numbers.

Example:

In which columns is ipsum (first row, after header) actually showing up?

field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4

We want to approach this without manually counting the commas.

The new addition provides this and in combination with grep the result will be highlighted:

$ csvcut -N test.csv | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,"45~evil, ipsum, with commas,",46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4

$ csvcut -c 45,66,74 test.csv 
field45,field66,field74
"evil, ipsum, with commas,",ipsum,ipsum

This is not a breaking change as it keeps all columns where they are. It alters all the fields but it's sole purpose is for exploratory reasons.

Also, please let me know if you know a better or any solution how the above can be achieved without this modification.

athalhammer · 2023-10-01T12:53:13Z

@jpmckinney I put it for csvcut as it was the most straightforward code change. Maybe agate would also be a good place to put it but it probably triggers more code changes for a feature that is not frequently sought after (see https://stackoverflow.com/questions/55253173/return-the-column-name-for-a-column-containig-a-text-value)

athalhammer · 2023-10-01T14:30:49Z

@jpmckinney fixed the flake8 issue

athalhammer · 2023-10-03T17:31:57Z

@jpmckinney for some reason SQLAlchemy 2.x.x is installed in the runner - that triggers the job fails. In the previous jobs it was SQLAlchemy 1.4.x

https://github.com/wireservice/csvkit/actions/runs/6371290078/job/17361039598#step:4:186

athalhammer · 2023-10-03T17:34:10Z

culprit here: wireservice/agate-sql#40

athalhammer · 2023-10-03T17:47:00Z

@jpmckinney I pinned agate-sql to <0.6.0 to make the tests run for this PR. I believe we should address the migration to agate-sql 0.6.0 and SQLAlchemy>=2.x.x in a separate PR.

coveralls · 2023-10-03T18:01:15Z

coverage: 88.239% (-0.2%) from 88.443% when pulling 61eed27 on athalhammer:master into fa9e0db on wireservice:master.

coveralls · 2023-10-03T18:01:55Z

coverage: 88.362% (-0.08%) from 88.443% when pulling 61eed27 on athalhammer:master into fa9e0db on wireservice:master.

athalhammer · 2023-10-03T18:23:28Z

Sorry @jpmckinney , should have added a test on this feature earlier. Coverage should go up again with the last commit

jpmckinney · 2023-10-04T05:47:29Z

Hmm, what about putting this in csvlook?

athalhammer · 2023-10-04T06:51:49Z

I think having the option to combine it with large tables is intriguing, e.g. csvcut -c 700-800 -N ... The output can always be piped to csvlook.

One way how I have come to love this feature is by "collecting" relevant columns in the following way:

$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -i ipsum

In that way I can quickly collect all relevant "ipsum" columns in a long and wide table and eventually just leave the -N away and switch -C for -c

(see above)

$ csvcut -c 45,66,74 test.csv

upstream merge

jpmckinney · 2023-10-04T14:30:14Z

As implemented, this seems more like a new utility – like csvprefix – than a logical addition to csvcut. (Changing prefix to cut is the only additional work when switching between commands.) However, csvkit isn't accepting new utilities.

For your example, you can achieve the same with existing tools:

$ csvgrep -m ipsum -c 1-75 -a test.csv | tail +2 | csvgrep -n | grep ipsum
 45: evil, ipsum, with commas,
 66: ipsum
 74: ipsum

athalhammer · 2023-10-04T15:20:55Z

I agree with you, my example was not comprehensive enough (single line only):

Let's extend to the following (still not big enough, but I hope it transmits the point)

$ cat test.csv 
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5

Then I can see in one "glimpse" (without working through the data line by line with tail +x) which numbers pop up in the first x (head -n x) lines:

$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4
1~,2~,3~,4~,5~ipsum,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~4,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~41,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.98,41~,42~34,43~,44~,46~,47~lorem,48~,49~,50~lorem,51~,52~,53~,54~ipsum,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~,67~,68~,69~,70~,71~,72~3,73~,74~lorem,75~5

Then I can easily add 66,74,5,54 to my -C and I execute again (on a file with > 1000s of lines) to see whether I caught the "bulk of columns where 'ipsum' may show up". Empty result basically means: you have all the "ipsums" and then I can continue working with some "lorems" etc. to make sure I have identified all the columns where "a user gave negative feedback" (or something like that)

athalhammer · 2023-10-04T15:33:19Z

Maybe it fits more naturally to csvgrep as a parameter which indicates: "match any character but remove columns that don't have a match in any line"... But then you want be able to specify e.g. the primary key column and others to be always there (even though there is no match). It will also work only with -a, so something like:

csvgrep -a -m ipsum -c 1-75 --remove-non-matching-except 1,2

jpmckinney · 2023-10-04T16:24:57Z

csvgrep streams the data, so it can't remove non-matching columns.

To work on multiple lines, you can add xargs. I use \x1e (record separator character) in case the cells contain newlines.

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
  5: ipsum
 54: ipsum

athalhammer · 2023-10-04T17:07:28Z

okay, point taken, so the conclusion would be to add a new tool named csvprefix that has the option to use column names or numbers as prefixes in cells and maybe some user-provided string arguments and a separator parameter (default ~) and maybe a "--include-header" parameter?

csvprefix --column-numbers  --include-header test.csv
...

csvprefix --column-numbers --separator "=" test.csv
...

csvprefix --column-names test.csv

csvprefix --prefix "asdf" test.csv
...

jpmckinney · 2023-10-04T17:33:22Z

Is there an issue with using xargs? csvkit's philosophy is to compose commands with pipes and to not invent new commands if existing ones are fine.

csvkit is in maintenance mode, so only existing commands are being maintained. Additional commands would have to be created as independent Python packages.

athalhammer · 2023-10-04T17:42:56Z

Hm, I'm not an expert and probably there's a way to fix it but I could not get it run with:

$ cat test.csv 
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,"fdsa

 ipsum",,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5

At the same time, it revealed that it may not be the best idea either as it tries to execute content as code:

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
,,,,ipsum,,,,,,,,,,4,,,,,,,,,,41,,,,fdsa
sh: 3: ipsum,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5: not found
StopIteration:

…ching text, #1209

jpmckinney · 2023-10-04T17:47:04Z

You just need to quote the replstr in xargs:

csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum

athalhammer · 2023-10-04T17:51:21Z

That again returns invalid results for me

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
  5: ipsum

(the file only has 75 columns)

jpmckinney · 2023-10-04T17:54:32Z

Ah, yeah, need to use single quotes:

csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c "echo '_' | csvcut -n" | grep ipsum

jpmckinney · 2023-10-04T18:03:31Z

Even more robust:

csvgrep -m ipsum -a -c 1- test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep ipsum

athalhammer · 2023-10-04T18:19:19Z

The last one does the trick. Thanks for adding it to the docs - I will need to look this up now every time.

Still believe that the -N option (for the 10 loc it adds) was more elegant, has probably many more use cases (in addition to the examined one), and is not completely off to hang out in csvcut (one could ask why -n is there and not in some csvinfo tool, too)

jpmckinney · 2023-10-04T19:30:17Z

This would have been better as an issue, but I'll close this PR for now, and perhaps if there is any interest from other users they will still find the PR discussion.

athalhammer · 2023-10-04T19:38:59Z

I agree with you, @jpmckinney, that this would have been better as an issue. Thanks for your time!

…issues

athalhammer force-pushed the master branch 4 times, most recently from 04189d6 to a8eaadb Compare September 26, 2023 21:06

athalhammer changed the title ~~added 'line numbers everywhere' to csvcut~~ added 'column numbers everywhere' to csvcut Sep 26, 2023

athalhammer added 2 commits September 26, 2023 23:21

added 'column numbers everywhere' to csvcut

fcc5571

added compatibility to --zero

8460d8c

athalhammer force-pushed the master branch from a8eaadb to 8460d8c Compare September 26, 2023 21:22

made flake8 compliant

c686943

athalhammer force-pushed the master branch from a5b9886 to c686943 Compare October 4, 2023 06:53

athalhammer and others added 4 commits October 4, 2023 08:54

Merge pull request #1 from wireservice/master

62ca0bd

upstream merge

adding test for csvcut -N

9530992

upstream merge

6567c2c

fixed ctrl+p/ctrl+v error on function name

ac33cd5

jpmckinney added a commit that referenced this pull request Oct 4, 2023

docs: Document how to get the indices of the columns that contain mat…

66e7086

…ching text, #1209

jpmckinney closed this Oct 4, 2023

athalhammer referenced this pull request Oct 4, 2023

docs(csvgrep): Use variable ($1) instead of replstr to avoid quoting …

c4e5612

…issues

athalhammer mentioned this pull request Oct 4, 2023

Performant way of knowing in which column(s) a value (string or number) appears #1214

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added 'column numbers everywhere' to csvcut #1209

added 'column numbers everywhere' to csvcut #1209

athalhammer commented Sep 26, 2023 •

edited

Loading

athalhammer commented Oct 1, 2023 •

edited

Loading

athalhammer commented Oct 1, 2023

athalhammer commented Oct 3, 2023

athalhammer commented Oct 3, 2023

athalhammer commented Oct 3, 2023

coveralls commented Oct 3, 2023

coveralls commented Oct 3, 2023

athalhammer commented Oct 3, 2023

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023

jpmckinney commented Oct 4, 2023

jpmckinney commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023

added 'column numbers everywhere' to csvcut #1209

added 'column numbers everywhere' to csvcut #1209

Conversation

athalhammer commented Sep 26, 2023 • edited Loading

athalhammer commented Oct 1, 2023 • edited Loading

athalhammer commented Oct 1, 2023

athalhammer commented Oct 3, 2023

athalhammer commented Oct 3, 2023

athalhammer commented Oct 3, 2023

coveralls commented Oct 3, 2023

coveralls commented Oct 3, 2023

athalhammer commented Oct 3, 2023

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023 • edited Loading

jpmckinney commented Oct 4, 2023 • edited Loading

athalhammer commented Oct 4, 2023 • edited Loading

athalhammer commented Oct 4, 2023 • edited Loading

jpmckinney commented Oct 4, 2023 • edited Loading

athalhammer commented Oct 4, 2023 • edited Loading

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023

jpmckinney commented Oct 4, 2023

jpmckinney commented Oct 4, 2023 • edited Loading

athalhammer commented Oct 4, 2023 • edited Loading

jpmckinney commented Oct 4, 2023

athalhammer commented Oct 4, 2023

athalhammer commented Sep 26, 2023 •

edited

Loading

athalhammer commented Oct 1, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading

jpmckinney commented Oct 4, 2023 •

edited

Loading

athalhammer commented Oct 4, 2023 •

edited

Loading