Improve CSV parser to stream file to handle very large volume of data #7594

richard-julien · 2024-07-04T09:58:53Z

…arge volume of data (#7589)

codecov · 2024-07-04T10:21:59Z

Codecov Report

Attention: Patch coverage is 47.30539% with 88 lines in your changes missing coverage. Please review.

Project coverage is 67.52%. Comparing base (3ab41ac) to head (d06f72d).
Report is 215 commits behind head on master.

Files	Patch %	Lines
...hql/src/connector/importCsv/importCsv-connector.ts	0.00%	64 Missing ⚠️
...rm/opencti-graphql/src/manager/ingestionManager.ts	18.18%	9 Missing ⚠️
...phql/src/modules/ingestion/ingestion-csv-domain.ts	16.66%	5 Missing ⚠️
...src/modules/internal/csvMapper/csvMapper-domain.ts	50.00%	3 Missing ⚠️
.../modules/internal/csvMapper/csvMapper-resolvers.ts	33.33%	2 Missing ⚠️
.../internal/csvMapper/deprecated/csvMapper-domain.ts	92.85%	2 Missing ⚠️
...-platform/opencti-graphql/src/parser/csv-parser.ts	86.66%	2 Missing ⚠️
...-platform/opencti-graphql/src/database/rabbitmq.js	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7594      +/-   ##
==========================================
- Coverage   67.56%   67.52%   -0.04%     
==========================================
  Files         567      570       +3     
  Lines       69946    69996      +50     
  Branches     5937     5927      -10     
==========================================
+ Hits        47257    47266       +9     
- Misses      22689    22730      +41


SouadHadjiat · 2024-07-08T13:11:35Z

opencti-platform/opencti-graphql/src/connector/importCsv/importCsv-connector.ts

+        filename: `${workId}.json`,
+        mimetype: 'application/json',
+      };
+      await uploadToStorage(context, applicantUser, 'import/pending', file, { entity });


So it will upload an analyst workbench file with a bundle for the lines currently being parsed? Does it append to the file or just replace it? From what I read, it looks like it will replace it.

richard-julien · 2024-08-18

Really good point, I need to dig into this.
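
To make the append-versus-replace question concrete, here is a minimal sketch, not the PR's code. It assumes the storage behaves like a key-value store in which uploading to an existing path replaces the file; the real `uploadToStorage` is not reproduced here. `FakeStorage`, `uploadChunks` and the `-part-N` naming are hypothetical and exist only for illustration.

```typescript
// Hypothetical in-memory stand-in for an object store keyed by file path.
class FakeStorage {
  private files = new Map<string, string>();

  // Assumed semantics: uploading to an existing path replaces the content.
  upload(path: string, content: string): void {
    this.files.set(path, content);
  }

  // List stored paths under a prefix.
  list(prefix: string): string[] {
    return Array.from(this.files.keys())
      .filter((key) => key.startsWith(prefix))
      .sort();
  }
}

// Upload one bundle per chunk, under a fixed name or a per-chunk name.
function uploadChunks(
  storage: FakeStorage,
  workId: string,
  chunks: string[],
  perChunkName: boolean,
): void {
  chunks.forEach((bundle, index) => {
    const filename = perChunkName
      ? `${workId}-part-${index}.json` // hypothetical naming scheme
      : `${workId}.json`; // same name for every chunk, as in the diff above
    storage.upload(`import/pending/${filename}`, bundle);
  });
}

const bundles = ['{"objects":[1]}', '{"objects":[2]}', '{"objects":[3]}'];

const fixedName = new FakeStorage();
uploadChunks(fixedName, 'work-1', bundles, false);

const chunkedName = new FakeStorage();
uploadChunks(chunkedName, 'work-1', bundles, true);

console.log(fixedName.list('import/pending/').length); // 1
console.log(chunkedName.list('import/pending/').length); // 3
```

Under that assumption, only the last chunk's bundle survives when every chunk is written to `${workId}.json`, which is the behaviour the review comment suspects.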

aHenryJard

I have an issue with the counts, so I have some doubts about the data-splitting part. I tested with all French cities from https://www.data.gouv.fr/fr/datasets/villes-de-france/.

My CSV file has 39146 lines, including the header:

wc -l cities.csv
39146 cities.csv

But in the import screen I see 32759 expected lines.

And at the end, in the OpenCTI UI under Data > Entities, I have 32759 entities (when I'm expecting 39145). I only got 1 error, on one line of the CSV.

I'm adding the CSV mapper that I used:

So I think the split into chunks of CSV_MAX_BUNDLE_SIZE_GENERATION is somehow incorrect.
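
One way to check a chunked reader in isolation is to count what it emits. The sketch below is an illustration under stated assumptions, not the connector's code: `LineChunker` and `MAX_BUNDLE_SIZE` are hypothetical names standing in for the split driven by `CSV_MAX_BUNDLE_SIZE_GENERATION`, and the value 1000 is made up. It buffers rows as they arrive, emits a chunk each time the limit is reached, and emits the final partial chunk on an explicit flush, a step that silently drops rows if it is forgotten.

```typescript
// Push-based chunker: rows arrive one at a time, as in a stream 'data' handler.
class LineChunker {
  private buffer: string[] = [];
  private readonly maxSize: number;
  private readonly onChunk: (lines: string[]) => void;

  constructor(maxSize: number, onChunk: (lines: string[]) => void) {
    this.maxSize = maxSize;
    this.onChunk = onChunk;
  }

  // Buffer one row and emit a chunk as soon as the limit is reached.
  push(line: string): void {
    this.buffer.push(line);
    if (this.buffer.length >= this.maxSize) {
      this.flush();
    }
  }

  // Emit whatever is buffered; must be called once at end of stream.
  flush(): void {
    if (this.buffer.length === 0) return;
    this.onChunk(this.buffer);
    this.buffer = [];
  }
}

const MAX_BUNDLE_SIZE = 1000; // hypothetical value, not the platform's setting
const TOTAL_ROWS = 39145; // data rows expected in the report above

let emitted = 0;
let chunkCount = 0;
const chunker = new LineChunker(MAX_BUNDLE_SIZE, (lines) => {
  emitted += lines.length;
  chunkCount += 1;
});

for (let i = 0; i < TOTAL_ROWS; i += 1) {
  chunker.push(`row-${i}`);
}
const emittedBeforeFinalFlush = emitted; // rows still buffered are not counted yet
chunker.flush();

console.log(emittedBeforeFinalFlush, emitted, chunkCount); // 39000 39145 40
```

Comparing the emitted row count with the number of data lines in the file separates a splitting problem from a downstream one such as deduplication.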

opencti-platform/opencti-graphql/src/connector/importCsv/importCsv-connector.ts

aHenryJard · 2024-07-16T12:56:13Z

Actually there are some duplicates, but I'm still missing 6 cities. I will look into it.
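
The effect of duplicates on the final count can be illustrated with a short sketch. It assumes, hypothetically, that rows sharing the same key resolve to a single entity; the `CityRow` shape, the choice of key and the sample rows are made up for illustration and are not taken from the dataset or from the CSV mapper.

```typescript
// Hypothetical row shape; the real dataset has other columns.
interface CityRow {
  name: string;
  postalCode: string;
}

// Count how many distinct entities a set of rows yields for a given key.
function countDistinct(rows: CityRow[], keyOf: (row: CityRow) => string): number {
  const seen = new Set<string>();
  rows.forEach((row) => seen.add(keyOf(row)));
  return seen.size;
}

// Made-up sample: two rows share a name but differ by postal code.
const rows: CityRow[] = [
  { name: 'City A', postalCode: '00001' },
  { name: 'City A', postalCode: '00002' },
  { name: 'City B', postalCode: '00003' },
  { name: 'City C', postalCode: '00004' },
];

const byName = countDistinct(rows, (row) => row.name);
const byNameAndCode = countDistinct(rows, (row) => `${row.name}|${row.postalCode}`);

console.log(rows.length, byName, byNameAndCode); // 4 3 4
```

Running the same count on the real file with the key the mapper actually uses would show how much of the gap duplicates account for.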

aHenryJard · 2024-10-02T09:23:10Z

@richard-julien FYI, if it's fine with you, I'm going to update this branch and review it in the coming days/weeks as part of the #7400 feature.

richard-julien · 2024-10-02T11:29:51Z

OK, great! Thanks.
We need to discuss the generation of the workbench, which my approach currently does not handle.
Maybe we need to wait for the draft for this.

richard-julien added 4 commits July 3, 2024 22:36

[backend/frontend] Improve CSV parser to stream file to handle very large volume of data (#7589)

b79eaf6

[frontend] Fix imports (#7589)

f940b4a

[frontend] Change typing from query to mutation (#7589)

f3b5481

[frontend] Remove any (#7589)

d06f72d

richard-julien added the filigran team label Jul 4, 2024

richard-julien requested review from SouadHadjiat, aHenryJard and labo-flg July 4, 2024 10:00

richard-julien self-assigned this Jul 4, 2024

richard-julien linked an issue Jul 4, 2024 that may be closed by this pull request

Improve CSV parser to stream file to handle very large volume of data #7589

Open

SouadHadjiat reviewed Jul 8, 2024

aHenryJard self-assigned this Jul 9, 2024

aHenryJard reviewed Jul 9, 2024

opencti-platform/opencti-graphql/src/connector/importCsv/importCsv-connector.ts (resolved)

aHenryJard removed their assignment Sep 4, 2024

SamuelHassine force-pushed the master branch from f8359b7 to 558e6c1 Compare September 17, 2024 15:10

richard-julien marked this pull request as draft October 1, 2024 20:52

aHenryJard assigned aHenryJard and CelineSebe Oct 2, 2024

aHenryJard mentioned this pull request Oct 8, 2024

[backend] Allow csv mapper to import entities present in several row and stream csv reading (#7400)(#7589) #8638

Open

5 tasks
