Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CSV parser to stream file to handle very large volume of data #7594

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

richard-julien
Copy link
Member

See #7589

@richard-julien richard-julien added the filigran team use to identify PR from the Filigran team label Jul 4, 2024
@richard-julien richard-julien self-assigned this Jul 4, 2024
Copy link

codecov bot commented Jul 4, 2024

Codecov Report

Attention: Patch coverage is 47.30539% with 88 lines in your changes missing coverage. Please review.

Project coverage is 67.52%. Comparing base (3ab41ac) to head (d06f72d).
Report is 215 commits behind head on master.

Files Patch % Lines
...hql/src/connector/importCsv/importCsv-connector.ts 0.00% 64 Missing ⚠️
...rm/opencti-graphql/src/manager/ingestionManager.ts 18.18% 9 Missing ⚠️
...phql/src/modules/ingestion/ingestion-csv-domain.ts 16.66% 5 Missing ⚠️
...src/modules/internal/csvMapper/csvMapper-domain.ts 50.00% 3 Missing ⚠️
.../modules/internal/csvMapper/csvMapper-resolvers.ts 33.33% 2 Missing ⚠️
.../internal/csvMapper/deprecated/csvMapper-domain.ts 92.85% 2 Missing ⚠️
...-platform/opencti-graphql/src/parser/csv-parser.ts 86.66% 2 Missing ⚠️
...-platform/opencti-graphql/src/database/rabbitmq.js 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7594      +/-   ##
==========================================
- Coverage   67.56%   67.52%   -0.04%     
==========================================
  Files         567      570       +3     
  Lines       69946    69996      +50     
  Branches     5937     5927      -10     
==========================================
+ Hits        47257    47266       +9     
- Misses      22689    22730      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

filename: `${workId}.json`,
mimetype: 'application/json',
};
await uploadToStorage(context, applicantUser, 'import/pending', file, { entity });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it will upload an analyst workbench file with a bundle for the current lines that are being parsed ? does it append the file or replace it only ? From what I read it looks like it will replace it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good point, i need to dig in into this

@aHenryJard aHenryJard self-assigned this Jul 9, 2024
Copy link
Member

@aHenryJard aHenryJard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an issue with count and number, so I have some doubt on the split data part. I tested will all french cities from https://www.data.gouv.fr/fr/datasets/villes-de-france/.

My csv file has an header and 39146 lines

wc -l cities.csv
39146 cities.csv

But in the import screen I see 32759 expected lines=>
image

And at the end in OpenCTI UI in data > entities I have 32759 (when I'm expecting 39145). I only got 1 error in one line of the csv.

I'm adding the csv mapper that I used:
image

So I think that the split in chunck of CSV_MAX_BUNDLE_SIZE_GENERATION is not correct somehow

@aHenryJard
Copy link
Member

I have an issue with count and number, so I have some doubt on the split data part. I tested will all french cities from https://www.data.gouv.fr/fr/datasets/villes-de-france/.

My csv file has an header and 39146 lines

wc -l cities.csv
39146 cities.csv

But in the import screen I see 32759 expected lines=> image

And at the end in OpenCTI UI in data > entities I have 32759 (when I'm expecting 39145). I only got 1 error in one line of the csv.

I'm adding the csv mapper that I used: image

So I think that the split in chunck of CSV_MAX_BUNDLE_SIZE_GENERATION is not correct somehow

Actually there is some duplicate, but I'm still missing 6 cities, I will look

@aHenryJard
Copy link
Member

@richard-julien FYI if it's fine with you, I'm going to update this branch and review in the following days/ weeks as part of #7400 feature.

@richard-julien
Copy link
Member Author

richard-julien commented Oct 2, 2024

@richard-julien FYI if it's fine with you, I'm going to update this branch and review in the following days/ weeks as part of #7400 feature.

Ok great! Thanks.
We need to discuss about the generation of workbench that currently is not handled by my approach.
Maybe we need to wait for the draft for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
filigran team use to identify PR from the Filigran team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve CSV parser to stream file to handle very large volume of data
4 participants