[backend] Allow csv mapper to import entities present in several row and stream csv reading (#7400)(#7589) #8638

Open
wants to merge 8 commits into
base: release/6.4.0

aHenryJard
Member

@aHenryJard aHenryJard commented Oct 8, 2024

Proposed changes

  1. Allow the CSV Mapper to import entities present in several rows:
  • stop removing every duplicated STIX id from the bundle generated from the CSV; instead, diff all entity fields to check whether an object is a full duplicate or not
  • since the worker also removes duplicates in a bundle based on the generated STIX id, a new bundle is created when 2 CSV lines generate the same STIX id
  • filter the bundle before sending to keep the last version of a duplicated entity (instead of the first one, as in the current code)
  2. Stream imported CSV files to avoid loading the whole file in memory:
  • add a parameter in importCsv-connector for the bulk size of data to process (bulk_creation_size)
  • based on the work in "Improve CSV parser to stream file to handle very large volume of data" #7594
  • reading process: open the MinIO file, read bulk_creation_size lines, close the MinIO file, then parse and run the CSV mapper on the bulk. Reason: keeping the MinIO file open causes socket timeout issues on huge files.
  • workbenches are out of the streaming scope; the previous code is still used (with a comment marking it as deprecated)
  • deprecation on the test part (because the API changes)
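
To illustrate the first change, here is a minimal sketch of the deduplication idea. It is not the code from this PR: `StixObject`, `canonical`, `splitIntoBundles` and `keepLastVersion` are hypothetical names, and the field-by-field diff is approximated by comparing a key-sorted serialization of each object.

```typescript
// Hypothetical minimal shape; real STIX objects carry many more fields.
type StixObject = { id: string; [field: string]: unknown };

// Key-order-insensitive serialization, used to compare all fields of two objects.
const canonical = (value: unknown): string => {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(record[k])}`).join(',')}}`;
  }
  return String(JSON.stringify(value));
};

// Split objects so that no bundle holds the same id twice.
// Full duplicates (same id, same fields) are dropped; an object that reuses an
// id with different fields goes to the first bundle that does not hold that id.
const splitIntoBundles = (objects: StixObject[]): StixObject[][] => {
  const bundles: Map<string, StixObject>[] = [];
  for (const object of objects) {
    const isFullDuplicate = bundles.some((bundle) => {
      const existing = bundle.get(object.id);
      return existing !== undefined && canonical(existing) === canonical(object);
    });
    if (isFullDuplicate) {
      continue;
    }
    let target = bundles.find((bundle) => !bundle.has(object.id));
    if (!target) {
      target = new Map<string, StixObject>();
      bundles.push(target);
    }
    target.set(object.id, object);
  }
  return bundles.map((bundle) => Array.from(bundle.values()));
};

// Last-wins filter for a single bundle: the latest object seen for an id
// replaces the earlier one (it keeps the position of the first occurrence).
const keepLastVersion = (bundle: StixObject[]): StixObject[] => {
  const byId = new Map<string, StixObject>();
  for (const object of bundle) {
    byId.set(object.id, object);
  }
  return Array.from(byId.values());
};
```

With this sketch, two rows that produce the same id but different fields end up in two separate bundles, so a consumer that deduplicates by id inside one bundle does not lose the second row.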

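The reading process from the second change can be sketched as follows. This is an assumption-laden illustration, not the connector code: `OpenReader` and `processInBulks` are hypothetical, the reader stands in for the MinIO file, and CSV header handling is left out. The point it shows is that the file is closed before each bulk is handed to the (possibly slow) mapper.

```typescript
// Hypothetical file abstraction: every call opens a fresh reader positioned
// at `startLine`; `next` returns undefined at end of file.
type OpenReader = (startLine: number) => {
  next: () => string | undefined;
  close: () => void;
};

// Read the file bulk by bulk: open, read up to `bulkSize` lines, close,
// and only then process the bulk. Returns the number of lines processed.
const processInBulks = (
  open: OpenReader,
  bulkSize: number,
  handleBulk: (lines: string[]) => void,
): number => {
  if (bulkSize < 1) {
    throw new Error('bulkSize must be at least 1');
  }
  let processed = 0;
  while (true) {
    const reader = open(processed);
    const bulk: string[] = [];
    try {
      let line = reader.next();
      while (line !== undefined) {
        bulk.push(line);
        if (bulk.length === bulkSize) {
          break;
        }
        line = reader.next();
      }
    } finally {
      // The file is never left open while the bulk is being mapped.
      reader.close();
    }
    if (bulk.length === 0) {
      return processed;
    }
    handleBulk(bulk);
    processed += bulk.length;
    if (bulk.length < bulkSize) {
      return processed;
    }
  }
};
```

The trade-off is one extra open per bulk in exchange for never holding a connection during processing, which matches the socket timeout rationale given above.
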
Impacts:

  • CSV Mapper, csv mapper test
  • Data > import csv file
  • Entity > Data import
  • CSV Feed (only for entities spread over several rows; not affected by streaming)

Related issues

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality
  • I wrote test cases for the relevant use cases (coverage and e2e)
  • I added/updated the relevant documentation (either on GitHub or on Notion)
  • Where necessary I refactored code to improve the overall quality

Further comments

@aHenryJard aHenryJard changed the title from "[backend] stream csv data with new csv connector implementation (#7400)(#7589)" to "[backend] Improve csv data parser to handle very large volume (#7400)(#7589)" Oct 8, 2024
@labo-flg labo-flg force-pushed the release/6.4.0 branch 2 times, most recently from c150d0f to 264e053 Compare October 15, 2024 20:34
@aHenryJard aHenryJard changed the title from "[backend] Improve csv data parser to handle very large volume (#7400)(#7589)" to "[backend] Allow csv mapper to import entities present in several row and stream csv reading (#7400)(#7589)" Oct 16, 2024
@aHenryJard aHenryJard force-pushed the issue/7589-rework branch 2 times, most recently from 4632c1c to a2028f9 Compare October 17, 2024 13:51

codecov bot commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 69.66581% with 118 lines in your changes missing coverage. Please review.

Project coverage is 64.57%. Comparing base (3e0f5a7) to head (0640cdf).
Report is 1 commits behind head on release/6.4.0.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...hql/src/connector/importCsv/importCsv-connector.ts | 50.00% | 78 Missing ⚠️ |
| ...phql/src/modules/ingestion/ingestion-csv-domain.ts | 0.00% | 11 Missing ⚠️ |
| ...platform/opencti-graphql/src/parser/csv-bundler.ts | 89.89% | 10 Missing ⚠️ |
| ...src/modules/internal/csvMapper/csvMapper-domain.ts | 10.00% | 9 Missing ⚠️ |
| ...rm/opencti-graphql/src/manager/ingestionManager.ts | 94.87% | 2 Missing ⚠️ |
| .../modules/internal/csvMapper/csvMapper-resolvers.ts | 33.33% | 2 Missing ⚠️ |
| .../internal/csvMapper/deprecated/csvMapper-domain.ts | 88.23% | 2 Missing ⚠️ |
| ...-platform/opencti-graphql/src/parser/csv-parser.ts | 84.61% | 2 Missing ⚠️ |
| ...-platform/opencti-graphql/src/database/rabbitmq.js | 0.00% | 1 Missing ⚠️ |
| ...pencti-platform/opencti-graphql/src/domain/work.js | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@                Coverage Diff                @@
##           release/6.4.0    #8638      +/-   ##
=================================================
+ Coverage          64.33%   64.57%   +0.23%     
=================================================
  Files                608      611       +3     
  Lines              58023    58238     +215     
  Branches            6403     6453      +50     
=================================================
+ Hits               37330    37606     +276     
+ Misses             20693    20632      -61     


@aHenryJard aHenryJard force-pushed the issue/7589-rework branch 3 times, most recently from 1175a88 to dd925da Compare October 29, 2024 17:02
@aHenryJard aHenryJard marked this pull request as ready for review November 4, 2024 18:52
@aHenryJard aHenryJard added the "filigran team" label (use to identify PR from the Filigran team) Nov 5, 2024