Virgo HDF5 file format #240

jarlsondre · 2024-11-07T08:28:51Z

Summary

Updates the data loading and synthetic data generation to use the HDF5 file format instead of pickle files. This makes the code both shorter and easier to work with. In particular, we are no longer limited to storing exactly 500 rows per file and can instead choose a chunk size that suits us after the data has been created. Additionally, because HDF5 supports reading only part of a dataset at a time, all the data can be stored in a single file without loading too much of it into memory at runtime.
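
The memory point can be illustrated with a minimal sketch, assuming h5py and NumPy are installed. The file name `virgo_demo.h5`, the dataset name `"data"`, and the helper `iter_batches` are hypothetical and not taken from this PR. The sketch only shows that slicing an h5py dataset reads just the requested rows, so the batch size can be chosen at load time instead of being fixed when the files are written.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_batches(file_path, dataset_name, batch_size):
    """Yield consecutive row batches; only the sliced rows are read into memory."""
    with h5py.File(file_path, "r") as f:
        dset = f[dataset_name]
        for start in range(0, dset.shape[0], batch_size):
            # Slicing an h5py dataset returns a NumPy array with just these rows.
            yield dset[start : start + batch_size]


# Hypothetical demo file: 1200 rows of 8 values each, stored in one HDF5 file.
demo_path = os.path.join(tempfile.mkdtemp(), "virgo_demo.h5")
with h5py.File(demo_path, "w") as f:
    rows = np.arange(1200 * 8, dtype=np.float32).reshape(1200, 8)
    f.create_dataset("data", data=rows)

# The batch size is picked here, at read time, not when the file was created.
batch_shapes = [b.shape for b in iter_batches(demo_path, "data", batch_size=500)]
print(batch_shapes)
```

Changing `batch_size` requires no change to the file on disk, which is the flexibility the summary refers to.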

Related issue:
None

use-cases/virgo/config.yaml

use-cases/virgo/data.py

matbun · 2024-11-07T13:50:53Z

use-cases/virgo/synthetic_data_gen/file_gen.py

+# sys.path.append(str(Path("..").resolve()))
+sys.path.append(str(Path.cwd().resolve()))


Interesting. What is this doing?

For some reason the import could not be resolved, so I added the folder to sys.path. I add the cwd because I usually run this from the use-cases/virgo/ folder rather than from the synthetic_data_gen folder. The import is a bit odd because the use-cases folder sits outside the src folder, which means it is not considered part of the itwinai library.
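
As a possible alternative to relying on the cwd, the folder can be derived from the script's own location, so the import resolves no matter where the script is launched from. This is only a sketch: the helper `add_parent_to_sys_path` is hypothetical and not part of this PR, and the relative path below stands in for `__file__`.

```python
import sys
from pathlib import Path


def add_parent_to_sys_path(script_path, levels_up=1):
    """Append an ancestor directory of script_path to sys.path (once) and return it."""
    target = str(Path(script_path).resolve().parents[levels_up])
    if target not in sys.path:
        sys.path.append(target)
    return target


# Hypothetical stand-in for __file__, mirroring the layout discussed in this thread.
# In the real script, pass __file__ so the result does not depend on the cwd.
script = Path("use-cases/virgo/synthetic_data_gen/file_gen.py")
added = add_parent_to_sys_path(script, levels_up=1)
print(added)
```

With `levels_up=1` the directory one level above `synthetic_data_gen` is added, which corresponds to the `use-cases/virgo` folder in this layout.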

matbun · 2024-11-07T13:58:39Z

use-cases/virgo/synthetic_data_gen/file_gen.py

+    with h5py.File(file_path, "a") as f:
+        dset = f[dataset_name]
+        dset.resize(dset.shape[0] + array.shape[0], axis=0)
+        dset[-array.shape[0] :] = array


Is this thread-safe? Perhaps HDF5 ensures it somehow?

I doubt it's thread-safe, but I am using a single file per process, so I don't think it should be a problem.
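
For context, the append pattern in the diff only works if the dataset was created as resizable along axis 0. Below is a minimal sketch of that setup together with one output file per process, assuming h5py and NumPy are installed. The helpers `create_resizable` and `append_rows`, the dataset name `"data"`, and the `virgo_part_<pid>.h5` naming are hypothetical and not taken from this PR. It does not attempt any coordination between concurrent writers to the same file.

```python
import os
import tempfile

import h5py
import numpy as np


def create_resizable(file_path, dataset_name, row_shape, dtype=np.float32):
    """Create an empty dataset that can grow along axis 0."""
    with h5py.File(file_path, "w") as f:
        f.create_dataset(
            dataset_name,
            shape=(0, *row_shape),
            maxshape=(None, *row_shape),  # None means unlimited along axis 0
            dtype=dtype,
            chunks=True,  # resizable datasets must use chunked storage
        )


def append_rows(file_path, dataset_name, array):
    """Grow the dataset and write the new rows at the end."""
    with h5py.File(file_path, "a") as f:
        dset = f[dataset_name]
        old_len = dset.shape[0]
        dset.resize(old_len + array.shape[0], axis=0)
        # Explicit start index; for non-empty arrays this matches the
        # negative slice used in the diff.
        dset[old_len:] = array


# Hypothetical per-process file name, so no two processes write to the same file.
tmp_dir = tempfile.mkdtemp()
part_path = os.path.join(tmp_dir, f"virgo_part_{os.getpid()}.h5")

create_resizable(part_path, "data", row_shape=(4,))
append_rows(part_path, "data", np.ones((3, 4), dtype=np.float32))
append_rows(part_path, "data", np.zeros((2, 4), dtype=np.float32))

with h5py.File(part_path, "r") as f:
    final_shape = f["data"].shape
    last_rows = f["data"][-2:]
print(final_shape)
```

Keeping one file per process avoids the question of concurrent writes during generation; the per-process files can then be merged in a separate step.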

* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* remove redundant variable
* remove trailing whitespace
* fix issues from PR
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* add configurable and dynamic wait and warmup times for the profiler
* remove old plot
* move horovod import
* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* temporary changes
* remove redundant variable
* add absolute time plot
* remove trailing whitespace
* remove redundant variable
* remove trailing whitespace
* begin implementation of backup
* fix issues from PR
* fix issues from PR
* add backup to gpu monitoring
* fix import in eurac trainer
* cleanup backup mechanism slightly
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup to gpu monitoring
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* add configurable and dynamic wait and warmup times for the profiler
* temporary changes
* add absolute time plot
* begin implementation of backup
* add backup to gpu monitoring
* cleanup backup mechanism slightly
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* fix linting errors
* fix more linting errors
* add utilization percentage plot
* run isort for linting
* update default save path for metrics
* add decorators to virgo and some cleanup
* add contributions and cleanup
* fix linting errors
* change 'credits' to 'credit'
* update communication plot style
* update function names
* update scalability function for a more streamlined approach
* run isort
* move horovod import
* fix linting errors
* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

jarlsondre added 5 commits November 6, 2024 14:17

update virgo generated dataset to use hdf5 format

3f7282f

add functionality for selecting output location

9a39414

set new data format as standard

865eda9

make virgo work with new data loader and add progress bar

01ae8c0

fix merge conflict

31fdbd2

jarlsondre added the enhancement (New feature or request) label Nov 7, 2024

jarlsondre self-assigned this Nov 7, 2024

matbun reviewed Nov 7, 2024

jarlsondre marked this pull request as draft November 7, 2024 15:04

jarlsondre and others added 10 commits November 7, 2024 16:32

remove old generation files and add script for concatenating hdf5 files

3e674e6

remove old generation files and add script for concatenating hdf5 files

4bc8cd6

Merge branch 'virgo-hdf5' of github.com:interTwin-eu/itwinai into virgo-hdf5

ec2e589

rename folder using hyphens

35d86ab

remove multiprocessing

4dc1dea

add multiprocessing at correct place

812696f

update handling of seed and num processes

c039052

make virgo work with new data loader and add progress bar

1d1343f

jarlsondre added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the enhancement and bug labels Nov 8, 2024

jarlsondre added 7 commits November 8, 2024 11:30

add contributors

060e572

update ruff settings in pyproject

8c18316

update virgo dataset concatenation

bed5249

fix merge

b7e72a7

add isort option to ruff

46b8c96

break imports on purpose

933fce0

break more imports to test

c094bb6

jarlsondre marked this pull request as ready for review November 8, 2024 15:25

jarlsondre marked this pull request as draft November 8, 2024 15:25

remove ruff config file

0017e0d
