Main memory overloading when training using DICE-embeddings library #1

Open

sshivam95 opened this issue Jun 9, 2024 · 7 comments

sshivam95 commented Jun 9, 2024

Main memory is getting overloaded because the unique entities and relations are held in RAM on the GPU nodes of Noctua 1 (180 GB usable main memory) and Noctua 2 (470 GB usable main memory). This leads to an Out of Memory (OOM) error from SLURM.


sshivam95 commented Jun 9, 2024

A solution to point 1 is to generate the indices of the unique entities and relations beforehand and convert the dataset into an index-transformed dataset.
Idea of incremental saving: #2
To avoid the memory-kill issue, once the shape of a numpy.memmap reaches a threshold (say 1 million triples), dump it to a backup file (initially, a .pickle file) and clear the memory-mapped variable.
Once the memory-mapped variable reaches the threshold again, the data in the pickle file is updated with the entries from the new memmap variable. This keeps the data mapping in the pickle file up to date without any single variable overloading the RAM, which reduces RAM usage and avoids a memory-kill error.
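A minimal sketch of this incremental-saving loop, assuming a fixed flush threshold and a single backup pickle file (the file names, the `triple_stream()` iterator, and the merge helper are hypothetical, for illustration only):

```python
import os
import pickle
import numpy as np

THRESHOLD = 1_000_000                    # assumed flush threshold (triples)
BACKUP_PATH = "triples_backup.pickle"    # hypothetical backup file

def flush_to_backup(buffer: np.ndarray, n_filled: int) -> None:
    """Merge the filled part of the memmap buffer into the pickle backup."""
    new_rows = np.array(buffer[:n_filled])          # copy the filled rows out of the memmap
    if os.path.exists(BACKUP_PATH):
        with open(BACKUP_PATH, "rb") as f:
            stored = pickle.load(f)
        stored = np.concatenate([stored, new_rows])
    else:
        stored = new_rows
    with open(BACKUP_PATH, "wb") as f:
        pickle.dump(stored, f)

# memmap buffer holding (head, relation, tail) index triples
buffer = np.memmap("buffer.bin", dtype=np.int64, mode="w+", shape=(THRESHOLD, 3))
n_filled = 0

for triple in triple_stream():                      # hypothetical iterator over index triples
    buffer[n_filled] = triple
    n_filled += 1
    if n_filled == THRESHOLD:
        flush_to_backup(buffer, n_filled)
        n_filled = 0                                # reuse the buffer, keeping RAM bounded

if n_filled:                                        # flush the remainder
    flush_to_backup(buffer, n_filled)
```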


sshivam95 commented Jun 9, 2024

Initially, I ran individual tests on different portions of the dataset to evaluate this pickle-based approach. It works for smaller datasets of up to 2 million triples but fails beyond that.


sshivam95 commented Jun 9, 2024

Alternative solution: see the linked comment on Issue #2.


sshivam95 commented Jun 9, 2024

Another proposal is to use the mmappickle library, which is designed for "unstructured" parallel access, with a strong emphasis on adding new data. #4

Issues: the indexing is done directly in a memory-mapped file, in the form of dictionaries, using mmappickle.mmapdict:

  • This method takes a very long time because of an I/O bottleneck.
  • After finding the unique entities and relations of a chunk, each of these entities and relations is indexed in the mmappickle.mmapdict file.
  • This creates a bottleneck between the cluster node and the storage, and hence processing is very slow.
  • For example, for the full KG (cleaned) with 57,189,425,968 triples, a chunk of 10 million triples has 5,037,674 unique entities and 1,123 unique relations (both are indexed in parallel). Indexing the relations took 43 seconds, but after 17 hours only 26,200 triples had been processed. This is far too slow.

Writing to a memory-mapped file on the parallel file system of the Noctua clusters is very slow because Lustre handles memory-mapped files poorly. #5
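For reference, a minimal sketch of what this mmapdict-based indexing might look like, assuming mmapdict behaves as a standard dict-like mapping; the file names and the `chunk` iterable are assumptions for illustration. Every assignment writes through to the memory-mapped file on the parallel file system, which is where the node-to-storage bottleneck shows up:

```python
from mmappickle import mmapdict

# Memory-mapped dictionaries stored on the (Lustre) parallel file system.
entity_idx = mmapdict("entity_index.mmdpickle")      # hypothetical path
relation_idx = mmapdict("relation_index.mmdpickle")  # hypothetical path

next_entity_id = 0
next_relation_id = 0

# `chunk` is assumed to be an iterable of (head, relation, tail) string triples.
for head, relation, tail in chunk:
    for entity in (head, tail):
        if entity not in entity_idx:
            entity_idx[entity] = next_entity_id      # each write hits the memory-mapped file
            next_entity_id += 1
    if relation not in relation_idx:
        relation_idx[relation] = next_relation_id
        next_relation_id += 1
```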


sshivam95 commented Jun 9, 2024

Another solution is to use the DGX partition nodes, which have ~10 TB of fast local NVMe SSD storage and 8 GPUs; we can either work in memory or use the SSDs.

After running the training test on 1 chunk (10 million triples) using the dice-embeddings library, we get the following file sizes:

  • entity.p: 167 MB
  • relation.p: 41 KB
  • train_set.npy: 115 MB

The estimated file sizes for the full dataset (~57 billion triples):

  • relation.p: ~230 MB
  • entities.p: ≤2 TB
  • train_set.npy: ~1-2 TB
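For context, a rough back-of-envelope check of these estimates (my own arithmetic, assuming relation.p and train_set.npy grow roughly linearly with the number of triples; entities.p is harder to extrapolate because fewer new unique entities appear per chunk as the vocabulary saturates, hence the looser ≤2 TB bound):

```python
# Measured on a single chunk of 10 million triples (values from the comment above):
chunk_triples = 10_000_000
train_set_mb = 115            # train_set.npy
relation_p_kb = 41            # relation.p

full_triples = 57_189_425_968
scale = full_triples / chunk_triples                             # ~5,719 chunks

print(f"relation.p    ≈ {relation_p_kb * scale / 1e3:.0f} MB")   # ≈ 234 MB, close to the ~230 MB estimate
print(f"train_set.npy ≈ {train_set_mb * scale / 1e6:.2f} TB")    # ≈ 0.66 TB with 32-bit indices;
                                                                 # roughly double with 64-bit indices, hence ~1-2 TB
```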

@sshivam95

A workaround is to create the indexed train_set.npy beforehand rather than having dice-embeddings create it using its B+ tree implementation in C++.
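A minimal sketch of what pre-building the indexed train_set.npy could look like, assuming entity.p and relation.p are pickled string-to-index mappings and `triples` is an in-memory list of string triples (both assumptions; this is not dice-embeddings' own indexing path):

```python
import pickle
import numpy as np

# Assumed: entity/relation-to-index mappings built beforehand (e.g. the entity.p / relation.p files above).
with open("entity.p", "rb") as f:
    entity_to_idx = pickle.load(f)
with open("relation.p", "rb") as f:
    relation_to_idx = pickle.load(f)

# `triples` is assumed to be a list of (head, relation, tail) string triples.
indexed = np.array(
    [(entity_to_idx[h], relation_to_idx[r], entity_to_idx[t]) for h, r, t in triples],
    dtype=np.int32,   # int64 may be needed if the entity count exceeds the int32 range
)
np.save("train_set.npy", indexed)   # pre-built index file, in place of letting dice-embeddings build it
```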

@sshivam95

Update: Issue #9 describes a workaround for training embedding models on individual graphs by splitting the dataset based on domain. The domain of a triple is defined as the authority base URL of its namespace. Split the dataset into separate dataset files based on domain names and then train the models on these smaller graphs.
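A minimal sketch of such a domain-based split, taking the domain from the subject URI of each triple (an assumption; the output naming and the `triples` iterable are also hypothetical):

```python
from urllib.parse import urlparse

def domain_of(uri: str) -> str:
    """Authority (netloc) of a URI, e.g. 'dbpedia.org' for 'http://dbpedia.org/resource/X'."""
    return urlparse(uri).netloc

# `triples` is assumed to be an iterable of (subject, predicate, object) URI triples.
files = {}
try:
    for s, p, o in triples:
        dom = domain_of(s) or "no_domain"
        if dom not in files:
            files[dom] = open(f"split_{dom}.nt", "w")   # hypothetical per-domain output file
        files[dom].write(f"<{s}> <{p}> <{o}> .\n")
finally:
    for f in files.values():
        f.close()
```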
