Main memory overloading when training using DICE-embeddings library #1

Open

sshivam95 opened this issue Jun 9, 2024 · 7 comments

sshivam95 commented Jun 9, 2024

Main memory is getting overloaded because the unique entities and relations are held in RAM on the GPU nodes of Noctua 1 (180 GB usable main memory) and Noctua 2 (470 GB usable main memory). This leads to an Out of Memory (OOM) error from SLURM.


sshivam95 commented Jun 9, 2024

A solution to point 1 is to generate the indices of the unique entities and relations beforehand and convert the dataset into an index-transformed dataset.
Idea of incremental saving: #2
To avoid the memory-kill issue, once the shape of a numpy.memmap reaches a threshold (say 1 million triples), dump it to a backup file (initially, a .pickle file) and clear the memory-mapped variable.
Once the memory-mapped variable reaches the threshold again, the data in the pickle file is updated with the entries from the new memmap variable. This keeps the data mapping in the pickle file up to date without any single variable overloading the RAM, which reduces RAM usage and avoids a memory-kill error.
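A minimal sketch of this incremental-saving loop, assuming a fixed flush threshold and a single backup pickle file (the file names, the `triple_stream()` iterator, and the merge helper are hypothetical, for illustration only):

```python
import os
import pickle
import numpy as np

THRESHOLD = 1_000_000                    # assumed flush threshold (triples)
BACKUP_PATH = "triples_backup.pickle"    # hypothetical backup file

def flush_to_backup(buffer: np.ndarray, n_filled: int) -> None:
    """Merge the filled part of the memmap buffer into the pickle backup."""
    new_rows = np.array(buffer[:n_filled])          # copy the filled rows out of the memmap
    if os.path.exists(BACKUP_PATH):
        with open(BACKUP_PATH, "rb") as f:
            stored = pickle.load(f)
        stored = np.concatenate([stored, new_rows])
    else:
        stored = new_rows
    with open(BACKUP_PATH, "wb") as f:
        pickle.dump(stored, f)

# memmap buffer holding (head, relation, tail) index triples
buffer = np.memmap("buffer.bin", dtype=np.int64, mode="w+", shape=(THRESHOLD, 3))
n_filled = 0

for triple in triple_stream():                      # hypothetical iterator over index triples
    buffer[n_filled] = triple
    n_filled += 1
    if n_filled == THRESHOLD:
        flush_to_backup(buffer, n_filled)
        n_filled = 0                                # reuse the buffer, keeping RAM bounded

if n_filled:                                        # flush the remainder
    flush_to_backup(buffer, n_filled)
```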


sshivam95 commented Jun 9, 2024

Initially, I ran individual tests on different portions of the dataset to evaluate this pickle-based approach. It works for smaller datasets of up to 2 million triples but fails beyond that.


sshivam95 commented Jun 9, 2024

Alternative solution: see the linked comment on Issue #2.


sshivam95 commented Jun 9, 2024

Another proposal is to use the mmappickle library, which is designed for "unstructured" parallel access, with a strong emphasis on adding new data. #4

Issues: the indexing is done directly in a memory-mapped file, in the form of dictionaries, using mmappickle.mmapdict:

  • This method takes a very long time because of an I/O bottleneck.
  • After finding the unique entities and relations of a chunk, each of these entities and relations is indexed in the mmappickle.mmapdict file.
  • This creates a bottleneck between the cluster node and the storage, and hence processing is very slow.
  • For example, for the full KG (cleaned) with 57,189,425,968 triples, a chunk of 10 million triples has 5,037,674 unique entities and 1,123 unique relations (both are indexed in parallel). Indexing the relations took 43 seconds, but after 17 hours only 26,200 triples had been processed. This is far too slow.

Writing to a memory-mapped file on the parallel file system of the Noctua clusters is very slow because Lustre handles memory-mapped files poorly. #5
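For reference, a minimal sketch of what this mmapdict-based indexing might look like, assuming mmapdict behaves as a standard dict-like mapping; the file names and the `chunk` iterable are assumptions for illustration. Every assignment writes through to the memory-mapped file on the parallel file system, which is where the node-to-storage bottleneck shows up:

```python
from mmappickle import mmapdict

# Memory-mapped dictionaries stored on the (Lustre) parallel file system.
entity_idx = mmapdict("entity_index.mmdpickle")      # hypothetical path
relation_idx = mmapdict("relation_index.mmdpickle")  # hypothetical path

next_entity_id = 0
next_relation_id = 0

# `chunk` is assumed to be an iterable of (head, relation, tail) string triples.
for head, relation, tail in chunk:
    for entity in (head, tail):
        if entity not in entity_idx:
            entity_idx[entity] = next_entity_id      # each write hits the memory-mapped file
            next_entity_id += 1
    if relation not in relation_idx:
        relation_idx[relation] = next_relation_id
        next_relation_id += 1
```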


sshivam95 commented Jun 9, 2024

Another solution is to use the DGX partition nodes, which have ~10 TB of fast local NVMe SSD storage and 8 GPUs; we can either work in memory or use the SSDs.

After running the training test on 1 chunk (10 million triples) using the dice-embeddings library, we get the following file sizes:

  • entity.p: 167 MB
  • relation.p: 41 KB
  • train_set.npy: 115 MB

The estimated file sizes for the full dataset (~57 billion triples):

  • relation.p: ~230 MB
  • entities.p: ≤2 TB
  • train_set.npy: ~1-2 TB
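For context, a rough back-of-envelope check of these estimates (my own arithmetic, assuming relation.p and train_set.npy grow roughly linearly with the number of triples; entities.p is harder to extrapolate because fewer new unique entities appear per chunk as the vocabulary saturates, hence the looser ≤2 TB bound):

```python
# Measured on a single chunk of 10 million triples (values from the comment above):
chunk_triples = 10_000_000
train_set_mb = 115            # train_set.npy
relation_p_kb = 41            # relation.p

full_triples = 57_189_425_968
scale = full_triples / chunk_triples                             # ~5,719 chunks

print(f"relation.p    ≈ {relation_p_kb * scale / 1e3:.0f} MB")   # ≈ 234 MB, close to the ~230 MB estimate
print(f"train_set.npy ≈ {train_set_mb * scale / 1e6:.2f} TB")    # ≈ 0.66 TB with 32-bit indices;
                                                                 # roughly double with 64-bit indices, hence ~1-2 TB
```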

@sshivam95

A workaround is to create the indexed train_set.npy beforehand rather than having dice-embeddings create it using its B+ tree implementation in C++.
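A minimal sketch of what pre-building the indexed train_set.npy could look like, assuming entity.p and relation.p are pickled string-to-index mappings and `triples` is an in-memory list of string triples (both assumptions; this is not dice-embeddings' own indexing path):

```python
import pickle
import numpy as np

# Assumed: entity/relation-to-index mappings built beforehand (e.g. the entity.p / relation.p files above).
with open("entity.p", "rb") as f:
    entity_to_idx = pickle.load(f)
with open("relation.p", "rb") as f:
    relation_to_idx = pickle.load(f)

# `triples` is assumed to be a list of (head, relation, tail) string triples.
indexed = np.array(
    [(entity_to_idx[h], relation_to_idx[r], entity_to_idx[t]) for h, r, t in triples],
    dtype=np.int32,   # int64 may be needed if the entity count exceeds the int32 range
)
np.save("train_set.npy", indexed)   # pre-built index file, in place of letting dice-embeddings build it
```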

@sshivam95

Update: Issue #9 describes a workaround for training embedding models on individual graphs by splitting the dataset based on domain. The domain of a triple is defined as the authority base URL of its namespace. Split the dataset into separate dataset files based on domain names and then train the models on these smaller graphs.
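A minimal sketch of such a domain-based split, taking the domain from the subject URI of each triple (an assumption; the output naming and the `triples` iterable are also hypothetical):

```python
from urllib.parse import urlparse

def domain_of(uri: str) -> str:
    """Authority (netloc) of a URI, e.g. 'dbpedia.org' for 'http://dbpedia.org/resource/X'."""
    return urlparse(uri).netloc

# `triples` is assumed to be an iterable of (subject, predicate, object) URI triples.
files = {}
try:
    for s, p, o in triples:
        dom = domain_of(s) or "no_domain"
        if dom not in files:
            files[dom] = open(f"split_{dom}.nt", "w")   # hypothetical per-domain output file
        files[dom].write(f"<{s}> <{p}> <{o}> .\n")
finally:
    for f in files.values():
        f.close()
```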
