Performance issues with ogbl-biokg graph in DeepSNAP #40

Open
sophiakrix opened this issue Jan 20, 2022 · 1 comment

sophiakrix commented Jan 20, 2022

Hello,

I am trying to use the ogbl-biokg dataset (docs | github) with the DeepSNAP package. The graph has 5,088,434 edges and 93,773 nodes. I created a custom dataset (link to the code), but I am running into massive performance issues.

The problem is that it takes more than 30 min for the graph to process and generate the HeteroGraph object:

hetero = HeteroGraph(G)

In addition, memory consumption is too high once I start training: even on a node with 256 GB the job always crashes. I am using the graph for link prediction with the heterogeneous GraphSAGE model (from the DeepSNAP tutorial Colab).
I think the problem might be the use of NetworkX in the backend. I tried loading the graph with the StellarGraph package via NumPy arrays, which are much more efficient: the whole graph loads within a minute, even on a CPU.
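
To illustrate why array-based storage is so much lighter, here is a minimal sketch of a heterogeneous graph kept as one integer array per edge type. The node types, relation names, and helper functions are hypothetical, and the layout (a dict keyed by (head type, relation, tail type) plus a per-type node count) is only assumed to resemble what the ogb package exposes as edge_index_dict and num_nodes_dict. At the scale quoted above, 5,088,434 edges with two int64 indices each come to roughly 81 MB.

```python
import numpy as np

# Toy stand-in for a heterogeneous graph; type and relation names are hypothetical.
num_nodes = {"drug": 4, "protein": 3}

# One (2, num_edges) int64 array per (head type, relation, tail type) triple.
# Row 0 holds head indices, row 1 holds tail indices, both local to their node type.
edge_index = {
    ("drug", "targets", "protein"): np.array(
        [[0, 1, 2, 3], [0, 0, 1, 2]], dtype=np.int64
    ),
    ("protein", "interacts", "protein"): np.array(
        [[0, 1], [1, 2]], dtype=np.int64
    ),
}


def total_edges(edge_index):
    """Count edges across all edge types."""
    return sum(arr.shape[1] for arr in edge_index.values())


def edge_bytes(edge_index):
    """Memory used by the raw index arrays, in bytes."""
    return sum(arr.nbytes for arr in edge_index.values())


def check_bounds(edge_index, num_nodes):
    """Verify that every index refers to an existing node of the right type."""
    for (head, _, tail), arr in edge_index.items():
        assert arr.shape[0] == 2
        assert arr.min() >= 0
        assert arr[0].max() < num_nodes[head]
        assert arr[1].max() < num_nodes[tail]


check_bounds(edge_index, num_nodes)
print(total_edges(edge_index), "edges,", edge_bytes(edge_index), "bytes")
```

A per-edge Python object graph, by contrast, pays for dictionary entries and boxed integers on every node and edge, which is one plausible reason the NetworkX route is slower and heavier.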

Do you have any suggestions for how to load the data into DeepSNAP more efficiently? Or could you possibly integrate the ogbl-biokg graph as a dataset into your library, considering that the ogb package is also part of snap-stanford? This would be very helpful!

@zechengz
Collaborator

Hi,

Thanks for pointing this out. You are right that handling the graph data through a NetworkX graph object is not efficient. However, the slow HeteroGraph construction is probably caused mainly by what DeepSNAP does internally: it transforms the NetworkX graph into tensors and, in the link prediction case, it also has to split multiple negative edges. These steps can cause both the performance and the memory issues.

One potential solution is to skip DeepSNAP if you don't need to manipulate the graph heavily (for example, during training) and to use PyG directly with its transform functions. The heterogeneous functionality has also been merged into PyG, so you can use it from there directly.

If you do have heavy graph manipulation requirements and need the graph algorithms from NetworkX, you can try feeding tensors directly, such as in this example, but I am not sure whether this works for the link prediction task, and I am also not sure about its performance. I will benchmark this and try to find the performance issue when I have time.
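
If materializing negative edges up front does turn out to be the memory bottleneck, one general alternative is to draw negatives per mini-batch instead. The sketch below is plain NumPy and is not what DeepSNAP or PyG do internally; the function names and parameters are hypothetical, and the sampled tails are not filtered against true edges, so a few may be false negatives.

```python
import numpy as np


def sample_negative_tails(pos_edges, num_tail_nodes, num_neg, rng):
    """Draw `num_neg` random tail indices for each positive edge.

    pos_edges: int array of shape (2, batch_size) with (head, tail) rows.
    Returns an int array of shape (batch_size, num_neg).
    """
    batch_size = pos_edges.shape[1]
    return rng.integers(0, num_tail_nodes, size=(batch_size, num_neg))


def iterate_batches(edge_index, num_tail_nodes, batch_size, num_neg, seed=0):
    """Yield (positive edges, negative tails) one shuffled mini-batch at a time."""
    rng = np.random.default_rng(seed)
    num_edges = edge_index.shape[1]
    perm = rng.permutation(num_edges)
    for start in range(0, num_edges, batch_size):
        idx = perm[start:start + batch_size]
        pos = edge_index[:, idx]
        # Negatives exist only for the lifetime of this batch.
        neg_tails = sample_negative_tails(pos, num_tail_nodes, num_neg, rng)
        yield pos, neg_tails
```

With this pattern, peak memory for negatives scales with batch_size * num_neg rather than with the total number of edges.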

Thanks.
