Dataset Description #11

Open
Tianqi-py opened this issue Nov 25, 2022 · 6 comments

Comments

@Tianqi-py

Hi there,

I was analyzing the graph dataset SIDER used in this paper and had difficulty understanding how the adj matrix is used in the model.

For example, the train adj has 1141 rows, where each row corresponds to one training data point. But each row has a different length, and the entries are all zeros and ones. Could you explain how the adj matrix is saved here, or maybe add a dataset description file to the repo?

Also, how is the adj matrix split? In the classification task, where the features from the valid and test data are used to generate the representation of the training data, the adj_train should be asymmetrical and directed.

Thanks for your help in advance!

@jacklanchantin
Collaborator

the adjacency matrix processing is done here: https://github.com/QData/LaMP/blob/master/utils/utils.py#L86

does that help?

@Tianqi-py
Author

Thanks for your quick reply :) I understand the full adjacency matrix is symmetrical and generated by this function. Could you please explain what the lines in data["train"]["adj"] mean?

@jacklanchantin
Collaborator

That's the train split adjacency matrix (should be either a full adjacency matrix or sparse representation).

@Tianqi-py
Author

Thanks again! They are not a full adj matrix, which should be (1141, 1141) for the training data... Is there any chance you could tell me in what sparse form they are saved? I have difficulty interpreting this matrix...

@jacklanchantin
Collaborator

There are 1,141 samples (see table 5 in paper)

See the adj_insts var in DataLoader. That's what SIDER uses.

@Tianqi-py
Author

Thanks for your help :) After checking the code I figured out my confusion. Just for future reference, in case anybody else is confused about the adj matrix:

As mentioned in the paper, LaMP can make use of the original graph structure for message passing. SIDER is the dataset with a prior graph structure. Normally, the adj matrix of a graph summarizes the graph structure and has the shape (n, n), with n being the number of nodes in the graph. If the adj matrix is too big, there are many sparse formats in which it can be saved.
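To make the dense-versus-sparse point concrete, here is a minimal sketch of one common sparse format, a coordinate (edge) list. This is a generic illustration and not necessarily the format LaMP uses; the helper names `dense_to_coo` and `coo_to_dense` are hypothetical and do not come from the repo.

```python
def dense_to_coo(adj):
    """Return (n, edges), where edges holds the (row, col) of every nonzero entry."""
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in range(n) if adj[i][j]]
    return n, edges


def coo_to_dense(n, edges):
    """Rebuild the dense (n, n) 0/1 matrix from the edge list."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1
    return adj


# A small undirected path graph 0 - 1 - 2 (hypothetical example data).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
n, edges = dense_to_coo(adj)
restored = coo_to_dense(n, edges)
```

The edge list stores only the nonzero entries, so its size grows with the number of edges rather than with n squared.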

In particular, in the implementation of LaMP, the adj matrix is saved as a list, and each element in the list corresponds to the adj matrix of one node, which explains why each line in the adj matrix has a different length (nodes have different numbers of neighbors). The function "construct_adj_mat" in dataloader.py converts each line (1D) in the adj matrix into a 2D adj matrix. The final adj matrix used by the model is adj_insts, which is a list of 2D adj matrices with different shapes.
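A rough sketch of the 1D-to-2D step described above. It assumes each stored line is a row-major flattening of a square k x k matrix, which is my guess and not something confirmed from the repo; the real "construct_adj_mat" may work differently. The helper `unflatten_adj` and the sample data are hypothetical.

```python
import math


def unflatten_adj(flat):
    """Reshape a flat 0/1 list of length k*k into a k x k nested list.

    Assumption: the line is a row-major flattening of a square matrix.
    """
    k = math.isqrt(len(flat))
    if k * k != len(flat):
        raise ValueError("line length is not a perfect square")
    return [list(flat[r * k:(r + 1) * k]) for r in range(k)]


# Two lines of different lengths, mimicking the ragged rows in the saved data
# (made-up values, not taken from SIDER).
adj_rows = [
    [0, 1, 1, 0],                 # 4 entries -> 2 x 2
    [0, 1, 0, 1, 0, 1, 0, 1, 0],  # 9 entries -> 3 x 3
]
adj_insts = [unflatten_adj(row) for row in adj_rows]
shapes = [(len(m), len(m[0])) for m in adj_insts]
```

Under this assumption, the differing line lengths simply translate into 2D matrices of different shapes, which matches adj_insts being a list rather than a single tensor.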

Please let me know if I have misinterpreted anything :)
Thanks again for your help :)
