
Create base training model using a chunksize #9

Open
sshivam95 opened this issue Jun 11, 2024 · 6 comments

Comments

@sshivam95
Collaborator

sshivam95 commented Jun 11, 2024

The original approach of using the skiprows parameter together with nrows in pandas.read_csv was a bad idea.

Pandas implements skiprows in a bafflingly memory-intensive way. With skiprows=12_000_000_000, it is essentially doing skiprows = set(list(range(skiprows))), i.e. building a giant list and a set, each containing 12 billion row indices!
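
A minimal sketch of the abandoned call pattern (the file path, separator, and chunk size are assumptions; the dump is read here as a whitespace-separated file):

```python
import pandas as pd

# Hypothetical call pattern this issue is about: per the behaviour described
# above, pandas expands the integer skiprows into a list and a set of row
# indices, so skipping 12 billion rows allocates billions of Python ints
# before a single row is read.
chunk = pd.read_csv(
    "dump.nt",                 # placeholder path to the large N-Triples dump
    sep=" ",
    header=None,
    skiprows=12_000_000_000,   # rows already consumed in earlier chunks
    nrows=10_000_000,          # size of the next chunk (assumed value)
)
```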

@sshivam95
Collaborator Author

Use the iterator=True option in pandas.read_csv.
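
A minimal sketch of this suggestion, assuming the dump can be tokenized as a whitespace-separated file; the path, column names, and chunk size are placeholders:

```python
import pandas as pd

# Read the dump incrementally instead of skipping already-seen rows on every call.
reader = pd.read_csv(
    "dump.nt",               # placeholder path
    sep=" ",
    header=None,
    names=["subject", "predicate", "object", "dot"],
    iterator=True,
)

while True:
    try:
        chunk = reader.get_chunk(10_000_000)  # next 10M rows (assumed size)
    except StopIteration:
        break
    # train_step(chunk)  # hypothetical per-chunk training call
reader.close()
```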

@sshivam95
Collaborator Author

Won't do. Instead, create chunk files once and use these files to train the models (a rough sketch follows below).
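
A rough sketch of materializing the chunk files once, so training later streams over the files instead of re-reading the original dump; the paths and chunk size are assumptions, not the actual pipeline:

```python
import pandas as pd

# Write fixed-size chunk files to disk in a single pass over the dump.
with pd.read_csv("dump.nt", sep=" ", header=None, chunksize=50_000_000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f"chunks/part_{i:05d}.nt", sep=" ", header=False, index=False)
```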

@sshivam95
Collaborator Author

Update: Partition the dataset by domain (the namespace in XML or the authority part of the base URL). Basically, the subject and object of a triple should share the same domain. If a subject is connected to a blank node, the blank node is assigned to the subject's domain.
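
A minimal sketch of this partitioning rule, assuming triples are already tokenized into subject/predicate/object strings; the function names are hypothetical:

```python
from urllib.parse import urlparse

def authority(term):
    """Authority (host) part of an IRI, or None for blank nodes and literals."""
    if term.startswith("<") and term.endswith(">"):
        return urlparse(term[1:-1]).netloc
    return None

def partition_key(subject, obj):
    """Domain a triple belongs to: subject and object must share a domain,
    and a blank-node object stays in its subject's domain."""
    s_dom, o_dom = authority(subject), authority(obj)
    if o_dom is None or s_dom == o_dom:
        return s_dom
    return None  # cross-domain triple, to be handled separately
```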

@sshivam95
Collaborator Author

Update: For blank nodes connected to other blank nodes, we have to take care of the Concise Bounded Description (CBD).
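
A rough sketch of following blank-node chains when computing the CBD; the in-memory subject index is an assumption and would need to be replaced by a streaming pass for the large dumps:

```python
def cbd(node, triples_by_subject, seen=None):
    """Concise Bounded Description: all triples of `node`, plus the
    descriptions of any blank nodes appearing as objects, recursively."""
    seen = set() if seen is None else seen
    if node in seen:
        return []
    seen.add(node)
    description = []
    for predicate, obj in triples_by_subject.get(node, []):
        description.append((node, predicate, obj))
        if obj.startswith("_:"):  # blank-node object -> pull in its CBD too
            description.extend(cbd(obj, triples_by_subject, seen))
    return description
```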

sshivam95 added a commit that referenced this issue Jun 18, 2024
- This includes the datasets which exceed main memory as a whole, e.g. rdfa, hcard, microdata, jsonld

- A solution to Issue #9
@sshivam95
Collaborator Author

sshivam95 commented Jun 18, 2024

Update: Domain-specific datasets have been created (materialized) for the following formats:

  • adr_dataset
  • hcalendar_dataset
  • hlisting_dataset
  • hresume_dataset
  • rdfa_dataset
  • xfn_dataset
  • geo_dataset
  • hcard_dataset
  • hrecipe_dataset
  • hreview_dataset
  • species_dataset
  • jsonld_dataset
  • microdata_dataset

Next step: link them with Wikidata using LIMES, then clean them by removing literals and materializing the blank nodes.
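
The LIMES linking itself happens outside the code, but the cleaning step could look roughly like this rdflib sketch (paths are placeholders, and it assumes a per-domain file fits in memory after partitioning):

```python
from rdflib import Graph, Literal

g = Graph().parse("hcard_dataset.nt", format="nt")   # placeholder input

# Materialize blank nodes by Skolemizing them into IRIs.
g = g.skolemize()

# Remove all triples whose object is a literal.
for s, p, o in list(g):
    if isinstance(o, Literal):
        g.remove((s, p, o))

g.serialize("hcard_dataset_clean.nt", format="nt")   # placeholder output
```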

@sshivam95
Collaborator Author

sshivam95 commented Jun 27, 2024

Update: Discussed with Sherif how to minimize the number of KGs by merging smaller KGs into bigger ones when a subject already exists in the bigger one. A threshold needs to be found for the following materialized datasets (see the sketch after this list):

  • adr_dataset
  • hcalendar_dataset
  • hlisting_dataset
  • hresume_dataset
  • rdfa_dataset
  • xfn_dataset
  • geo_dataset
  • hcard_dataset
  • hrecipe_dataset
  • hreview_dataset
  • species_dataset
  • jsonld_dataset
  • microdata_dataset
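
A rough sketch of the merge rule under discussion; the overlap metric and the 0.5 value are placeholders, since finding the threshold is exactly the open question here:

```python
def should_merge(small_kg_subjects, big_kg_subjects, threshold=0.5):
    """Fold a small KG into a big one when enough of its subjects
    already occur in the big KG (threshold value still to be decided)."""
    if not small_kg_subjects:
        return False
    overlap = len(small_kg_subjects & big_kg_subjects) / len(small_kg_subjects)
    return overlap >= threshold
```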
