Create base training model using a chunksize #9
Comments
Use …
Won't do. Create files and use these files to train the models.
Update: Partition the dataset using domains (the namespace in XML or the authority part of the base URL). Basically, the domain of the subject and of the object should be the same. If …
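A minimal sketch of what such a domain-based partitioning could look like, assuming line-based N-Triples input and naive whitespace splitting; the file name `dataset.nt` is only an illustrative placeholder:

```python
# Sketch only: partition an N-Triples file by the authority (domain) of the
# subject URI and keep a triple only when the object is a URI with the same
# authority. The file name and the naive parsing are illustrative.
from collections import defaultdict
from typing import Optional
from urllib.parse import urlparse

def authority(term: str) -> Optional[str]:
    """Return the authority part of a URI term such as <http://example.org/x>."""
    if term.startswith("<") and term.endswith(">"):
        return urlparse(term[1:-1]).netloc or None
    return None  # blank node or literal

partitions = defaultdict(list)
with open("dataset.nt") as f:          # hypothetical input file
    for line in f:
        parts = line.rstrip(" .\n").split(" ", 2)
        if len(parts) != 3:
            continue
        s, p, o = parts
        s_dom, o_dom = authority(s), authority(o)
        # subject and object must live in the same domain
        if s_dom and o_dom and s_dom == o_dom:
            partitions[s_dom].append((s, p, o))
```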
Update: For blank nodes connected to other blank nodes, we have to take care of the CBD (Concise Bounded Description).
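A rough illustration of that blank-node handling, assuming the graph is held as a plain dict mapping each subject term to its (predicate, object) pairs; the data structure is an assumption for the sketch, not the repository's actual representation:

```python
# Sketch only: follow blank-node objects recursively so that chains of blank
# nodes end up in the same partition, in the spirit of a Concise Bounded
# Description. `graph` maps a subject term to its (predicate, object) pairs.
def cbd(graph, resource):
    triples, queue, seen = [], [resource], set()
    while queue:
        node = queue.pop()
        if node in seen:
            continue
        seen.add(node)
        for p, o in graph.get(node, []):
            triples.append((node, p, o))
            if o.startswith("_:"):   # blank node object: pull in its description too
                queue.append(o)
    return triples
```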
- This includes datasets which exceed main memory as a whole, e.g. rdfa, hcard, microdata, jsonld. A solution to Issue #9.
Update: Domain-specific datasets have been created for the following dataset formats:
Link them with Wikidata using LIMES, then clean them by removing literals and materializing the blank nodes.
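One possible reading of that cleaning step, assuming "materializing" the blank nodes means Skolemizing them into stable IRIs; the base IRI is invented for the example, and the actual linking with Wikidata would be configured in LIMES separately:

```python
# Sketch only: drop literal triples and Skolemize blank nodes into IRIs.
# The base IRI is a made-up example.
def clean(triples, base="http://example.org/.well-known/genid/"):
    def skolemize(term):
        return f"<{base}{term[2:]}>" if term.startswith("_:") else term

    cleaned = []
    for s, p, o in triples:
        if o.startswith('"'):          # drop triples with literal objects
            continue
        cleaned.append((skolemize(s), p, skolemize(o)))
    return cleaned
```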
Update: Discussion with Sherif on minimizing the number of KGs by merging the smaller KGs into bigger ones if the subject exists in the bigger one. A threshold needs to be found for the following datasets:
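A sketch of that merging heuristic, with the overlap threshold left as a free parameter; the 0.5 default and the list-of-triples representation are placeholders for the values and structures still to be decided:

```python
# Sketch only: fold a small KG into a bigger one when enough of its subjects
# already occur in the bigger KG. The 0.5 default stands in for the
# threshold that still has to be determined.
def maybe_merge(small, big, threshold=0.5):
    small_subjects = {s for s, _, _ in small}
    big_subjects = {s for s, _, _ in big}
    if not small_subjects:
        return big, False
    overlap = len(small_subjects & big_subjects) / len(small_subjects)
    if overlap >= threshold:
        return big + small, True       # merge the small KG into the big one
    return big, False
```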
The original idea of using the `skiprows` parameter together with `nrows` in `pandas.read_csv` was a bad one. Pandas implements `skiprows` in a bafflingly memory-intensive way: on `skiprows=12_000_000_000`, it is basically doing `skiprows = set(list(range(skiprows)))`, building a giant list and a giant set, each containing 12 billion row indices!