dataset stats: include table size #272

Open
dmpetrov opened this issue Aug 11, 2024 · 4 comments

@dmpetrov
Member

We store the number of records and file sizes in dataset records, which is convenient to have. In addition, it would be very convenient to have the size of the physical table for a dataset. This is needed because in some cases a table can take 90 GB (laion with 12M files).

@dmpetrov
Member Author

@iterative/datachain I'd appreciate your opinion on this. How realistic is it to get this information from the real DBs?

@dreadatour
Contributor

> How realistic is it to get this information from the real DBs?

Easy.

SQLite

We can use dbstat to get table size:

sqlite> SELECT SUM("pgsize") FROM "dbstat" WHERE name='ds_xxx';
5980180480
sqlite>

dbstat is only available when SQLite is built with the SQLITE_ENABLE_DBSTAT_VTAB compile-time option. We can check whether it is enabled using PRAGMA compile_options (the option is listed without the SQLITE_ prefix):

sqlite> PRAGMA compile_options;
...
ENABLE_DBSTAT_VTAB
...
sqlite>
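The two SQLite steps above can be combined in Python. This is only a minimal sketch using the standard `sqlite3` module; the helper names `dbstat_available` and `sqlite_table_size` are made up for illustration and are not part of datachain. It assumes the table lives in the main schema and returns `None` when dbstat is not compiled in or the table does not exist.

```python
import sqlite3
from typing import Optional


def dbstat_available(conn: sqlite3.Connection) -> bool:
    """Return True if this SQLite build lists ENABLE_DBSTAT_VTAB."""
    options = {row[0] for row in conn.execute("PRAGMA compile_options")}
    return "ENABLE_DBSTAT_VTAB" in options


def sqlite_table_size(conn: sqlite3.Connection, table: str) -> Optional[int]:
    """Sum the page sizes dbstat reports for `table` (hypothetical helper).

    Returns None if dbstat is unavailable or no pages match the name.
    """
    if not dbstat_available(conn):
        return None
    row = conn.execute(
        'SELECT SUM("pgsize") FROM "dbstat" WHERE name = ?', (table,)
    ).fetchone()
    return row[0]
```

Note that dbstat reports indexes under their own names, so this counts only the pages of the table itself.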

ClickHouse

4d6e8a4704d9 :) SELECT sum(bytes) as size FROM system.parts WHERE table = 'ds_xxx';

SELECT sum(bytes) AS size
FROM system.parts
WHERE `table` = 'ds_xxx'

Query id: a8595e26-a522-4d7a-8155-bb3ba4b1c68a

   ┌────────size─┐
1. │ 10973571294 │ -- 10.97 billion
   └─────────────┘

1 row in set. Elapsed: 0.009 sec.

4d6e8a4704d9 :)
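The same query could be wrapped in a small Python helper. This is a sketch, not datachain code: `clickhouse_table_size` is a hypothetical name, and `execute` stands for any callable that takes a query string plus a parameter dict and returns a list of row tuples. The `%(table)s` placeholder style is an assumption about the client in use and should be adjusted to whatever the real client expects.

```python
from typing import Any, Callable, Dict, Sequence

# Same query as in the session above, with the table name as a parameter.
TABLE_SIZE_QUERY = (
    "SELECT sum(bytes) AS size FROM system.parts WHERE table = %(table)s"
)


def clickhouse_table_size(
    execute: Callable[[str, Dict[str, Any]], Sequence[Sequence[Any]]],
    table: str,
) -> int:
    """Return the summed part sizes for `table` (hypothetical helper)."""
    rows = execute(TABLE_SIZE_QUERY, {"table": table})
    # An empty result is treated as size 0.
    return int(rows[0][0]) if rows else 0
```

system.parts also has an `active` column, so it may be worth adding `AND active` to skip parts that were already merged but not yet removed.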

@dmpetrov
Member Author

How will this look in the dataset schemas that users get from the API? You recently introduced these, right?

@dreadatour
Contributor

We already have num_objects and size fields in the dataset_version table (source code). There is a method that updates these fields, and there is code that gets these stats from the DB.

The steps to implement physical table size are:

  1. Update the dataset_version table and add a size_bytes field (the name is open to discussion). Optionally, use a single JSON field (stats?) that holds all stats, similar to the preview field.
  2. Update the other methods to handle the new dataset stat field. The implementation will differ per DB.
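To make the optional JSON field from step 1 more concrete, here is a rough sketch of what such a stats payload might look like. The `DatasetStats` class and its methods are hypothetical; only the names num_objects, size, and size_bytes come from this thread, and size_bytes itself is still a tentative name.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DatasetStats:
    """Hypothetical container for all dataset version stats."""

    num_objects: Optional[int] = None  # number of records
    size: Optional[int] = None         # total size of the files
    size_bytes: Optional[int] = None   # physical table size (name not final)

    def to_json(self) -> str:
        """Serialize for storage in a single JSON column."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "DatasetStats":
        """Restore stats from the stored JSON string."""
        return cls(**json.loads(raw))
```

A single JSON field would let new stats be added later without another schema change, at the cost of making them harder to filter on in SQL.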

Sounds quite easy to me.
