Remove net from keyword blacklist #286

fstachura · 2024-06-06T09:37:52Z

I decided to do some stats for this keyword anyway, on 6.8 it is indexed as the 45th most common keyword (below, for example, page) with 27 499 references. The storage impact is negligible, the difference between reference databases is ~100kB.

tleb · 2024-06-06T10:23:01Z

Hi, two questions:

Can you share full data for all keywords (including those currently blacklisted)?
Can we get estimate storage size for each keyword without indexing once with and once without? Or maybe just the storage impact of the first keyword? If it is small enough it means blacklisting is not useful.

fstachura · 2024-06-06T11:05:04Z

Can you share full data for all keywords (including those currently blacklisted)?

Sure, I started the indexing with an empty list, will post results here when it's done.

Can we get estimate storage size for each keyword without indexing once with and once without? Or maybe just the storage impact of the first keyword? If it is small enough it means blacklisting is not useful.

I will have to read the code more closely to be 100% sure, but my current understanding is that references.db, for each keyword stores a string that contains information about all files (and lines within these files) that reference that particular keyword. Although I'm not yet sure how tags work with that (files seem to be identified by hashes, that probably solves the issue).

If the above is true, I'm not sure if it makes sense to do an index for each keyword on the blacklist. That would take a lot of time. I think that just measuring the size of that string for the most common keyword should approximate the storage impact enough to make a judgement.

I still would put basic C keywords/types like if and while into the blacklist, as I think that even having them clickable just does not make sense.

fstachura · 2024-06-06T19:44:51Z

Here is a list of top 500 most referenced keywords in an index created only from Linux 6.8:
https://gist.github.com/fstachura/3c5572fb4fd87a6b1c10d9b2546d1772

Keywords with a star are currently in the blocklist.

The list was created using this script:
https://gist.github.com/fstachura/ce4babfae2988edce466c69c073e268a

Database size without the blacklist:

402M    /srv/elixir-data/linux/data/references.db
443M    /srv/elixir-data/linux/data/definitions.db
8.0K    /srv/elixir-data/linux/data/variables.db
2.3M    /srv/elixir-data/linux/data/compatibledts.db
4.2M    /srv/elixir-data/linux/data/doccomments.db
1.4M    /srv/elixir-data/linux/data/compatibledts_docs.db
8.7M    /srv/elixir-data/linux/data/hashes.db
4.5M    /srv/elixir-data/linux/data/filenames.db
6.5M    /srv/elixir-data/linux/data/blobs.db
3.7M    /srv/elixir-data/linux/data/versions.db
876M    /srv/elixir-data/linux/data

Database size with the blacklist:

333M    /srv/elixir-data/linux/data-old/references.db
443M    /srv/elixir-data/linux/data-old/definitions.db
8.0K    /srv/elixir-data/linux/data-old/variables.db
2.3M    /srv/elixir-data/linux/data-old/compatibledts.db
4.2M    /srv/elixir-data/linux/data-old/doccomments.db
1.4M    /srv/elixir-data/linux/data-old/compatibledts_docs.db
8.7M    /srv/elixir-data/linux/data-old/hashes.db
4.5M    /srv/elixir-data/linux/data-old/filenames.db
6.5M    /srv/elixir-data/linux/data-old/blobs.db
3.7M    /srv/elixir-data/linux/data-old/versions.db
806M    /srv/elixir-data/linux/data-old/

Note that this is only for a single tag. There are 839 tags in https://github.com/torvalds/linux so size gain from completely disabling the blacklist could probably be measured in gigabytes.

Database with empty blocklist is currently up at https://elixir.fbstc.org/linux/latest/source

tleb · 2024-06-11T09:28:18Z

Here is the size of all database files. Total database size is 43G. The extract about Linux:

193M   ./linux/data/hashes.db
101M   ./linux/data/filenames.db
4.0G   ./linux/data/definitions.db
146M   ./linux/data/blobs.db
33M    ./linux/data/compatibledts.db
5.1M   ./linux/data/compatibledts_docs.db
8.0K   ./linux/data/variables.db
12G    ./linux/data/versions.db
14G    ./linux/data/references.db
110M   ./linux/data/doccomments.db

Seeing that the production server is not full and high usage tokens (like net) don't seem to explode the database size, it looks reasonable to remove most entries in the blocklist.

Last thing to check: could you index multiple tags without a blocklist and check how database size increases compared to with blocklist? The goal is to be able to extrapolate how the production database would behave. I can run analysis on the production database or even share it with you if that can help in the analysis.

Then it would be nice to read the full blocklist and reassess each word. I'm seeing stuff like skb that would be useful to index, in addition to net.

fstachura · 2024-06-14T17:21:37Z

Quick update: I started indexing the following tags few days ago: v3.1, v3.15, v3.5.6, v4.10.11, v4.17, v4.6, v5.14.15, v5.4, v5.9, v6.2, v6.8, v6.9.4.

fstachura · 2024-06-25T13:29:05Z

Sorry that this is taking so long. I hit #292 and at first I thought that maybe this is an issue with caching. Disabling the cache or explicitly syncing and closing the databases did not help.

fstachura · 2024-07-02T16:21:45Z

Here are the results for a tag by tag update:

Tags: v3.1, v3.15, v3.5.6, v4.10.11, v4.17, v4.6, v5.14.15, v5.4, v5.9, v6.2, v6.8, v6.9.4

Database without a blacklist - 3235 MB
Database with a blacklist - 2724 MB
511 MB difference, ~20% increase in database size.

Top 500 refs without a blacklist:
https://gist.github.com/fstachura/dfde40823cdf503fa6d47bd55589fd0c
Top 500 refs with a blacklist:
https://gist.github.com/fstachura/3fd1ccb565b42e7e280baaa66ab7c7c8

I will pick keywords to remove from the list soon. After the tag-by-tag update the top references lists seem consistent (same number of references for top 500 keywords).
Size differences between tag-by-tag and parallel update databases are not significant, ~20MB.

tleb · 2024-07-03T09:46:19Z

OK so clearly we have no reason to exclude symbols that might make sense to search, like struct net and others.

Leave only identifiers related to macros, control flow and type declarations. Change blacklist into a set.

fstachura · 2024-07-10T12:05:12Z

@tleb In the latest commit I only added control flow (also in macros), type declaration keywords and some literals. I'm definitely sure control flow should not be indexed. Also, indexing too much makes the visual cue of bolded links totally useless, and I think that makes code harder to read. See https://elixir.fbstc.org/linux/latest/source/init/do_mounts.c

Double underscore seems to be used mainly in macros.

I'm on the fence about static, extern and const - out of these I only think extern could be somewhat interesting, maybe const. register and volatile could maybe go to the "always index" list, maybe inline too.

I'm wondering what you think about built in types. I think they are way too common to be interesting. But I don't know, and Elixir does index some right now, for example bool. Same with uN types (yes, technically not built-in, but very popular).

I don't know about blacklisting variable names. I think almost all variable names could be useful, if elixir had an ability to somehow limit search scope or sort results in other orders than alphabetical.

The rest from the C keywords list seems at least somewhat interesting, but there a big degree of totally personal assessment here.

I don't know about auto and nullptr, these could potentially become used in the kernel at some point.

I haven't added any C++ keywords yet.

Remove net from keyword blacklist

af0fecc

tleb mentioned this pull request Jun 10, 2024

_Generic() is not handled on elixir.bootlin.com #237

Open

Remove most identifiers from blacklist

719b55f

Leave only identifiers related to macros, control flow and type declarations. Change blacklist into a set.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove net from keyword blacklist #286

Remove net from keyword blacklist #286

fstachura commented Jun 6, 2024

tleb commented Jun 6, 2024

fstachura commented Jun 6, 2024

fstachura commented Jun 6, 2024 •

edited

Loading

tleb commented Jun 11, 2024

fstachura commented Jun 14, 2024

fstachura commented Jun 25, 2024

fstachura commented Jul 2, 2024

tleb commented Jul 3, 2024

fstachura commented Jul 10, 2024

Remove net from keyword blacklist #286

Are you sure you want to change the base?

Remove net from keyword blacklist #286

Conversation

fstachura commented Jun 6, 2024

tleb commented Jun 6, 2024

fstachura commented Jun 6, 2024

fstachura commented Jun 6, 2024 • edited Loading

tleb commented Jun 11, 2024

fstachura commented Jun 14, 2024

fstachura commented Jun 25, 2024

fstachura commented Jul 2, 2024

tleb commented Jul 3, 2024

fstachura commented Jul 10, 2024

fstachura commented Jun 6, 2024 •

edited

Loading