Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove net from keyword blacklist #286

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Remove net from keyword blacklist #286

wants to merge 2 commits into from

Conversation

fstachura
Copy link
Collaborator

I decided to do some stats for this keyword anyway, on 6.8 it is indexed as the 45th most common keyword (below, for example, page) with 27 499 references. The storage impact is negligible, the difference between reference databases is ~100kB.

@tleb
Copy link
Member

tleb commented Jun 6, 2024

Hi, two questions:

  • Can you share full data for all keywords (including those currently blacklisted)?
  • Can we get estimate storage size for each keyword without indexing once with and once without? Or maybe just the storage impact of the first keyword? If it is small enough it means blacklisting is not useful.

@fstachura
Copy link
Collaborator Author

Can you share full data for all keywords (including those currently blacklisted)?

Sure, I started the indexing with an empty list, will post results here when it's done.

Can we get estimate storage size for each keyword without indexing once with and once without? Or maybe just the storage impact of the first keyword? If it is small enough it means blacklisting is not useful.

I will have to read the code more closely to be 100% sure, but my current understanding is that references.db, for each keyword stores a string that contains information about all files (and lines within these files) that reference that particular keyword. Although I'm not yet sure how tags work with that (files seem to be identified by hashes, that probably solves the issue).

If the above is true, I'm not sure if it makes sense to do an index for each keyword on the blacklist. That would take a lot of time. I think that just measuring the size of that string for the most common keyword should approximate the storage impact enough to make a judgement.

I still would put basic C keywords/types like if and while into the blacklist, as I think that even having them clickable just does not make sense.

@fstachura
Copy link
Collaborator Author

fstachura commented Jun 6, 2024

Here is a list of top 500 most referenced keywords in an index created only from Linux 6.8:
https://gist.github.com/fstachura/3c5572fb4fd87a6b1c10d9b2546d1772

Keywords with a star are currently in the blocklist.

The list was created using this script:
https://gist.github.com/fstachura/ce4babfae2988edce466c69c073e268a

Database size without the blacklist:

402M    /srv/elixir-data/linux/data/references.db
443M    /srv/elixir-data/linux/data/definitions.db
8.0K    /srv/elixir-data/linux/data/variables.db
2.3M    /srv/elixir-data/linux/data/compatibledts.db
4.2M    /srv/elixir-data/linux/data/doccomments.db
1.4M    /srv/elixir-data/linux/data/compatibledts_docs.db
8.7M    /srv/elixir-data/linux/data/hashes.db
4.5M    /srv/elixir-data/linux/data/filenames.db
6.5M    /srv/elixir-data/linux/data/blobs.db
3.7M    /srv/elixir-data/linux/data/versions.db
876M    /srv/elixir-data/linux/data

Database size with the blacklist:

333M    /srv/elixir-data/linux/data-old/references.db
443M    /srv/elixir-data/linux/data-old/definitions.db
8.0K    /srv/elixir-data/linux/data-old/variables.db
2.3M    /srv/elixir-data/linux/data-old/compatibledts.db
4.2M    /srv/elixir-data/linux/data-old/doccomments.db
1.4M    /srv/elixir-data/linux/data-old/compatibledts_docs.db
8.7M    /srv/elixir-data/linux/data-old/hashes.db
4.5M    /srv/elixir-data/linux/data-old/filenames.db
6.5M    /srv/elixir-data/linux/data-old/blobs.db
3.7M    /srv/elixir-data/linux/data-old/versions.db
806M    /srv/elixir-data/linux/data-old/

Note that this is only for a single tag. There are 839 tags in https://github.com/torvalds/linux so size gain from completely disabling the blacklist could probably be measured in gigabytes.

Database with empty blocklist is currently up at https://elixir.fbstc.org/linux/latest/source

@tleb
Copy link
Member

tleb commented Jun 11, 2024

Here is the size of all database files. Total database size is 43G. The extract about Linux:

193M   ./linux/data/hashes.db
101M   ./linux/data/filenames.db
4.0G   ./linux/data/definitions.db
146M   ./linux/data/blobs.db
33M    ./linux/data/compatibledts.db
5.1M   ./linux/data/compatibledts_docs.db
8.0K   ./linux/data/variables.db
12G    ./linux/data/versions.db
14G    ./linux/data/references.db
110M   ./linux/data/doccomments.db

Seeing that the production server is not full and high usage tokens (like net) don't seem to explode the database size, it looks reasonable to remove most entries in the blocklist.

Last thing to check: could you index multiple tags without a blocklist and check how database size increases compared to with blocklist? The goal is to be able to extrapolate how the production database would behave. I can run analysis on the production database or even share it with you if that can help in the analysis.

Then it would be nice to read the full blocklist and reassess each word. I'm seeing stuff like skb that would be useful to index, in addition to net.

@fstachura
Copy link
Collaborator Author

Quick update: I started indexing the following tags few days ago: v3.1, v3.15, v3.5.6, v4.10.11, v4.17, v4.6, v5.14.15, v5.4, v5.9, v6.2, v6.8, v6.9.4.

@fstachura
Copy link
Collaborator Author

Sorry that this is taking so long. I hit #292 and at first I thought that maybe this is an issue with caching. Disabling the cache or explicitly syncing and closing the databases did not help.

@fstachura
Copy link
Collaborator Author

Here are the results for a tag by tag update:

Tags: v3.1, v3.15, v3.5.6, v4.10.11, v4.17, v4.6, v5.14.15, v5.4, v5.9, v6.2, v6.8, v6.9.4

Database without a blacklist - 3235 MB
Database with a blacklist - 2724 MB
511 MB difference, ~20% increase in database size.

Top 500 refs without a blacklist:
https://gist.github.com/fstachura/dfde40823cdf503fa6d47bd55589fd0c
Top 500 refs with a blacklist:
https://gist.github.com/fstachura/3fd1ccb565b42e7e280baaa66ab7c7c8

I will pick keywords to remove from the list soon. After the tag-by-tag update the top references lists seem consistent (same number of references for top 500 keywords).
Size differences between tag-by-tag and parallel update databases are not significant, ~20MB.

@tleb
Copy link
Member

tleb commented Jul 3, 2024

OK so clearly we have no reason to exclude symbols that might make sense to search, like struct net and others.

Leave only identifiers related to macros, control flow and type declarations.
Change blacklist into a set.
@fstachura
Copy link
Collaborator Author

@tleb In the latest commit I only added control flow (also in macros), type declaration keywords and some literals. I'm definitely sure control flow should not be indexed. Also, indexing too much makes the visual cue of bolded links totally useless, and I think that makes code harder to read. See https://elixir.fbstc.org/linux/latest/source/init/do_mounts.c

Double underscore seems to be used mainly in macros.

I'm on the fence about static, extern and const - out of these I only think extern could be somewhat interesting, maybe const. register and volatile could maybe go to the "always index" list, maybe inline too.

I'm wondering what you think about built in types. I think they are way too common to be interesting. But I don't know, and Elixir does index some right now, for example bool. Same with uN types (yes, technically not built-in, but very popular).

I don't know about blacklisting variable names. I think almost all variable names could be useful, if elixir had an ability to somehow limit search scope or sort results in other orders than alphabetical.

The rest from the C keywords list seems at least somewhat interesting, but there a big degree of totally personal assessment here.

I don't know about auto and nullptr, these could potentially become used in the kernel at some point.

I haven't added any C++ keywords yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants