-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove net from keyword blacklist #286
base: master
Are you sure you want to change the base?
Conversation
Hi, two questions:
|
Sure, I started the indexing with an empty list, will post results here when it's done.
I will have to read the code more closely to be 100% sure, but my current understanding is that references.db, for each keyword stores a string that contains information about all files (and lines within these files) that reference that particular keyword. Although I'm not yet sure how tags work with that (files seem to be identified by hashes, that probably solves the issue). If the above is true, I'm not sure if it makes sense to do an index for each keyword on the blacklist. That would take a lot of time. I think that just measuring the size of that string for the most common keyword should approximate the storage impact enough to make a judgement. I still would put basic C keywords/types like if and while into the blacklist, as I think that even having them clickable just does not make sense. |
Here is a list of top 500 most referenced keywords in an index created only from Linux 6.8: Keywords with a star are currently in the blocklist. The list was created using this script: Database size without the blacklist:
Database size with the blacklist:
Note that this is only for a single tag. There are 839 tags in https://github.com/torvalds/linux so size gain from completely disabling the blacklist could probably be measured in gigabytes. Database with empty blocklist is currently up at https://elixir.fbstc.org/linux/latest/source |
Here is the size of all database files. Total database size is 43G. The extract about Linux:
Seeing that the production server is not full and high usage tokens (like Last thing to check: could you index multiple tags without a blocklist and check how database size increases compared to with blocklist? The goal is to be able to extrapolate how the production database would behave. I can run analysis on the production database or even share it with you if that can help in the analysis. Then it would be nice to read the full blocklist and reassess each word. I'm seeing stuff like |
Quick update: I started indexing the following tags few days ago: v3.1, v3.15, v3.5.6, v4.10.11, v4.17, v4.6, v5.14.15, v5.4, v5.9, v6.2, v6.8, v6.9.4. |
Sorry that this is taking so long. I hit #292 and at first I thought that maybe this is an issue with caching. Disabling the cache or explicitly syncing and closing the databases did not help. |
Here are the results for a tag by tag update: Tags: v3.1, v3.15, v3.5.6, v4.10.11, v4.17, v4.6, v5.14.15, v5.4, v5.9, v6.2, v6.8, v6.9.4 Database without a blacklist - 3235 MB Top 500 refs without a blacklist: I will pick keywords to remove from the list soon. After the tag-by-tag update the top references lists seem consistent (same number of references for top 500 keywords). |
OK so clearly we have no reason to exclude symbols that might make sense to search, like |
Leave only identifiers related to macros, control flow and type declarations. Change blacklist into a set.
@tleb In the latest commit I only added control flow (also in macros), type declaration keywords and some literals. I'm definitely sure control flow should not be indexed. Also, indexing too much makes the visual cue of bolded links totally useless, and I think that makes code harder to read. See https://elixir.fbstc.org/linux/latest/source/init/do_mounts.c Double underscore seems to be used mainly in macros. I'm on the fence about static, extern and const - out of these I only think extern could be somewhat interesting, maybe const. register and volatile could maybe go to the "always index" list, maybe inline too. I'm wondering what you think about built in types. I think they are way too common to be interesting. But I don't know, and Elixir does index some right now, for example bool. Same with uN types (yes, technically not built-in, but very popular). I don't know about blacklisting variable names. I think almost all variable names could be useful, if elixir had an ability to somehow limit search scope or sort results in other orders than alphabetical. The rest from the C keywords list seems at least somewhat interesting, but there a big degree of totally personal assessment here. I don't know about auto and nullptr, these could potentially become used in the kernel at some point. I haven't added any C++ keywords yet. |
I decided to do some stats for this keyword anyway, on 6.8 it is indexed as the 45th most common keyword (below, for example, page) with 27 499 references. The storage impact is negligible, the difference between reference databases is ~100kB.