
Front-end not inserting idents in source page #306

Open
tleb opened this issue Jul 24, 2024 · 6 comments


tleb commented Jul 24, 2024

See security/selinux/hooks.c in v6.10 for a page with the bug. The identifier definition exists, but the web front-end has not inserted links into the document.

Could it be related to #300?

@tleb tleb added the bug label Jul 24, 2024
@tleb tleb changed the title Front-end not showing idents Front-end not inserting idents in source page Jul 24, 2024

fstachura commented Jul 24, 2024

Could it be related to #300?

I hope not. https://many-tags-blacklist.elixir.fbstc.org/linux/latest/source/security/selinux/hooks.c#L1073 runs a1bff7a + some new-pygments iteration (notice there is no redirect on latest), and identifiers are cut off in the same place.

I suspect it has something to do with how the filters are designed. Expect a redesign proposal (a wall of text) soon.


fstachura commented Jul 25, 2024

On v6.10 there is a whole chunk without identifiers: it starts on line 1067 and ends on line 2583.

Quick explanation on how formatted code is supplied with identifier links:

  • tokenize-file is a (bash + perl regex) tokenizer that splits a source file into lines. Every other line is (supposed to be) a potential identifier; the rest are comments, operators, and other generally uninteresting tokens.
  • The result of the tokenizer is consumed by query.py. Every other line is looked up in the definitions database. If a match is found, unprintable markings are added at the beginning and end of the line.
  • Lines are then concatenated back into a string.
  • The string is formatted into HTML by Pygments.
  • Filters are used to find all interesting pieces of code, like identifiers and includes, and replace them with links in the HTML produced by Pygments. The ident filter is used to find identifiers. (To be precise, filters first replace interesting tokens in the untokenized file with their own special strings, and then replace these strings with links in the HTML produced by Pygments.)
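The alternating-line scheme above can be sketched roughly like this (a simplified illustration, not the actual query.py code; the marker bytes, helper name, and the pretend definitions set are all made up):

```python
# Simplified sketch of the pipeline described above. NOT the real elixir
# code: the marker bytes and the DEFS set are invented for illustration.

DEFS = {"selinux_state", "task_struct"}  # stand-in for the definitions database

def mark_identifiers(token_lines):
    """Every other token is a potential identifier; wrap the known ones
    in unprintable markers so they can be found again after Pygments
    formats the reassembled string into HTML."""
    out = []
    for i, tok in enumerate(token_lines):
        if i % 2 == 1 and tok in DEFS:  # odd positions are candidate identifiers
            out.append("\x01" + tok + "\x02")
        else:
            out.append(tok)
    return "".join(out)  # concatenate back into a single string

tokens = ["struct ", "task_struct", " *p = ", "get_current", "();\n"]
print(repr(mark_identifiers(tokens)))
```

If the tokenizer emits one wrong line, the even/odd alternation shifts for everything that follows, which is why a single confused token can break a whole region of the file.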

The problem here lies in the tokenizer: it gets confused, probably by the escaped quote, and treats a big chunk of code as a single token (maybe an interesting token, maybe not; I didn't count the lines and it doesn't matter). Try `./script.sh tokenize-file v6.9.4 /security/selinux/hooks.c | head -n3640 | tail -n5` (don't forget about the LXR_*_DIR environment variables). The whole chunk without identifiers should show up as a single line.
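To see concretely how an escaped quote can throw off a naive pattern, here is a toy demonstration in Python (not the actual perl regex used by tokenize-file): a string pattern that does not understand escapes treats the `"` inside the character literal `'\"'` as the start of a string, so every string boundary after it is shifted.

```python
import re

# C-like snippet: the character literal '\"' contains a double quote.
code = r"""if (c == '\"') x = "a"; y = "b"; /* later... */"""

# A naive string pattern with no notion of escape sequences: it starts a
# "string" at the quote inside '\"' and swallows text up to the next quote.
naive = re.findall(r'"[^"]*"', code)
print(naive)
```

Neither `"a"` nor `"b"` is found as a string; instead the matches straddle the real string boundaries. The same kind of shift, applied to a real source file, can swallow hundreds of lines as one token.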

We could of course try to massage the regex to make it handle escape sequences inside strings (assuming that is even possible, although AFAIK perl regexes are very powerful), but I have an idea for how to redesign this whole part (tl;dr: use Pygments lexers and extend Pygments a bit). I will post a full explanation soon.


tleb commented Jul 29, 2024

So the root cause is always the same quote escaping that fails parsing? Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?

fstachura commented:

So the root cause is always the same quote escaping that fails parsing?

Probably not always, I just wanted to see how often that happens because of quote escaping ('\"' seems to be particularly confusing).

Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?

I don't understand what you mean by the first part. I think it's the latter - these chunks are multiline strings (a single line with the newlines escaped) in the tokenize-file output.


Also, for another bug caused by the tokenizer getting confused, see https://elixir.bootlin.com/u-boot/v2023.10/source/fs/ext4/ext4_common.c#L33. An include with quotes following an include with angle brackets is treated as an identifier.

```
root@6756ab77f4b5:/srv/elixir-data# /usr/local/elixir/script.sh tokenize-file v2023.10 /fs/ext4/ext4_common.c C | head -n5
// SPDX-License-Identifier: GPL-2.0+/* * (C) Copyright 2011 - 2012 Samsung Electronics * EXT4 filesystem implementation in Uboot by * Uma Shankar <[email protected]> * Manjunatha C Achar <[email protected]> * * ext4ls and ext4load : Based on ext2 ls load support in Uboot. * * (C) Copyright 2004 * esd gmbh <www.esd-electronics.com> * Reinhard Arlt <[email protected]> * * based on code from grub2 fs/ext2.c and fs/fshelp.c by * GRUB  --  GRand Unified Bootloader * Copyright (C) 2003, 2004  Free Software Foundation, Inc. * * ext4write : Based on generic ext4 protocol. */#include <common.h>#include <blk.h>#include <ext_common.h>#include <ext4fs.h>#include <log.h>#include <malloc.h>#include <memalign.h>#include <part.h>#include <stddef.h>#include <linux/stat.h>#include <linux/time.h>#include <asm/byteorder.h>#
include
 "ext4_common.h"
struct
```


tleb commented Jul 31, 2024

Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?

I don't understand what you mean by the first part. I think it's the latter - these chunks are multiline strings (a single line with the newlines escaped) in the tokenize-file output.

The idea was a sort of mitigation: we know C does not support multi-line string literals, so any multiline chunk like this can only be a tokenizer bug (when parsing C code). We could modify the tokenizer to reset its state at each newline.
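A hypothetical sketch of that mitigation (toy Python, not existing elixir code): tokenize one physical line at a time, so an unterminated or mis-detected string can never swallow more than the line it starts on.

```python
import re

# String pattern that understands backslash escapes within one line.
STRING = re.compile(r'"(?:\\.|[^"\\])*"')

def find_strings_per_line(source):
    """Scan line by line: because C string literals cannot span physical
    lines, resetting at each newline bounds the damage of a parse error
    to a single line instead of the rest of the file."""
    matches = []
    for line in source.splitlines():
        matches.extend(STRING.findall(line))
    return matches

src = 'char *a = "ok";\nchar *b = "unterminated...\nchar *c = "fine";\n'
print(find_strings_per_line(src))
```

The unterminated string on the second line simply fails to match, and parsing recovers on the third line instead of consuming everything up to the next stray quote.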

That workaround would only make sense if we plan on expanding the current tokenizer script. Your #307 proposal makes more sense to me, though.
