
Front-end not inserting idents in source page #306

Open
tleb opened this issue Jul 24, 2024 · 6 comments


tleb commented Jul 24, 2024

See security/selinux/hooks.c in v6.10 for a page with the bug. The identifier definition exists, but the web front-end has not inserted links into the document.

Could it be related to #300?

@tleb tleb added the bug label Jul 24, 2024
@tleb tleb changed the title Front-end not showing idents Front-end not inserting idents in source page Jul 24, 2024

fstachura commented Jul 24, 2024

Could it be related to #300?

I hope not. https://many-tags-blacklist.elixir.fbstc.org/linux/latest/source/security/selinux/hooks.c#L1073 runs a1bff7a + some new-pygments iteration (notice there is no redirect on latest), and identifiers are cut off in the same place.

I suspect it has something to do with how the filters are designed. Expect a redesign proposal (a wall of text) soon.


fstachura commented Jul 25, 2024

On v6.10 there is a whole chunk without identifiers: it starts on line 1067 and ends on line 2583.

Quick explanation on how formatted code is supplied with identifier links:

  • tokenize-file is a (bash + perl regex) tokenizer that splits a source file into lines. Every other line is (supposed to be) a potential identifier; the rest are comments, operators, and other generally uninteresting tokens.
  • The result of the tokenizer is consumed by query.py. Every other line is looked up in the definitions database. If a match is found, unprintable markings are added at the beginning and end of the line.
  • Lines are then concatenated back into a string.
  • The string is formatted into HTML by Pygments.
  • Filters are used to find all interesting pieces of code, like identifiers and includes, and replace them with links in the HTML produced by Pygments. The ident filter is used to find identifiers. (To be precise, filters first replace interesting tokens in the untokenized file with their own special strings, and then replace these strings with links in the HTML produced by Pygments.)
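The alternating-line scheme above can be sketched roughly like this (a simplified illustration, not the actual query.py code; the marker bytes, helper name, and the pretend definitions set are all made up):

```python
# Simplified sketch of the pipeline described above. NOT the real elixir
# code: the marker bytes and the DEFS set are invented for illustration.

DEFS = {"selinux_state", "task_struct"}  # stand-in for the definitions database

def mark_identifiers(token_lines):
    """Every other token is a potential identifier; wrap the known ones
    in unprintable markers so they can be found again after Pygments
    formats the reassembled string into HTML."""
    out = []
    for i, tok in enumerate(token_lines):
        if i % 2 == 1 and tok in DEFS:  # odd positions are candidate identifiers
            out.append("\x01" + tok + "\x02")
        else:
            out.append(tok)
    return "".join(out)  # concatenate back into a single string

tokens = ["struct ", "task_struct", " *p = ", "get_current", "();\n"]
print(repr(mark_identifiers(tokens)))
```

If the tokenizer emits one wrong line, the even/odd alternation shifts for everything that follows, which is why a single confused token can break a whole region of the file.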

The problem here lies in the tokenizer: it gets confused, probably by the escaped quote, and treats a big chunk of code as a single token (maybe an interesting token, maybe not; I didn't count the lines and it doesn't matter). Try `./script.sh tokenize-file v6.9.4 /security/selinux/hooks.c | head -n3640 | tail -n5` (don't forget about the LXR_*_DIR environment variables). The whole chunk without identifiers should show up as a single line.
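To see concretely how an escaped quote can throw off a naive pattern, here is a toy demonstration in Python (not the actual perl regex used by tokenize-file): a string pattern that does not understand escapes treats the `"` inside the character literal `'\"'` as the start of a string, so every string boundary after it is shifted.

```python
import re

# C-like snippet: the character literal '\"' contains a double quote.
code = r"""if (c == '\"') x = "a"; y = "b"; /* later... */"""

# A naive string pattern with no notion of escape sequences: it starts a
# "string" at the quote inside '\"' and swallows text up to the next quote.
naive = re.findall(r'"[^"]*"', code)
print(naive)
```

Neither `"a"` nor `"b"` is found as a string; instead the matches straddle the real string boundaries. The same kind of shift, applied to a real source file, can swallow hundreds of lines as one token.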

We could of course try to massage the regex to make it handle escape sequences inside strings (assuming that is even possible, although AFAIK perl regexes are very powerful), but I have an idea for how to redesign this whole part (tl;dr: use Pygments lexers and extend Pygments a bit). I will post a full explanation soon.


tleb commented Jul 29, 2024

So the root cause is always the same quote escaping that fails parsing? Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?

fstachura commented:

So the root cause is always the same quote escaping that fails parsing?

Probably not always, I just wanted to see how often that happens because of quote escaping ('\"' seems to be particularly confusing).

Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?

I don't understand what you mean by the first part. I think it's the latter - these chunks are multiline strings (a single line with the newlines escaped) in the tokenize-file output.


Also, for another bug caused by the tokenizer getting confused, see https://elixir.bootlin.com/u-boot/v2023.10/source/fs/ext4/ext4_common.c#L33. An include with quotes following an include with angle brackets is treated as an identifier.

```
root@6756ab77f4b5:/srv/elixir-data# /usr/local/elixir/script.sh tokenize-file v2023.10 /fs/ext4/ext4_common.c C | head -n5
// SPDX-License-Identifier: GPL-2.0+/* * (C) Copyright 2011 - 2012 Samsung Electronics * EXT4 filesystem implementation in Uboot by * Uma Shankar <[email protected]> * Manjunatha C Achar <[email protected]> * * ext4ls and ext4load : Based on ext2 ls load support in Uboot. * * (C) Copyright 2004 * esd gmbh <www.esd-electronics.com> * Reinhard Arlt <[email protected]> * * based on code from grub2 fs/ext2.c and fs/fshelp.c by * GRUB  --  GRand Unified Bootloader * Copyright (C) 2003, 2004  Free Software Foundation, Inc. * * ext4write : Based on generic ext4 protocol. */#include <common.h>#include <blk.h>#include <ext_common.h>#include <ext4fs.h>#include <log.h>#include <malloc.h>#include <memalign.h>#include <part.h>#include <stddef.h>#include <linux/stat.h>#include <linux/time.h>#include <asm/byteorder.h>#
include
 "ext4_common.h"
struct
```


tleb commented Jul 31, 2024

Could we reset stuff at EOL so that it restarts parsing references on the next line? Or does it think those are valid multiline strings?

I don't understand what you mean by the first part. I think it's the latter - these chunks are multiline strings (a single line with the newlines escaped) in the tokenize-file output.

The idea was a sort of mitigation: we know C does not support multi-line string literals, so any multiline chunk like this can only be a tokenizer bug (when parsing C code). We could modify the tokenizer to reset its state at each newline.
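A hypothetical sketch of that mitigation (toy Python, not existing elixir code): tokenize one physical line at a time, so an unterminated or mis-detected string can never swallow more than the line it starts on.

```python
import re

# String pattern that understands backslash escapes within one line.
STRING = re.compile(r'"(?:\\.|[^"\\])*"')

def find_strings_per_line(source):
    """Scan line by line: because C string literals cannot span physical
    lines, resetting at each newline bounds the damage of a parse error
    to a single line instead of the rest of the file."""
    matches = []
    for line in source.splitlines():
        matches.extend(STRING.findall(line))
    return matches

src = 'char *a = "ok";\nchar *b = "unterminated...\nchar *c = "fine";\n'
print(find_strings_per_line(src))
```

The unterminated string on the second line simply fails to match, and parsing recovers on the third line instead of consuming everything up to the next stray quote.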

That workaround would only make sense if we plan on expanding the current tokenizer script. Your #307 proposal makes more sense to me, though.
