results: define output format/schema #721

mr-tz · 2023-05-05T06:34:25Z

to store and exchange results we'll need a new output schema, likely json

the UI will render this data (or parts of it as they become available, although this should be quick)

again, likely an array of objects (combining all other keys from the databases?) should work here

ooprathamm · 2024-03-11T13:38:17Z

@mr-tz I have worked on adding a new format to parse output JSON back into capa in the past [PR_#1396]. Can I look into this?

mr-tz · 2024-03-11T14:00:04Z

Sounds great, please take a look and let's discuss if you have any questions or a design draft.

ooprathamm · 2024-03-12T18:36:48Z

> (combining all other keys from the databases?) should work here

Could you please shed some light on this one?

williballenthin · 2024-03-13T06:08:18Z

QS uses a bunch of embedded databases to provide context about strings. Things like prevalence, library, version, etc. So all the information from each database should be merged into records about each recovered string.
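
A minimal sketch of that merge, assuming each database can be modeled as a plain mapping from a string to a small dict of facts. The database names, their contents, and `merge_records` are hypothetical and only illustrate the idea; they are not QS's actual API.

```python
from typing import Any

# Hypothetical stand-ins for the embedded databases; the entries are invented.
PREVALENCE_DB: dict[str, dict[str, Any]] = {
    "VirtualQuery": {"prevalence": "common"},
}
LIBRARY_DB: dict[str, dict[str, Any]] = {
    "deflate 1.2.13": {"library": "zlib", "version": "1.2.13"},
}


def merge_records(
    strings: list[str],
    databases: list[dict[str, dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Build one record per recovered string, merging the keys of every database hit."""
    records = []
    for s in strings:
        record: dict[str, Any] = {"string": s}
        for db in databases:
            # A database that does not know the string contributes nothing.
            record.update(db.get(s, {}))
        records.append(record)
    return records


records = merge_records(
    ["VirtualQuery", "deflate 1.2.13", "unique"],
    [PREVALENCE_DB, LIBRARY_DB],
)
```

With flat merging like this, two databases that use the same key would overwrite each other, which is one reason to consider grouping the information per database instead.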

ooprathamm · 2024-03-14T18:06:30Z

@williballenthin @mr-tz for further discussion and inputs, I have created a new PR #972 :)

mr-tz · 2024-03-22T09:28:06Z

Hi @ooprathamm, pulling the discussion to this issue.

On a higher design level we'll have to see how we want to deal with structure vs. tagged strings vs. other functionality. Ideally, we can decouple the storage and logic a bit. The current POC implementation is quite elegant but IMO combines multiple features, potentially complicating further work. On the other hand, we may keep the extraction logic and just change the resulting document.

In my head I currently have something like (based on some of your work, here, thanks!):

{
    "strings": {
        "static_strings": [
            {
                "string": {
                    "encoding": "ascii",
                    "slice": {
                        "range": {
                            "length": 40,
                            "offset": 77
                        }
                    },
                    "string": "!This program cannot be run in DOS mode."
                },
                "structure": "pe.header",
                "tags": [
                    "#common"
                ]
            },
            {
                "string": {
                    "encoding": "ascii",
                    "slice": {
                        "range": {
                            "length": 12,
                            "offset": 11644
                        }
                    },
                    "string": "VirtualQuery"
                },
                "structure": "import table",
                "tags": [
                    "#winapi",
                    "#common"
                ]
            }
        ]
    }
}

And/or we add a meta section storing the optional layout (PE, ELF) of a file.

This may require further discussion and be a larger effort but I'd be curious to hear your thoughts.
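
One way the document above could be produced is sketched below with standard-library dataclasses. The class names (`Range`, `Slice`, `ExtractedString`, `TaggedString`) and `to_document` are assumptions made for illustration, not the types used in the POC.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Range:
    offset: int
    length: int


@dataclass
class Slice:
    range: Range


@dataclass
class ExtractedString:
    string: str
    encoding: str
    slice: Slice


@dataclass
class TaggedString:
    string: ExtractedString
    structure: str
    tags: list[str] = field(default_factory=list)


def to_document(static_strings: list[TaggedString]) -> dict:
    """Wrap the records in the proposed top-level layout."""
    # asdict() recurses into nested dataclasses and lists.
    return {"strings": {"static_strings": [asdict(s) for s in static_strings]}}


doc = to_document(
    [
        TaggedString(
            string=ExtractedString(
                string="!This program cannot be run in DOS mode.",
                encoding="ascii",
                slice=Slice(range=Range(offset=77, length=40)),
            ),
            structure="pe.header",
            tags=["#common"],
        )
    ]
)
# sort_keys=True gives the alphabetical key order used in the example document.
print(json.dumps(doc, indent=4, sort_keys=True))
```

Keeping the types separate from `to_document` means the extraction logic could stay as it is while only the resulting document changes.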

williballenthin · 2024-03-22T12:57:41Z

Thanks for re-sparking this discussion @mr-tz.

I think things like location, length, encoding, and content of the string are part of the definition of the (static) string and should be at the top level, or under .string exactly as @mr-tz proposes.

Other information, like: structure, tags, and prevalence are more like "context" - things we assess about the string beyond its definition. I suspect each database/algorithm can provide its own context and we haven't explored all of them yet. So maybe all this context gets grouped together in an extensible way.

File layout seems orthogonal to (static) strings and probably should be stored separately from the strings. A presentation layer could stitch together all the data and make it look pretty.
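
A rough sketch of such a presentation layer, assuming the layout is stored as a flat list of named byte ranges. The region boundaries, `find_region`, and `render` are hypothetical; only the string offsets come from the example earlier in this thread.

```python
from typing import Optional

# Hypothetical layout regions; offsets and lengths are invented for illustration.
FILE_LAYOUT = [
    {"name": "pe.header", "offset": 0, "length": 1024},
    {"name": "import table", "offset": 11264, "length": 512},
]

# String definitions only: no structure or tags stored alongside them.
STRINGS = [
    {"offset": 77, "length": 40, "encoding": "ascii",
     "string": "!This program cannot be run in DOS mode."},
    {"offset": 11644, "length": 12, "encoding": "ascii", "string": "VirtualQuery"},
    {"offset": 123456, "length": 6, "encoding": "ascii", "string": "unique"},
]


def find_region(offset: int, layout: list[dict]) -> Optional[str]:
    """Return the name of the first region containing offset, or None."""
    for region in layout:
        start = region["offset"]
        if start <= offset < start + region["length"]:
            return region["name"]
    return None


def render(strings: list[dict], layout: list[dict]) -> list[str]:
    """Stitch the separately stored strings and layout into display lines."""
    lines = []
    for s in strings:
        region = find_region(s["offset"], layout) or "-"
        lines.append(f"{s['offset']:#010x}  {region:<12}  {s['string']}")
    return lines


for line in render(STRINGS, FILE_LAYOUT):
    print(line)
```

Because the structure is looked up at render time, the stored strings never need to change when the layout parser improves.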

ooprathamm · 2024-03-23T07:59:24Z

Thanks for the review, @mr-tz @williballenthin.
I agree the current POC restricts further work. Thanks for providing a view on the desired output structure.
I appreciate your detailed explanation. I agree that decoupling the storage and logic could give us a basis for incorporating advanced features without overcomplicating things, as done by FLOSS.
Given the points you've raised, I'm eager to incorporate your suggestions into the pull request.

mr-tz · 2024-09-23T05:14:14Z

location, length, encoding, and content of the string is part of the definition of the (static) string / top level or under .string

structure, tags, and prevalence are more like "context" - grouped together in an extensible way.

File layout should be stored separately from the strings.

An alternative representation could then look like this:

{
    "strings": {
        "static_strings": [
            {
                "id": 1,
                "encoding": "ascii",
                "offset": 77,
                "length": 40,
                "string": "!This program cannot be run in DOS mode."
            },
            {
                "id": 1337,
                "encoding": "ascii",
                "offset": 11644,
                "length": 12,
                "string": "VirtualQuery"
            },
            {
                "id": 9999,
                "encoding": "ascii",
                "offset": 123456,
                "length": 6,
                "string": "unique"
            }
        ],
        "context":
        {
            "1":
            {
                "structure": "pe.header",
                "tags": [
                    "#common"
                ]
            },
            "1337":
            {
                "structure": "import table",
                "tags": [
                    "#winapi",
                    "#common"
                ]
            }
            # no 9999 entry
        }
    },
    "file_layout": {
        ...
    }
}
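
A consumer of this representation would need to join each string with its context by id. The sketch below assumes the document is serialized as JSON, where object keys are always strings, so the integer id is converted with `str()` before the lookup. `join_context` is a hypothetical helper, and the embedded document is a trimmed copy of the example without `file_layout`.

```python
import json

# Trimmed copy of the proposed document; note the string keys under "context".
DOCUMENT = json.loads("""
{
    "strings": {
        "static_strings": [
            {"id": 1, "encoding": "ascii", "offset": 77, "length": 40,
             "string": "!This program cannot be run in DOS mode."},
            {"id": 1337, "encoding": "ascii", "offset": 11644, "length": 12,
             "string": "VirtualQuery"},
            {"id": 9999, "encoding": "ascii", "offset": 123456, "length": 6,
             "string": "unique"}
        ],
        "context": {
            "1": {"structure": "pe.header", "tags": ["#common"]},
            "1337": {"structure": "import table", "tags": ["#winapi", "#common"]}
        }
    }
}
""")


def join_context(document: dict) -> list[dict]:
    """Attach each string's context entry; strings without one get an empty dict."""
    strings = document["strings"]
    context = strings.get("context", {})
    return [
        {**s, "context": context.get(str(s["id"]), {})}
        for s in strings["static_strings"]
    ]


joined = join_context(DOCUMENT)
```

Strings such as id 9999 simply carry an empty context, so the `context` object only grows with strings that have something to say.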

mr-tz added the QS QUANTUMSTRAND label May 5, 2023

ooprathamm mentioned this issue Mar 14, 2024

QS: Output Schema #972

Open
