results: define output format/schema #721

Open
mr-tz opened this issue May 5, 2023 · 9 comments
Labels
QS QUANTUMSTRAND

Comments

@mr-tz
Collaborator

mr-tz commented May 5, 2023

To store and exchange results we'll need a new output schema, likely JSON.

The UI will render this data (or parts of it as they become available, although this should be quick).

again, likely an array of objects (combining all other keys from the databases?) should work here

@mr-tz mr-tz added the QS QUANTUMSTRAND label May 5, 2023
@ooprathamm
Contributor

@mr-tz I have worked on adding a new format to parse output JSON back into capa in the past [PR_#1396]. Can I look into this?

@mr-tz
Collaborator Author

mr-tz commented Mar 11, 2024

Sounds great, please take a look and let's discuss if you have any questions or a design draft.

@ooprathamm
Contributor

ooprathamm commented Mar 12, 2024

(combining all other keys from the databases?) should work here

Could you please shed some light on this one?

@williballenthin
Collaborator

QS uses a bunch of embedded databases to provide context about strings. Things like prevalence, library, version, etc. So all the information from each database should be merged into records about each recovered string.
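
As a rough illustration of that merge, here is a minimal Python sketch. The database names (`prevalence_db`, `library_db`), their keys, and their contents are made up for the example and are not QS's actual databases or API; each one is modeled as a plain dict keyed by string content.

```python
from typing import Any, Dict, List

# Hypothetical stand-ins for the embedded databases, keyed by string content.
prevalence_db: Dict[str, Dict[str, Any]] = {"VirtualQuery": {"prevalence": "common"}}
library_db: Dict[str, Dict[str, Any]] = {"VirtualQuery": {"library": "kernel32", "version": "unknown"}}


def merge_records(strings: List[str], databases: List[Dict[str, Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Build one record per recovered string, merging the keys from every database."""
    records = []
    for s in strings:
        record: Dict[str, Any] = {"string": s}
        for db in databases:
            # Later databases overwrite earlier ones if two of them use the same key.
            record.update(db.get(s, {}))
        records.append(record)
    return records


records = merge_records(["VirtualQuery", "unique"], [prevalence_db, library_db])
```

A string that no database knows about still gets a record, just with nothing beyond its own content. A flat merge like this is simple, but two databases using the same key would silently clash.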

@ooprathamm
Contributor

@williballenthin @mr-tz for further discussion and inputs, I have created a new PR #972 :)

@mr-tz
Collaborator Author

mr-tz commented Mar 22, 2024

Hi @ooprathamm, pulling the discussion to this issue.

On a higher design level we'll have to see how we want to deal with structure vs. tagged strings vs. other functionality. Ideally, we can decouple the storage and logic a bit. The current POC implementation is quite elegant but IMO combines multiple features, potentially complicating further work. On the other hand, we may keep the extraction logic and just change the resulting document.

In my head I currently have something like (based on some of your work, here, thanks!):

{
    "strings": {
        "static_strings": [
            {
                "string": {
                    "encoding": "ascii",
                    "slice": {
                        "range": {
                            "length": 40,
                            "offset": 77
                        }
                    },
                    "string": "!This program cannot be run in DOS mode."
                },
                "structure": "pe.header",
                "tags": [
                    "#common"
                ]
            },
            {
                "string": {
                    "encoding": "ascii",
                    "slice": {
                        "range": {
                            "length": 12,
                            "offset": 11644
                        }
                    },
                    "string": "VirtualQuery"
                },
                "structure": "import table",
                "tags": [
                    "#winapi",
                    "#common"
                ]
            }
        ]
    }
}

And/or we add a meta section storing the optional layout (PE, ELF) of a file.

This may require further discussion and be a larger effort but I'd be curious to hear your thoughts.
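
For reference, the nested layout in this comment could be modeled with plain dataclasses along these lines. This is only a sketch: the class names (`Range`, `Slice`, `ExtractedString`, `TaggedString`) and the `to_document` helper are assumptions made for the example, not existing types in the code base.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Range:
    offset: int
    length: int


@dataclass
class Slice:
    range: Range


@dataclass
class ExtractedString:
    string: str
    encoding: str
    slice: Slice


@dataclass
class TaggedString:
    string: ExtractedString
    structure: Optional[str] = None
    tags: List[str] = field(default_factory=list)


def to_document(static_strings: List[TaggedString]) -> dict:
    """Wrap the records in the {"strings": {"static_strings": [...]}} envelope."""
    return {"strings": {"static_strings": [asdict(s) for s in static_strings]}}


entry = TaggedString(
    string=ExtractedString(
        string="VirtualQuery",
        encoding="ascii",
        slice=Slice(range=Range(offset=11644, length=12)),
    ),
    structure="import table",
    tags=["#winapi", "#common"],
)
document_json = json.dumps(to_document([entry]), indent=4, sort_keys=True)
```

`dataclasses.asdict` recurses into nested dataclasses, so the serialized output keeps the same nesting as the proposed document.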

@williballenthin
Collaborator

williballenthin commented Mar 22, 2024

Thanks for re-sparking this discussion @mr-tz.

I think things like location, length, encoding, and content of the string are part of the definition of the (static) string and should be at the top level. Or under .string exactly as @mr-tz proposes.

Other information, like: structure, tags, and prevalence are more like "context" - things we assess about the string beyond its definition. I suspect each database/algorithm can provide its own context and we haven't explored all of them yet. So maybe all this context gets grouped together in an extensible way.

File layout seems orthogonal to (static) strings and probably should be stored separately from the strings. A presentation layer could stitch together all the data and make it look pretty.
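
One way to read "grouped together in an extensible way" is to give each database/algorithm its own namespace inside the context. The sketch below assumes that reading; the provider functions, their names, and the `PROVIDERS` registry are hypothetical and only exist for the example.

```python
from typing import Any, Callable, Dict, Optional

# A provider returns context for a string, or None when it has nothing to add.
ContextProvider = Callable[[str], Optional[Dict[str, Any]]]


def winapi_tags(s: str) -> Optional[Dict[str, Any]]:
    # Hypothetical lookup; a real provider would query an embedded database.
    known = {"VirtualQuery"}
    return {"tags": ["#winapi"]} if s in known else None


def structure_hint(s: str) -> Optional[Dict[str, Any]]:
    # Hypothetical: pretend only the DOS stub message maps to a structure.
    if s.startswith("!This program cannot be run"):
        return {"structure": "pe.header"}
    return None


PROVIDERS: Dict[str, ContextProvider] = {
    "winapi": winapi_tags,
    "structure": structure_hint,
}


def collect_context(s: str, providers: Dict[str, ContextProvider]) -> Dict[str, Dict[str, Any]]:
    """Group each provider's output under its own name so keys never collide."""
    context: Dict[str, Dict[str, Any]] = {}
    for name, provider in providers.items():
        result = provider(s)
        if result is not None:
            context[name] = result
    return context


ctx = collect_context("VirtualQuery", PROVIDERS)
```

Adding a new source of context then means registering one more provider, without touching the definition of the string itself.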

@ooprathamm
Contributor

Thanks for the review @mr-tz @williballenthin.
I agree the current POC restricts further work. Thanks for providing a view on the desired output structure.
I appreciate your detailed explanation. I agree that decoupling the storage and logic could give us a basis for incorporating advanced features without overcomplicating things the way floss does.
Given the points you've raised, I'm eager to incorporate your suggestions into the pull request.

@mr-tz
Collaborator Author

mr-tz commented Sep 23, 2024

location, length, encoding, and content of the string is part of the definition of the (static) string / top level or under .string

structure, tags, and prevalence are more like "context" - grouped together in an extensible way.

File layout should be stored separately from the strings.

An alternative representation could then look like this:

{
    "strings": {
        "static_strings": [
            {
                "id": 1,
                "encoding": "ascii",
                "offset": 77,
                "length": 40,
                "string": "!This program cannot be run in DOS mode."
            },
            {
                "id": 1337,
                "encoding": "ascii",
                "offset": 11644,
                "length": 12,
                "string": "VirtualQuery"
            },
            {
                "id": 9999,
                "encoding": "ascii",
                "offset": 123456,
                "length": 6,
                "string": "unique"
            }
        ],
        "context": {
            "1": {
                "structure": "pe.header",
                "tags": [
                    "#common"
                ]
            },
            "1337": {
                "structure": "import table",
                "tags": [
                    "#winapi",
                    "#common"
                ]
            }
            # no 9999 entry
        }
    },
    "file_layout": {
        ...
    }
}
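
A presentation layer could then join the two sections by id. The following is a minimal sketch of that join, not an agreed design: the `stitch` helper is hypothetical, and it assumes the context keys are serialized as strings, since JSON object keys must be strings.

```python
from typing import Any, Dict, List

document: Dict[str, Any] = {
    "strings": {
        "static_strings": [
            {"id": 1337, "encoding": "ascii", "offset": 11644, "length": 12, "string": "VirtualQuery"},
            {"id": 9999, "encoding": "ascii", "offset": 123456, "length": 6, "string": "unique"},
        ],
        # Assumption: ids used as context keys are stored as strings.
        "context": {
            "1337": {"structure": "import table", "tags": ["#winapi", "#common"]},
        },
    },
}


def stitch(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Attach each string's context (or an empty dict) to a copy of its record."""
    strings = doc["strings"]
    context = strings.get("context", {})
    return [
        {**s, "context": context.get(str(s["id"]), {})}
        for s in strings["static_strings"]
    ]


rows = stitch(document)
```

Strings without a context entry, like id 9999 here, come back with an empty context instead of raising an error.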
            


3 participants