Absolute positions of value #160

Merged: 11 commits, Aug 14, 2024
50 changes: 25 additions & 25 deletions README.md
@@ -132,31 +132,31 @@ In order to compose an accurate Ground Truth set, we proceed data review based o
Metadata includes Ground Truth values and additional information for credential lines detected by various tools.

### Properties on the Metadata
| Name of property   | Data Type | Description |
|--------------------|-----------|-------------|
| Id                 | Integer   | Credential ID |
| FileID             | String    | Filename hash. Used to download the correct file from an external repo |
| Domain             | String    | Domain of the repository (e.g., GitHub) |
| RepoName           | String    | Project name where the credential was found |
| FilePath           | String    | File path where the credential information was included |
| LineStart          | Integer   | Start line of the credential in the file, counted from 1 as in most editors. In common cases it equals LineEnd. |
| LineEnd            | Integer   | End line of the credential. Must be greater than or equal to LineStart. Sort line_data_list by line_num for this. |
| GroundTruth        | String    | Ground Truth of this credential: True (T) / False (F) or Template |
| WithWords          | Boolean   | Flag to indicate a word (https://github.com/first20hours/google-10000-english) is included in the credential |
| ValueStart         | Integer   | Index of the value in the line after lstrip(). This is the start position on LineStart. Empty or -1 means the markup covers the whole line (for False only) |
| ValueEnd           | Integer   | Index of the character right after the value ends in the line (after lstrip). This is the end position on LineEnd. May be -1 or empty. |
| InURL              | Boolean   | Flag to indicate whether the credential is part of a URL, such as "http://user:[email protected]" |
| InRuntimeParameter | Boolean   | Flag to indicate whether the credential is in a runtime parameter |
| CharacterSet       | String    | Characters used in the credential (NumberOnly, CharOnly, Any) |
| CryptographyKey    | String    | Type of a key: Private or Public |
| PredefinedPattern  | String    | Credential with a predefined regex pattern (e.g., AWS token with `AKIA...` pattern) |
| VariableNameType   | String    | Categorizes credentials by variable name into Secret, Key, Token, SeedSalt and Auth |
| Entropy            | Float     | Shannon entropy of the credential |
| Length             | Integer   | Value length, similar to ValueEnd - ValueStart |
| Base64Encode       | Boolean   | Is the credential a base64 string? |
| HexEncode          | Boolean   | Is the credential a hex encoded string? (like `\xFF` or `FF 02 33`) |
| URLEncode          | Boolean   | Is the credential a URL encoded string? (like `one%20two`) |
| Category           | String    | Labeled data according to CredSweeper rules. See [Category](#category). |
| Name of property   | Data Type | Description |
|--------------------|-----------|-------------|
| Id                 | Integer   | Credential ID |
| FileID             | String    | Filename hash. Used to download the correct file from an external repo |
| Domain             | String    | Domain of the repository (e.g., GitHub) |
| RepoName           | String    | Project name where the credential was found |
| FilePath           | String    | File path where the credential information was included |
| LineStart          | Integer   | Start line of the credential in the file, counted from 1 as in most editors. In common cases it equals LineEnd. |
| LineEnd            | Integer   | End line of the credential. Must be greater than or equal to LineStart. Sort line_data_list by line_num for this. |
| GroundTruth        | String    | Ground Truth of this credential: True (T) / False (F) or Template |
| WithWords          | Boolean   | Flag to indicate a word (https://github.com/first20hours/google-10000-english) is included in the credential |
| ValueStart         | Integer   | Index of the value in the line, as in the CredSweeper report. This is the start position on LineStart. Empty or -1 means the markup covers the whole line (for False only) |
| ValueEnd           | Integer   | Index of the character right after the value ends in the line. This is the end position on LineEnd. May be -1 or empty. |
| InURL              | Boolean   | Flag to indicate whether the credential is part of a URL, such as "http://user:[email protected]" |
| InRuntimeParameter | Boolean   | Flag to indicate whether the credential is in a runtime parameter |
| CharacterSet       | String    | Characters used in the credential (NumberOnly, CharOnly, Any) |
| CryptographyKey    | String    | Type of a key: Private or Public |
| PredefinedPattern  | String    | Credential with a predefined regex pattern (e.g., AWS token with `AKIA...` pattern) |
| VariableNameType   | String    | Categorizes credentials by variable name into Secret, Key, Token, SeedSalt and Auth |
| Entropy            | Float     | Shannon entropy of the credential |
| Length             | Integer   | Value length, similar to ValueEnd - ValueStart |
| Base64Encode       | Boolean   | Is the credential a base64 string? |
| HexEncode          | Boolean   | Is the credential a hex encoded string? (like `\xFF` or `FF 02 33`) |
| URLEncode          | Boolean   | Is the credential a URL encoded string? (like `one%20two`) |
| Category           | String    | Labeled data according to CredSweeper rules. See [Category](#category). |
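
Under the new convention, ValueStart and ValueEnd index directly into the raw line of the data file, so no lstrip() correction is needed when extracting or replacing a value. A minimal sketch of the difference, using a made-up line and hypothetical positions:

```python
# Hypothetical line from a data file; the positions below are illustrative, not from the real markup.
line = '        "api_key": "AKIAIOSFODNN7EXAMPLE",'

value_start = 20  # new convention: absolute index in the raw line
value_end = 40

# With absolute positions the value is a direct slice of the raw line.
value = line[value_start:value_end]
assert value == "AKIAIOSFODNN7EXAMPLE"

# The old convention stored positions relative to line.lstrip(),
# so the indentation offset had to be added back first.
offset = len(line) - len(line.lstrip())
assert line.lstrip()[value_start - offset:value_end - offset] == value
```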

### Category

72 changes: 72 additions & 0 deletions abs_pos_update_data.py
@@ -0,0 +1,72 @@
#!/usr/bin/env python3

"""
The script is developed to update meta with absolute positions of value instead from stripped line
"""

import os
import subprocess
import sys
from argparse import ArgumentParser
from functools import cache

from meta_row import read_meta

EXIT_SUCCESS = 0
EXIT_FAILURE = 1


@cache
def read_cache(path: str) -> list[str]:
    # Normalize line endings and split the file into lines; cached so each file is read only once
    with open(path, "r", encoding="utf8") as f:
        return f.read().replace("\r\n", '\n').replace('\r', '\n').split('\n')


def main(meta_dir: str, data_dir: str) -> int:
    errors = 0
    updated_rows = 0

    if not os.path.exists(meta_dir):
        raise FileNotFoundError(f"{meta_dir} directory does not exist.")
    if not os.path.exists(data_dir):
        raise FileNotFoundError(f"{data_dir} directory does not exist.")

    meta = read_meta(meta_dir)
    meta.sort(key=lambda x: (x.FilePath, x.LineStart, x.LineEnd, x.ValueStart, x.ValueEnd))
    for row in meta:
        offset = offset_aux = 0
        if 0 <= row.ValueStart:
            lines = read_cache(row.FilePath)
            line = lines[row.LineStart - 1]
            # indentation that was stripped by lstrip() in the original markup
            offset = len(line) - len(line.lstrip())
            row.ValueStart += offset
            if 0 <= row.ValueEnd:
                if row.LineStart == row.LineEnd:
                    row.ValueEnd += offset
                else:
                    # multiline credential: the end offset comes from the last line
                    line_aux = lines[row.LineEnd - 1]
                    offset_aux = len(line_aux) - len(line_aux.lstrip())
                    row.ValueEnd += offset_aux
        if 0 > offset or 0 > offset_aux:
            errors += 1
        if 0 < (offset + offset_aux):
            # rewrite the matching CSV row (matched by its Id prefix) with the updated positions
            subprocess.run(
                ["sed", "-i",
                 "s|^" + str(row.Id) + ",.*|" + str(row) + "|",
                 f"{meta_dir}/{row.RepoName}.csv"])
            updated_rows += 1

    result = EXIT_SUCCESS if 0 == errors else EXIT_FAILURE
    print(f"Updated {updated_rows} of {len(meta)}, errors: {errors}, {result}", flush=True)
    return result


if __name__ == "__main__":
parser = ArgumentParser(prog=f"python {os.path.basename(__file__)}",
description="Temporally console script for update meta with absolute positions of values")

parser.add_argument("meta_dir", help="Markup location", nargs='?', default="meta")
parser.add_argument("data_dir", help="Dataset location", nargs='?', default="data")
_args = parser.parse_args()

exit_code = main(_args.meta_dir, _args.data_dir)
sys.exit(exit_code)
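
With the defaults declared above (`nargs='?'` with `meta` and `data`), the script can be run from the repository root simply as `python abs_pos_update_data.py`, or with explicit locations, e.g. `python abs_pos_update_data.py meta data`.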
6 changes: 3 additions & 3 deletions benchmark/scanner/credsweeper.py
@@ -32,7 +32,7 @@ def run_scanner(self) -> None:
self.init_scanner()
subprocess.call([
"./venv/bin/python", "-m", "credsweeper", "--banner", "--path", f"{self.cred_data_dir}/data", "--jobs", "4",
"--save-json", self.output_dir, "--sort"
"--save-json", self.output_dir, "--sort", "--subtext"
],
cwd=self.scanner_dir)

@@ -61,6 +61,6 @@ def parse_result(self) -> None:
self.check_line_from_meta(file_path=meta_cred.path,
line_start=meta_cred.line_start,
line_end=meta_cred.line_end,
value_start=meta_cred.strip_value_start,
value_end=meta_cred.strip_value_end,
value_start=meta_cred.value_start,
value_end=meta_cred.value_end,
rule=meta_cred.rule)
25 changes: 22 additions & 3 deletions benchmark/scanner/scanner.py
@@ -1,3 +1,5 @@
import binascii
import hashlib
import os
from abc import ABC, abstractmethod
from pathlib import Path
@@ -45,8 +47,19 @@ def output_dir(self) -> str:
def output_dir(self, output_dir: str) -> None:
raise NotImplementedError()

@staticmethod
def _meta_checksum(meta_location) -> str:
checksum = hashlib.md5(b'').digest()
for root, dirs, files in os.walk(meta_location):
for file in files:
with open(os.path.join(root, file), "rb") as f:
csv_checksum = hashlib.md5(f.read()).digest()
checksum = bytes(a ^ b for a, b in zip(checksum, csv_checksum))
return binascii.hexlify(checksum).decode()

def _prepare_meta(self):
for _row in _get_source_gen(Path(f"{self.cred_data_dir}/meta")):
meta_path = Path(f"{self.cred_data_dir}/meta")
for _row in _get_source_gen(meta_path):
meta_row = MetaRow(_row)
meta_key = MetaKey(meta_row)
if meta_rows := self.meta.get(meta_key):
@@ -78,6 +91,7 @@ def _prepare_meta(self):
if self.meta_next_id <= meta_row.Id:
self.meta_next_id = meta_row.Id + 1

data_checksum = hashlib.md5(b'').digest()
# getting count of all not-empty lines
data_dir = f"{self.cred_data_dir}/data"
valid_dir_list = ["src", "test", "other"]
@@ -88,8 +102,11 @@
file_ext_lower = file_ext.lower()
# the type must be in dictionary
self.file_types[file_ext_lower].files_number += 1
with open(os.path.join(root, file), "r", encoding="utf8") as f:
lines = f.read().split('\n')
with open(os.path.join(root, file), "rb") as f:
data = f.read()
file_checksum = hashlib.md5(data).digest()
data_checksum = bytes(a ^ b for a, b in zip(data_checksum, file_checksum))
lines = data.decode("utf-8").split('\n')
file_data_valid_lines = 0
for line in lines:
# minimal length of IPv4 detection is 7 e.g. 8.8.8.8
Expand All @@ -98,6 +115,8 @@ def _prepare_meta(self):
self.total_data_valid_lines += file_data_valid_lines
self.file_types[file_ext_lower].valid_lines += file_data_valid_lines

print(f"META MD5 {self._meta_checksum(meta_path)}", flush=True)
print(f"DATA MD5 {binascii.hexlify(data_checksum).decode()}", flush=True)
print(f"DATA: {self.total_data_valid_lines} interested lines. MARKUP: {len(self.meta)} items", flush=True)
types_headers = ["FileType", "FileNumber", "ValidLines", "Positives", "Negatives", "Templates"]
types_rows = []
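
The `_meta_checksum` helper and the `data_checksum` accumulation above fold per-file MD5 digests together with XOR, so the printed checksum does not depend on the order in which files are visited. A standalone sketch of the same idea (the directory name is only an example):

```python
import binascii
import hashlib
import os


def folded_md5(location: str) -> str:
    """XOR per-file MD5 digests so the result is independent of traversal order."""
    checksum = hashlib.md5(b'').digest()  # 16-byte seed, as in the scanner
    for root, _dirs, files in os.walk(location):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                digest = hashlib.md5(f.read()).digest()
            # byte-wise XOR keeps the accumulator at a fixed 16-byte length
            checksum = bytes(a ^ b for a, b in zip(checksum, digest))
    return binascii.hexlify(checksum).decode()


# Example: print(folded_md5("meta")) should reproduce the "META MD5 ..." value for a local checkout.
```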
15 changes: 2 additions & 13 deletions download_data.py
@@ -428,14 +428,11 @@ def replace_rows(data: List[MetaRow]):
lines = f.read().split('\n')

old_line = lines[row.LineStart - 1]

indentation = len(old_line) - len(old_line.lstrip())

value = old_line[indentation + row.ValueStart:indentation + row.ValueEnd]
value = old_line[row.ValueStart:row.ValueEnd]
# credsweeper does not scan lines over 8000 symbols, so 1<<13 is enough
random.seed((row.LineStart << 13 + row.ValueStart) ^ int(row.FileID, 16))
obfuscated_value = get_obfuscated_value(value, row)
new_line = old_line[:indentation + row.ValueStart] + obfuscated_value + old_line[indentation + row.ValueEnd:]
new_line = old_line[:row.ValueStart] + obfuscated_value + old_line[row.ValueEnd:]

lines[row.LineStart - 1] = new_line

@@ -535,14 +532,6 @@ def create_new_multiline(lines: List[str], starting_position: int):
new_lines = []

first_line = lines[0]
# First line might have an offset for variable name
if 0 <= starting_position:
# starting_position was obtained from stripped line!
offset = len(first_line) - len(first_line.lstrip())
starting_position += offset
else:
# empty integers are cast to -1 from csv
starting_position = 0

new_lines.append(first_line[:starting_position] + obfuscate_segment(first_line[starting_position:]))

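With absolute positions, the obfuscation in `replace_rows` splices the new value straight into the raw line, with no indentation bookkeeping. A tiny illustration (line, positions, and obfuscated value are made up):

```python
old_line = '    token = "hunter2secret"'
value_start, value_end = 13, 26  # absolute positions of the value in the raw line
obfuscated_value = "XXXXXXXXXXXXX"  # stand-in for get_obfuscated_value(value, row)

new_line = old_line[:value_start] + obfuscated_value + old_line[value_end:]
assert new_line == '    token = "XXXXXXXXXXXXX"'
```
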
4 changes: 2 additions & 2 deletions markup_report.py
@@ -42,9 +42,9 @@ def main(report_file: str, meta_dir: str):
meta_cred = MetaCred(cred)
key_variants = [
# exact match
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.strip_value_start, meta_cred.strip_value_end),
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.value_start, meta_cred.value_end),
# meta markup only with start position
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.strip_value_start, -1),
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.value_start, -1),
# markup for whole line
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, -1, -1)
]