Absolute positions of value #160

Merged: 11 commits, Aug 14, 2024
50 changes: 25 additions & 25 deletions README.md
@@ -132,31 +132,31 @@ In order to compose an accurate Ground Truth set, we proceed data review based o
Metadata includes Ground Truth values and additional information for credential lines detected by various tools.

### Properties on the Metadata
| Name of property   | Data Type | Description |
|--------------------|-----------|-------------|
| Id                 | Integer   | Credential ID |
| FileID             | String    | Filename hash. Used to download the correct file from an external repo |
| Domain             | String    | Domain of the repository (e.g., GitHub) |
| RepoName           | String    | Project name where the credential was found |
| FilePath           | String    | File path where the credential information was included |
| LineStart          | Integer   | Start line of the credential in the file, counted from 1 as in most editors. In common cases it equals LineEnd. |
| LineEnd            | Integer   | End line of the credential. Must be greater than or equal to LineStart. Sort line_data_list by line_num for this. |
| GroundTruth        | String    | Ground Truth of this credential: True (T) / False (F) or Template |
| WithWords          | Boolean   | Flag to indicate a word (https://github.com/first20hours/google-10000-english) is included in the credential |
| ValueStart         | Integer   | Index of the value in the line after lstrip(). This is the start position on LineStart. Empty or -1 means the markup covers the whole line (for False only) |
| ValueEnd           | Integer   | Index of the character right after the value ends in the line (after lstrip). This is the end position on LineEnd. May be -1 or empty. |
| InURL              | Boolean   | Flag to indicate whether the credential is part of a URL, such as "http://user:[email protected]" |
| InRuntimeParameter | Boolean   | Flag to indicate whether the credential is in a runtime parameter |
| CharacterSet       | String    | Characters used in the credential (NumberOnly, CharOnly, Any) |
| CryptographyKey    | String    | Type of a key: Private or Public |
| PredefinedPattern  | String    | Credential with a predefined regex pattern (e.g., AWS token with `AKIA...` pattern) |
| VariableNameType   | String    | Categorizes credentials by variable name into Secret, Key, Token, SeedSalt and Auth |
| Entropy            | Float     | Shannon entropy of the credential |
| Length             | Integer   | Value length, similar to ValueEnd - ValueStart |
| Base64Encode       | Boolean   | Is the credential a base64 string? |
| HexEncode          | Boolean   | Is the credential a hex encoded string? (like `\xFF` or `FF 02 33`) |
| URLEncode          | Boolean   | Is the credential a URL encoded string? (like `one%20two`) |
| Category           | String    | Labeled data according to CredSweeper rules. See [Category](#category). |
| Name of property   | Data Type | Description |
|--------------------|-----------|-------------|
| Id                 | Integer   | Credential ID |
| FileID             | String    | Filename hash. Used to download the correct file from an external repo |
| Domain             | String    | Domain of the repository (e.g., GitHub) |
| RepoName           | String    | Project name where the credential was found |
| FilePath           | String    | File path where the credential information was included |
| LineStart          | Integer   | Start line of the credential in the file, counted from 1 as in most editors. In common cases it equals LineEnd. |
| LineEnd            | Integer   | End line of the credential. Must be greater than or equal to LineStart. Sort line_data_list by line_num for this. |
| GroundTruth        | String    | Ground Truth of this credential: True (T) / False (F) or Template |
| WithWords          | Boolean   | Flag to indicate a word (https://github.com/first20hours/google-10000-english) is included in the credential |
| ValueStart         | Integer   | Index of the value in the line, as in the CredSweeper report. This is the start position on LineStart. Empty or -1 means the markup covers the whole line (for False only) |
| ValueEnd           | Integer   | Index of the character right after the value ends in the line. This is the end position on LineEnd. May be -1 or empty. |
| InURL              | Boolean   | Flag to indicate whether the credential is part of a URL, such as "http://user:[email protected]" |
| InRuntimeParameter | Boolean   | Flag to indicate whether the credential is in a runtime parameter |
| CharacterSet       | String    | Characters used in the credential (NumberOnly, CharOnly, Any) |
| CryptographyKey    | String    | Type of a key: Private or Public |
| PredefinedPattern  | String    | Credential with a predefined regex pattern (e.g., AWS token with `AKIA...` pattern) |
| VariableNameType   | String    | Categorizes credentials by variable name into Secret, Key, Token, SeedSalt and Auth |
| Entropy            | Float     | Shannon entropy of the credential |
| Length             | Integer   | Value length, similar to ValueEnd - ValueStart |
| Base64Encode       | Boolean   | Is the credential a base64 string? |
| HexEncode          | Boolean   | Is the credential a hex encoded string? (like `\xFF` or `FF 02 33`) |
| URLEncode          | Boolean   | Is the credential a URL encoded string? (like `one%20two`) |
| Category           | String    | Labeled data according to CredSweeper rules. See [Category](#category). |
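
Under the new convention, ValueStart and ValueEnd index directly into the raw line of the data file, so no lstrip() correction is needed when extracting or replacing a value. A minimal sketch of the difference, using a made-up line and hypothetical positions:

```python
# Hypothetical line from a data file; the positions below are illustrative, not from the real markup.
line = '        "api_key": "AKIAIOSFODNN7EXAMPLE",'

value_start = 20  # new convention: absolute index in the raw line
value_end = 40

# With absolute positions the value is a direct slice of the raw line.
value = line[value_start:value_end]
assert value == "AKIAIOSFODNN7EXAMPLE"

# The old convention stored positions relative to line.lstrip(),
# so the indentation offset had to be added back first.
offset = len(line) - len(line.lstrip())
assert line.lstrip()[value_start - offset:value_end - offset] == value
```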

### Category

72 changes: 72 additions & 0 deletions abs_pos_update_data.py
@@ -0,0 +1,72 @@
#!/usr/bin/env python3

"""
The script is developed to update meta with absolute positions of value instead from stripped line
"""

import os
import subprocess
import sys
from argparse import ArgumentParser
from functools import cache

from meta_row import read_meta

EXIT_SUCCESS = 0
EXIT_FAILURE = 1


@cache
def read_cache(path: str) -> list[str]:
    # Normalize line endings and split the file into lines; cached so each file is read only once
    with open(path, "r", encoding="utf8") as f:
        return f.read().replace("\r\n", '\n').replace('\r', '\n').split('\n')


def main(meta_dir: str, data_dir: str) -> int:
    errors = 0
    updated_rows = 0

    if not os.path.exists(meta_dir):
        raise FileNotFoundError(f"{meta_dir} directory does not exist.")
    if not os.path.exists(data_dir):
        raise FileNotFoundError(f"{data_dir} directory does not exist.")

    meta = read_meta(meta_dir)
    meta.sort(key=lambda x: (x.FilePath, x.LineStart, x.LineEnd, x.ValueStart, x.ValueEnd))
    for row in meta:
        offset = offset_aux = 0
        if 0 <= row.ValueStart:
            lines = read_cache(row.FilePath)
            line = lines[row.LineStart - 1]
            # indentation that was stripped by lstrip() in the original markup
            offset = len(line) - len(line.lstrip())
            row.ValueStart += offset
            if 0 <= row.ValueEnd:
                if row.LineStart == row.LineEnd:
                    row.ValueEnd += offset
                else:
                    # multiline credential: the end offset comes from the last line
                    line_aux = lines[row.LineEnd - 1]
                    offset_aux = len(line_aux) - len(line_aux.lstrip())
                    row.ValueEnd += offset_aux
        if 0 > offset or 0 > offset_aux:
            errors += 1
        if 0 < (offset + offset_aux):
            # rewrite the matching CSV row (matched by its Id prefix) with the updated positions
            subprocess.run(
                ["sed", "-i",
                 "s|^" + str(row.Id) + ",.*|" + str(row) + "|",
                 f"{meta_dir}/{row.RepoName}.csv"])
            updated_rows += 1

    result = EXIT_SUCCESS if 0 == errors else EXIT_FAILURE
    print(f"Updated {updated_rows} of {len(meta)}, errors: {errors}, {result}", flush=True)
    return result


if __name__ == "__main__":
parser = ArgumentParser(prog=f"python {os.path.basename(__file__)}",
description="Temporally console script for update meta with absolute positions of values")

parser.add_argument("meta_dir", help="Markup location", nargs='?', default="meta")
parser.add_argument("data_dir", help="Dataset location", nargs='?', default="data")
_args = parser.parse_args()

exit_code = main(_args.meta_dir, _args.data_dir)
sys.exit(exit_code)
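
With the defaults declared above (`nargs='?'` with `meta` and `data`), the script can be run from the repository root simply as `python abs_pos_update_data.py`, or with explicit locations, e.g. `python abs_pos_update_data.py meta data`.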
6 changes: 3 additions & 3 deletions benchmark/scanner/credsweeper.py
@@ -32,7 +32,7 @@ def run_scanner(self) -> None:
self.init_scanner()
subprocess.call([
"./venv/bin/python", "-m", "credsweeper", "--banner", "--path", f"{self.cred_data_dir}/data", "--jobs", "4",
"--save-json", self.output_dir, "--sort"
"--save-json", self.output_dir, "--sort", "--subtext"
],
cwd=self.scanner_dir)

@@ -61,6 +61,6 @@ def parse_result(self) -> None:
self.check_line_from_meta(file_path=meta_cred.path,
line_start=meta_cred.line_start,
line_end=meta_cred.line_end,
value_start=meta_cred.strip_value_start,
value_end=meta_cred.strip_value_end,
value_start=meta_cred.value_start,
value_end=meta_cred.value_end,
rule=meta_cred.rule)
25 changes: 22 additions & 3 deletions benchmark/scanner/scanner.py
@@ -1,3 +1,5 @@
import binascii
import hashlib
import os
from abc import ABC, abstractmethod
from pathlib import Path
@@ -45,8 +47,19 @@ def output_dir(self) -> str:
def output_dir(self, output_dir: str) -> None:
raise NotImplementedError()

@staticmethod
def _meta_checksum(meta_location) -> str:
checksum = hashlib.md5(b'').digest()
for root, dirs, files in os.walk(meta_location):
for file in files:
with open(os.path.join(root, file), "rb") as f:
csv_checksum = hashlib.md5(f.read()).digest()
checksum = bytes(a ^ b for a, b in zip(checksum, csv_checksum))
return binascii.hexlify(checksum).decode()

def _prepare_meta(self):
for _row in _get_source_gen(Path(f"{self.cred_data_dir}/meta")):
meta_path = Path(f"{self.cred_data_dir}/meta")
for _row in _get_source_gen(meta_path):
meta_row = MetaRow(_row)
meta_key = MetaKey(meta_row)
if meta_rows := self.meta.get(meta_key):
@@ -78,6 +91,7 @@ def _prepare_meta(self):
if self.meta_next_id <= meta_row.Id:
self.meta_next_id = meta_row.Id + 1

data_checksum = hashlib.md5(b'').digest()
# getting count of all not-empty lines
data_dir = f"{self.cred_data_dir}/data"
valid_dir_list = ["src", "test", "other"]
@@ -88,8 +102,11 @@
file_ext_lower = file_ext.lower()
# the type must be in dictionary
self.file_types[file_ext_lower].files_number += 1
with open(os.path.join(root, file), "r", encoding="utf8") as f:
lines = f.read().split('\n')
with open(os.path.join(root, file), "rb") as f:
data = f.read()
file_checksum = hashlib.md5(data).digest()
data_checksum = bytes(a ^ b for a, b in zip(data_checksum, file_checksum))
lines = data.decode("utf-8").split('\n')
file_data_valid_lines = 0
for line in lines:
# minimal length of IPv4 detection is 7 e.g. 8.8.8.8
Expand All @@ -98,6 +115,8 @@ def _prepare_meta(self):
self.total_data_valid_lines += file_data_valid_lines
self.file_types[file_ext_lower].valid_lines += file_data_valid_lines

print(f"META MD5 {self._meta_checksum(meta_path)}", flush=True)
print(f"DATA MD5 {binascii.hexlify(data_checksum).decode()}", flush=True)
print(f"DATA: {self.total_data_valid_lines} interested lines. MARKUP: {len(self.meta)} items", flush=True)
types_headers = ["FileType", "FileNumber", "ValidLines", "Positives", "Negatives", "Templates"]
types_rows = []
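
The `_meta_checksum` helper and the `data_checksum` accumulation above fold per-file MD5 digests together with XOR, so the printed checksum does not depend on the order in which files are visited. A standalone sketch of the same idea (the directory name is only an example):

```python
import binascii
import hashlib
import os


def folded_md5(location: str) -> str:
    """XOR per-file MD5 digests so the result is independent of traversal order."""
    checksum = hashlib.md5(b'').digest()  # 16-byte seed, as in the scanner
    for root, _dirs, files in os.walk(location):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                digest = hashlib.md5(f.read()).digest()
            # byte-wise XOR keeps the accumulator at a fixed 16-byte length
            checksum = bytes(a ^ b for a, b in zip(checksum, digest))
    return binascii.hexlify(checksum).decode()


# Example: print(folded_md5("meta")) should reproduce the "META MD5 ..." value for a local checkout.
```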
15 changes: 2 additions & 13 deletions download_data.py
@@ -428,14 +428,11 @@ def replace_rows(data: List[MetaRow]):
lines = f.read().split('\n')

old_line = lines[row.LineStart - 1]

indentation = len(old_line) - len(old_line.lstrip())

value = old_line[indentation + row.ValueStart:indentation + row.ValueEnd]
value = old_line[row.ValueStart:row.ValueEnd]
# credsweeper does not scan lines over 8000 symbols, so 1<<13 is enough
random.seed((row.LineStart << 13 + row.ValueStart) ^ int(row.FileID, 16))
obfuscated_value = get_obfuscated_value(value, row)
new_line = old_line[:indentation + row.ValueStart] + obfuscated_value + old_line[indentation + row.ValueEnd:]
new_line = old_line[:row.ValueStart] + obfuscated_value + old_line[row.ValueEnd:]

lines[row.LineStart - 1] = new_line

@@ -535,14 +532,6 @@ def create_new_multiline(lines: List[str], starting_position: int):
new_lines = []

first_line = lines[0]
# First line might have an offset for variable name
if 0 <= starting_position:
# starting_position was obtained from stripped line!
offset = len(first_line) - len(first_line.lstrip())
starting_position += offset
else:
# empty integers are cast to -1 from csv
starting_position = 0

new_lines.append(first_line[:starting_position] + obfuscate_segment(first_line[starting_position:]))

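With absolute positions, the obfuscation in `replace_rows` splices the new value straight into the raw line, with no indentation bookkeeping. A tiny illustration (line, positions, and obfuscated value are made up):

```python
old_line = '    token = "hunter2secret"'
value_start, value_end = 13, 26  # absolute positions of the value in the raw line
obfuscated_value = "XXXXXXXXXXXXX"  # stand-in for get_obfuscated_value(value, row)

new_line = old_line[:value_start] + obfuscated_value + old_line[value_end:]
assert new_line == '    token = "XXXXXXXXXXXXX"'
```
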
4 changes: 2 additions & 2 deletions markup_report.py
@@ -42,9 +42,9 @@ def main(report_file: str, meta_dir: str):
meta_cred = MetaCred(cred)
key_variants = [
# exact match
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.strip_value_start, meta_cred.strip_value_end),
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.value_start, meta_cred.value_end),
# meta markup only with start position
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.strip_value_start, -1),
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, meta_cred.value_start, -1),
# markup for whole line
(meta_cred.path, meta_cred.line_start, meta_cred.line_end, -1, -1)
]