New patterns & ML update #600

babenek · 2024-08-27T18:06:33Z

Description

Please include a summary of the change and which issue is fixed.

Refactored ML features to reduce DC and added regex matching for an attribute
Retrained the ML model. Summary F1 = 0.952886
Added new ML features that use a regex on an attribute
ML training now produces a chart with the MD5 sums of the config and model at the bottom, to help manage them
Removed the VariableNotAllowedPatternCheck filter; the pattern is now applied in ML. This covers various cases with _ID postfixes and AWS keys.
Refactored the Keyword pattern to handle escaped quotes in a value. Tests added. Some cases are still not covered.
Extended the ValuePath filter with a morphemes check to reduce FN
Removed duplicated filters
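
As an illustration of the MD5 item above, a minimal sketch of how a chart footer could tie a training run to the exact config and model files. The names `md5_of_file` and `chart_footer` are hypothetical and are not taken from the CredSweeper code; the real training script may build the caption differently.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def chart_footer(config_path: Path, model_path: Path) -> str:
    """Build a one-line caption (hypothetical format) with both checksums."""
    return f"config md5: {md5_of_file(config_path)}  model md5: {md5_of_file(model_path)}"
```

Printing both sums under the chart makes it possible to check later which config and model file a given chart was produced from.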

How has this been tested?

Please describe the tests that you ran to verify your changes.

UnitTest
Benchmark

babenek · 2024-08-30T06:38:39Z

@Samsung/credsweeper_maintainers , please give your opinion about the rules names.

babenek

TODO: rollback after Samsung/CredData#167

.github/workflows/benchmark.yml

codecov-commenter · 2024-09-04T19:45:17Z

Codecov Report

Attention: Patch coverage is 92.30769% with 23 lines in your changes missing coverage. Please review.

Project coverage is 90.57%. Comparing base (a73faaa) to head (355abb2).

Files with missing lines	Patch %	Lines
credsweeper/ml_model/features/has_html_tag.py	77.77%	2 Missing and 2 partials ⚠️
credsweeper/ml_model/features/feature.py	86.95%	2 Missing and 1 partial ⚠️
credsweeper/ml_model/features/word_in.py	92.50%	2 Missing and 1 partial ⚠️
credsweeper/ml_model/features/word_in_path.py	80.00%	2 Missing and 1 partial ⚠️
credsweeper/credentials/line_data.py	90.47%	0 Missing and 2 partials ⚠️
credsweeper/ml_model/features/word_in_line.py	85.71%	1 Missing and 1 partial ⚠️
credsweeper/ml_model/features/word_in_value.py	83.33%	1 Missing and 1 partial ⚠️
credsweeper/utils/util.py	60.00%	1 Missing and 1 partial ⚠️
credsweeper/ml_model/features/file_extension.py	91.66%	1 Missing ⚠️
credsweeper/ml_model/features/rule_name.py	91.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #600      +/-   ##
==========================================
- Coverage   90.63%   90.57%   -0.06%     
==========================================
  Files         130      143      +13     
  Lines        4794     4870      +76     
  Branches      783      786       +3     
==========================================
+ Hits         4345     4411      +66     
- Misses        292      298       +6     
- Partials      157      161       +4


babenek · 2024-09-05T04:32:24Z

The benchmark fails before Samsung/CredData#167

csh519

Please check my comments below.

And next time, please split the PR by feature before requesting a review.
As you know, a huge PR can't get a good review.

csh519 · 2024-09-05T06:12:09Z

credsweeper/credentials/line_data.py

-        while self.value.endswith('}') and '{' in self.line[:self.value_start]:
-            self.value = self.value[:-1]
+        """Parenthesis, curly and squared brackets may be caught in TOML format and bash. Simple clearing"""
+        dirty = self.value and self.value[-1] in ['}', ']', ')']


dirty is a good name too, but IMO it would be better to rename it in the is_xxx style.
How about is_cleared or is_clear?
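
The diff above shows only the first line of the new clean-up. A minimal sketch of what "simple clearing" of trailing brackets could look like, assuming the rule is to drop a closing bracket that has no opening partner inside the value. `strip_unbalanced_tail` and `PAIRS` are hypothetical names, and the actual LineData code in the PR may differ.

```python
# Hypothetical helper; not the actual LineData implementation.
PAIRS = {'}': '{', ']': '[', ')': '('}


def strip_unbalanced_tail(value: str) -> str:
    """Drop trailing closing brackets that have no opening partner in the value."""
    while value and value[-1] in PAIRS:
        closing = value[-1]
        opening = PAIRS[closing]
        if value.count(opening) >= value.count(closing):
            break  # the bracket is balanced inside the value, so keep it
        value = value[:-1]
    return value
```

Under this assumption a value captured as `secret}` from a TOML inline table is trimmed to `secret`, while `f(x)` is left untouched.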

csh519 · 2024-09-05T06:15:37Z

credsweeper/filters/value_method_check.py

-        if "function" in line_data.value or self.PATTERN.search(line_data.value):
+        if "function" in line_data.value or self.PATTERN.match(line_data.value):


Are there any FP cases that include the function keyword in the value?
I think checking the function keyword with search() seems reasonable..

Actually, there was an FN case for randomly generated credentials like "12#@5(!)fs", so I refactored the pattern and applied match() so that matching starts from the beginning of the value.

Hmm.. I still don't get it clearly.
Do you mean there is a randomly generated credential value that includes the "function" keyword?
Like "12#@5(!)function3w@91"?
If so, I think the search() method works fine.

No, I mean only the second part of the OR condition in the check.
The function check does not use a regex, so I have not found any FP or FN for function yet.
On the other hand, "12#@5(!)fs" or something like it became an FN because of the filter.

Ah.. I just misunderstood this.
But in a case like the one below, isn't the search() method still more useful?

myPassword = "this_is_not_password"

The filter does not return True for that line. It does so only if the line is

myPassword = this_is_not_password()

or

myPassword=password_function

*depends on the file

On the other hand, your line of code means that a string is assigned to a variable with "password" in its name.
So it may be a password. Even "1234" may be a password, but CredSweeper skips that sequence.

Okay, I agree. Let's keep the change then.
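
For readers following this thread, a minimal sketch of the search() versus match() difference discussed above. `CALL_PATTERN` is a hypothetical "looks like a method call" regex chosen for illustration; the real `PATTERN` of the filter is not shown in this thread and may differ.

```python
import re

# Hypothetical pattern; NOT the real PATTERN from value_method_check.py.
CALL_PATTERN = re.compile(r"[.\w]*\(.*\)")


def is_call_with_search(value: str) -> bool:
    """search() scans the whole value, so a call-like fragment anywhere is a hit."""
    return CALL_PATTERN.search(value) is not None


def is_call_with_match(value: str) -> bool:
    """match() is anchored at the start, so the value must begin like a call."""
    return CALL_PATTERN.match(value) is not None


for value in ("12#@5(!)fs", "this_is_not_password()", "this_is_not_password"):
    print(f"{value!r}: search={is_call_with_search(value)}, match={is_call_with_match(value)}")
```

With this pattern, search() treats the random value "12#@5(!)fs" as a method call because of the inner "5(!)" fragment, which would filter out a real credential. match() rejects it and still accepts "this_is_not_password()". A plain string without parentheses is not treated as a call by either method.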

csh519 · 2024-09-05T06:17:00Z

credsweeper/filters/variable_not_allowed_pattern_check.py

@@ -1,36 +0,0 @@
-import re


Why has this filter been removed?
There is no description for it..

AWS_KEY_ID is filtered often :(
The feature was migrated to ML

A mention of VariableNotAllowedPatternCheck was added to the PR description

Okay, I see..
Are there a lot of patterns that are not allowed?
I think if we delegate this filter's role to ML, it will be harder to reason about and debug the results.
So I'm worried about this.

The filter eliminated any chance for ML to report a credential like:

PASSWORD_FOR_ID="Th@tHid$mYpDW"

and other sensitive data like an AWS ID.

It seems filtering these credentials with ML will be better..
Okay, let's move the role to ML.
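
A minimal sketch of the problem described in this thread, assuming the removed filter rejected variable names by a regex such as an `_ID` postfix. `NOT_ALLOWED` and `is_filtered` are hypothetical stand-ins; the actual pattern of VariableNotAllowedPatternCheck is not shown here.

```python
import re

# Hypothetical stand-in for the removed filter's pattern; the real regex may differ.
NOT_ALLOWED = re.compile(r"_id$", re.IGNORECASE)


def is_filtered(variable: str) -> bool:
    """Return True when a candidate would be dropped before ML can score it."""
    return NOT_ALLOWED.search(variable) is not None


for name in ("PASSWORD_FOR_ID", "AWS_KEY_ID", "PASSWORD"):
    print(name, is_filtered(name))
```

Under this assumption both `PASSWORD_FOR_ID` and `AWS_KEY_ID` are discarded by the hard filter regardless of their values, which is the FN case raised above. Passing the same signal to ML as a feature lets the model weigh it against the value instead.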

docs/source/credsweeper.ml_model.features.rst

babenek · 2024-09-10T13:41:31Z

Yullia

Approved

babenek changed the title ~~Auxiliary~~ ML update feature WordsInPath Aug 28, 2024

babenek mentioned this pull request Aug 28, 2024

corrections and migration to various path Samsung/CredData#164

Merged

babenek changed the title ~~ML update feature WordsInPath~~ New patterns & ML update Aug 30, 2024

babenek requested review from xDizzix, Yullia, yoonhyuckjoon, iuriimet, Dmitriy-NK, csh519 and kmnls August 30, 2024 06:37

babenek mentioned this pull request Sep 3, 2024

New rules markup Samsung/CredData#166

Merged

babenek commented Sep 3, 2024


babenek marked this pull request as ready for review September 3, 2024 10:42

babenek requested a review from a team as a code owner September 3, 2024 10:42

babenek marked this pull request as draft September 3, 2024 10:47

babenek mentioned this pull request Sep 4, 2024

Corrections Samsung/CredData#167

Merged

babenek marked this pull request as ready for review September 5, 2024 04:32

csh519 reviewed Sep 5, 2024


babenek marked this pull request as draft September 5, 2024 07:36

babenek force-pushed the auxiliary branch from 926f286 to d5b9c0f Compare September 5, 2024 15:29

babenek requested a review from csh519 September 6, 2024 06:06

babenek force-pushed the auxiliary branch 2 times, most recently from f3241c5 to 330cc04 Compare September 8, 2024 13:23

babenek mentioned this pull request Sep 9, 2024

KeywordPattern regex improvement #606

Merged

2 tasks

babenek commented Sep 10, 2024


babenek marked this pull request as ready for review September 10, 2024 13:47

xDizzix previously approved these changes Sep 16, 2024


babenek marked this pull request as draft September 16, 2024 08:13

softreset

355abb2

babenek dismissed xDizzix’s stale review via 355abb2 September 16, 2024 09:07

babenek force-pushed the auxiliary branch from 764cd8a to 355abb2 Compare September 16, 2024 09:07

babenek added 2 commits September 16, 2024 12:13

not whitespaces in pattern

0dd38a2

BMfix

7c58f91

babenek marked this pull request as ready for review September 16, 2024 10:18

light enhancement

b56845c

babenek requested a review from xDizzix September 16, 2024 10:30

babenek added 4 commits September 16, 2024 14:10

DC fix

c79499a

Tests for coverage

625a0f3

tests and fix

cf0db9e

tests and fix WordInPath

2cdd562

babenek mentioned this pull request Sep 17, 2024

Version up to v1.9.0 #603

Merged

2 tasks

Yullia approved these changes Sep 19, 2024


xDizzix approved these changes Sep 19, 2024


babenek merged commit 80da63d into Samsung:main Sep 19, 2024
27 checks passed

babenek deleted the auxiliary branch September 19, 2024 08:44
