Skip to content

Commit

Permalink
ML rules fixes, new rule for msgpack-numpy (#39)
Browse files Browse the repository at this point in the history
* improves some rules and tests

* msgpack-numpy rule

* new rule for pandas eval functions

* keras and tf loading functions

* signed commit

* new rules: keras and tf loading

* fix formatting, less FPs for pandas-eval

* pandas-eval metadata

* pandas-eval, add testcase for empty f-string

* numpy-in-pytorch-modules, generic match

* pickles-in-keras-deprecation - regex

---------

Co-authored-by: Lucas Bourtoule <[email protected]>
Co-authored-by: Paweł Płatek <[email protected]>
Co-authored-by: GrosQuildu <[email protected]>
  • Loading branch information
4 people authored Apr 3, 2024
1 parent 772f68e commit 69fd8d8
Show file tree
Hide file tree
Showing 15 changed files with 336 additions and 2 deletions.
4 changes: 4 additions & 0 deletions python/lxml-in-pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,3 +24,7 @@
kwargs2 = {"io": touch, "flavor":"html5lib"}
# ok: lxml-in-pandas
pd.read_html(**kwargs2)

def test(**kwargs):
# ruleid: lxml-in-pandas
pd.read_html(**kwargs)
31 changes: 31 additions & 0 deletions python/msgpack-numpy.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
import msgpack
import msgpack_numpy as m
import numpy as np

x = np.random.rand(5)
# ruleid: msgpack-numpy
x_enc = msgpack.packb(x, default=m.encode)
# ruleid: msgpack-numpy
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)

# ok: msgpack-numpy
x_enc2 = msgpack.packb(x)
# ok: msgpack-numpy
x_rec2 = msgpack.unpackb(x_enc2)

# ok: msgpack-numpy
x_enc3 = msgpack.load(x)
# ok: msgpack-numpy
x_rec3 = msgpack.loads(x_enc2)

m.patch()

# ruleid: msgpack-numpy
x_enc3 = msgpack.packb(x)
# ruleid: msgpack-numpy
x_rec3 = msgpack.unpackb(x_enc2)

# ruleid: msgpack-numpy
x_enc3 = msgpack.load(x)
# ruleid: msgpack-numpy
x_rec3 = msgpack.loads(x_enc2)
41 changes: 41 additions & 0 deletions python/msgpack-numpy.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
rules:
- id: msgpack-numpy
message: >-
Found usage of msgpack-numpy unpacking, which relies on pickle to deserialize numpy arrays containing objects.
Functions reliant on pickle can result in arbitrary code execution.
Consider switching to a safer serialization method.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [numpy]
description: "Potential arbitrary code execution from functions reliant on pickling"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

pattern-either:
- patterns:
- pattern: msgpack.$FN(...)
- metavariable-regex:
metavariable: $FN
regex: (loads?|dumps?|packb?|unpackb?)
- pattern-inside: |
msgpack_numpy.patch()
...
- patterns:
- pattern: msgpack.$FN(..., object_hook=msgpack_numpy.decode, ...)
- metavariable-regex:
metavariable: $FN
regex: unpackb?

- patterns:
- pattern: msgpack.$FN(..., default=msgpack_numpy.encode, ...)
- metavariable-regex:
metavariable: $FN
regex: packb?
3 changes: 3 additions & 0 deletions python/numpy-in-pytorch-modules.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ def forward(self, x, y):
# ruleid: numpy-in-pytorch-modules
y = np.concatenate((x, y), axis=1)

# ruleid: numpy-in-pytorch-modules
np.ndarray.sort(y)

def forward_correct(self, x, y):
x = self.dropout(x)
# ok: numpy-in-pytorch-modules
Expand Down
4 changes: 3 additions & 1 deletion python/numpy-in-pytorch-modules.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,9 @@ rules:
- https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects

patterns:
- pattern: $RESULT = numpy.$FUNCTION(...)
- pattern-either:
- pattern: numpy.$FN(...)
- pattern: numpy. ... .$FN(...)
- pattern-inside: |
class $MODULE(torch.nn.Module):
...
50 changes: 50 additions & 0 deletions python/pandas-eval.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
import pandas as pd

def id(x):
return x

expr = id("something")
colB = id("B")
subexpr = id("df.age * 2")

df1 = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
# ok: pandas-eval
r11 = df1.eval('A + B')
# ruleid: pandas-eval
r12 = df1.eval(expr)
# ok: pandas-eval
r13 = df1.eval(f"A + B")
# ruleid: pandas-eval
r14 = df1.eval(f"A + {colB}")
# ok: pandas-eval
r15 = df1.eval(f"")

df2 = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
# ok: pandas-eval
pd.eval("double_age = df.age * 2", target=df2)
# ruleid: pandas-eval
pd.eval(expr, target=df2)
# ok: pandas-eval
pd.eval(f"double_age = df.age * 2", target=df2)
# ruleid: pandas-eval
pd.eval(f"double_age = {subexpr}", target=df2)

df3 = pd.DataFrame({
'A': range(1, 6),
'B': range(10, 0, -2)
})
# ok: pandas-eval
r31 = df3.query('A > B')
# ruleid: pandas-eval
r32 = df3.query(expr)
# ok: pandas-eval
r33 = df3.query(f'A > B')
# ruleid: pandas-eval
r34 = df3.query(f'A > {colB}')

class X:
def query(self, x):
pass

# ok: pandas-eval
X().query(expr)
48 changes: 48 additions & 0 deletions python/pandas-eval.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
rules:
- id: pandas-eval
message: >-
Pandas eval() and query() may be dangerous if used to evaluate
dynamic content. If this content can be input from outside the program, this
may be a code injection vulnerability. Ensure evaluated content is not definable
by external sources.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-95: Improper Neutralization of Directives in Dynamically Evaluated Code ('Eval Injection')"
subcategory:
- audit
confidence: LOW
likelihood: LOW
impact: HIGH
technology: [pandas]
description: "Potential arbitrary code execution from `pandas` functions that evaluate user-provided expressions"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern-inside: |
import pandas
...
- pattern-either:
- patterns:
- pattern: pandas.DataFrame.$FN(...)
- pattern-not: pandas.DataFrame.$FN("...", ...)
- pattern-not: pandas.DataFrame.$FN(f"", ...)

- patterns:
- pattern: pandas.$FN(...)
- pattern-not: pandas.$FN("...", ...)
- pattern-not: pandas.$FN(f"", ...)

- patterns:
- pattern-inside: |
$DF = pandas.DataFrame(...)
...
- pattern: $DF.$FN(...)
- pattern-not: $DF.$FN("...", ...)
- pattern-not: $DF.$FN(f"", ...)

- metavariable-regex:
metavariable: $FN
regex: (eval|query)
20 changes: 20 additions & 0 deletions python/pickles-in-keras-deprecation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
from tensorflow import keras
from keras.models import load_model

def id(x):
return x

h5_file_path = id("model.h5")
keras_file_path = id("model.keras")

# ruleid: pickles-in-keras-deprecation
m1 = load_model("model.h5")

# ok: pickles-in-keras-deprecation
m2 = load_model("model.keras")

# ruleid: pickles-in-keras-deprecation
m3 = load_model(h5_file_path)

# ruleid: pickles-in-keras-deprecation
m4 = load_model(keras_file_path)
36 changes: 36 additions & 0 deletions python/pickles-in-keras-deprecation.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
rules:
- id: pickles-in-keras-deprecation
message: >-
The usage of pickle and hdf5 formats for model files are deprecated in Keras.
The keras.models.load_model function is deprecated as well. Keras is now
embedded in Tensorflow 2 under tensorflow.keras.
languages: [python]
severity: WARNING
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [keras]
description: "Potential arbitrary code execution from Keras' load_model function"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern-either:
- pattern: keras.models.load_model(...)
- pattern: tensorflow.keras.models.load_model(...)
- pattern: keras.saving.load_model(...)
- pattern: tensorflow.keras.saving.load_model(...)
- pattern-not:
patterns:
- pattern-either:
- pattern: keras.models.load_model($FILE)
- pattern: tensorflow.keras.models.load_model($FILE)
- pattern: keras.saving.load_model($FILE)
- pattern: tensorflow.keras.saving.load_model($FILE)
- metavariable-regex:
metavariable: $FILE
regex: .*\.keras
20 changes: 20 additions & 0 deletions python/pickles-in-keras.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
from tensorflow import keras
from keras.models import load_model

def id(x):
return x

h5_file_path = id("model.h5")
keras_file_path = id("model.keras")

# ok: pickles-in-keras
m1 = load_model("model.h5")

# ok: pickles-in-keras
m2 = load_model("model.keras")

# ruleid: pickles-in-keras
m3 = load_model(h5_file_path)

# ruleid: pickles-in-keras
m4 = load_model(keras_file_path)
36 changes: 36 additions & 0 deletions python/pickles-in-keras.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
rules:
- id: pickles-in-keras
message: >-
Keras' load_model function may result in arbitrary code execution:
- It can load vulnerable pickled models
- It can load an hdf5 model that contains a lambda layer with arbitrary code
that will be executed every time the model is used (loading, training, eval)
Note: Keras loading with the built-in file format should be safe as long as checks are not disabled.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [keras]
description: "Potential arbitrary code execution from Keras' load_model function"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern-either:
- patterns:
- pattern: keras.models.load_model(...)
- pattern-not: keras.models.load_model("...", ...)
- patterns:
- pattern: tensorflow.keras.models.load_model(...)
- pattern-not: tensorflow.keras.models.load_model("...", ...)
- patterns:
- pattern: keras.saving.load_model(...)
- pattern-not: keras.saving.load_model("...", ...)
- patterns:
- pattern: tensorflow.keras.saving.load_model(...)
- pattern-not: tensorflow.keras.saving.load_model("...", ...)
4 changes: 4 additions & 0 deletions python/pickles-in-pytorch.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,3 +29,7 @@ def test(arg):

# ok: pickles-in-pytorch
model.load_state_dict(torch.load(arg))

state_dict = model.state_dict()
# ok: pickles-in-pytorch
torch.save(state_dict, arg)
9 changes: 8 additions & 1 deletion python/pickles-in-pytorch.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -24,4 +24,11 @@ rules:
- pattern-not: torch.load("...")
- pattern-not: torch.save(..., "...")
- pattern-not: torch.save($M.state_dict(), ...)
- pattern-not-inside: $M.load_state_dict(torch.load(...))
- pattern-not-inside: $M.load_state_dict(...)

- pattern-not:
patterns:
- pattern: torch.save($STATE_DICT, ...)
- pattern-inside: |
$STATE_DICT = $M.state_dict()
...
11 changes: 11 additions & 0 deletions python/pickles-in-tensorflow.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
import tensorflow as tf

def id(x):
return x

model_dir = id("model_dir")

# ok: pickles-in-tensorflow
m1 = tf.saved_model.load("model_dir")
# ruleid: pickles-in-tensorflow
m2 = tf.saved_model.load(model_dir)
21 changes: 21 additions & 0 deletions python/pickles-in-tensorflow.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
rules:
- id: pickles-in-tensorflow
message: >-
Tensorflow's low-level load function may result in arbitrary code execution.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [keras]
description: "Potential arbitrary code execution from tensorflow's load function"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern: tensorflow.saved_model.load(...)
- pattern-not: tensorflow.saved_model.load("...", ...)

0 comments on commit 69fd8d8

Please sign in to comment.