Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML rules fixes, new rule for msgpack-numpy #39

Merged
merged 14 commits into from
Apr 3, 2024
4 changes: 4 additions & 0 deletions python/lxml-in-pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,3 +24,7 @@
kwargs2 = {"io": touch, "flavor":"html5lib"}
# ok: lxml-in-pandas
pd.read_html(**kwargs2)

def test(**kwargs):
# ruleid: lxml-in-pandas
pd.read_html(**kwargs)
31 changes: 31 additions & 0 deletions python/msgpack-numpy.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
import msgpack
import msgpack_numpy as m
import numpy as np

x = np.random.rand(5)
# ruleid: msgpack-numpy
x_enc = msgpack.packb(x, default=m.encode)
# ruleid: msgpack-numpy
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)

# ok: msgpack-numpy
x_enc2 = msgpack.packb(x)
# ok: msgpack-numpy
x_rec2 = msgpack.unpackb(x_enc2)

# ok: msgpack-numpy
x_enc3 = msgpack.load(x)
# ok: msgpack-numpy
x_rec3 = msgpack.loads(x_enc2)

m.patch()

# ruleid: msgpack-numpy
x_enc3 = msgpack.packb(x)
# ruleid: msgpack-numpy
x_rec3 = msgpack.unpackb(x_enc2)

# ruleid: msgpack-numpy
x_enc3 = msgpack.load(x)
# ruleid: msgpack-numpy
x_rec3 = msgpack.loads(x_enc2)
41 changes: 41 additions & 0 deletions python/msgpack-numpy.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
rules:
- id: msgpack-numpy
message: >-
Found usage of msgpack-numpy unpacking, which relies on pickle to deserialize numpy arrays containing objects.
Functions reliant on pickle can result in arbitrary code execution.
Consider switching to a safer serialization method.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [numpy]
description: "Potential arbitrary code execution from functions reliant on pickling"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

pattern-either:
- patterns:
- pattern: msgpack.$FN(...)
- metavariable-regex:
metavariable: $FN
regex: (loads?|dumps?|packb?|unpackb?)
- pattern-inside: |
msgpack_numpy.patch()
...

- patterns:
- pattern: msgpack.$FN(..., object_hook=msgpack_numpy.decode, ...)
- metavariable-regex:
metavariable: $FN
regex: unpackb?

- patterns:
- pattern: msgpack.$FN(..., default=msgpack_numpy.encode, ...)
- metavariable-regex:
metavariable: $FN
regex: packb?
3 changes: 3 additions & 0 deletions python/numpy-in-pytorch-modules.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ def forward(self, x, y):
# ruleid: numpy-in-pytorch-modules
y = np.concatenate((x, y), axis=1)

# ruleid: numpy-in-pytorch-modules
np.ndarray.sort(y)

def forward_correct(self, x, y):
x = self.dropout(x)
# ok: numpy-in-pytorch-modules
Expand Down
4 changes: 3 additions & 1 deletion python/numpy-in-pytorch-modules.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,9 @@ rules:
- https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects

patterns:
- pattern: $RESULT = numpy.$FUNCTION(...)
- pattern-either:
- pattern: numpy.$FN(...)
- pattern: numpy. ... .$FN(...)
- pattern-inside: |
class $MODULE(torch.nn.Module):
...
50 changes: 50 additions & 0 deletions python/pandas-eval.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
import pandas as pd

def id(x):
return x

expr = id("something")
colB = id("B")
subexpr = id("df.age * 2")

df1 = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
# ok: pandas-eval
r11 = df1.eval('A + B')
# ruleid: pandas-eval
r12 = df1.eval(expr)
# ok: pandas-eval
r13 = df1.eval(f"A + B")
# ruleid: pandas-eval
r14 = df1.eval(f"A + {colB}")
# ok: pandas-eval
r15 = df1.eval(f"")

df2 = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
# ok: pandas-eval
pd.eval("double_age = df.age * 2", target=df2)
# ruleid: pandas-eval
pd.eval(expr, target=df2)
# ok: pandas-eval
pd.eval(f"double_age = df.age * 2", target=df2)
# ruleid: pandas-eval
pd.eval(f"double_age = {subexpr}", target=df2)

df3 = pd.DataFrame({
'A': range(1, 6),
'B': range(10, 0, -2)
})
# ok: pandas-eval
r31 = df3.query('A > B')
# ruleid: pandas-eval
r32 = df3.query(expr)
# ok: pandas-eval
r33 = df3.query(f'A > B')
# ruleid: pandas-eval
r34 = df3.query(f'A > {colB}')

class X:
def query(self, x):
pass

# ok: pandas-eval
X().query(expr)
48 changes: 48 additions & 0 deletions python/pandas-eval.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
rules:
- id: pandas-eval
message: >-
Pandas eval() and query() may be dangerous if used to evaluate
dynamic content. If this content can be input from outside the program, this
may be a code injection vulnerability. Ensure evaluated content is not definable
by external sources.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-95: Improper Neutralization of Directives in Dynamically Evaluated Code ('Eval Injection')"
subcategory:
- audit
confidence: LOW
likelihood: LOW
impact: HIGH
technology: [pandas]
description: "Potential arbitrary code execution from `pandas` functions that evaluate user-provided expressions"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern-inside: |
import pandas
...
- pattern-either:
- patterns:
- pattern: pandas.DataFrame.$FN(...)
- pattern-not: pandas.DataFrame.$FN("...", ...)
- pattern-not: pandas.DataFrame.$FN(f"", ...)
GrosQuildu marked this conversation as resolved.
Show resolved Hide resolved

- patterns:
- pattern: pandas.$FN(...)
- pattern-not: pandas.$FN("...", ...)
- pattern-not: pandas.$FN(f"", ...)

- patterns:
- pattern-inside: |
$DF = pandas.DataFrame(...)
...
- pattern: $DF.$FN(...)
- pattern-not: $DF.$FN("...", ...)
- pattern-not: $DF.$FN(f"", ...)

- metavariable-regex:
metavariable: $FN
regex: (eval|query)
20 changes: 20 additions & 0 deletions python/pickles-in-keras-deprecation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
from tensorflow import keras
from keras.models import load_model

def id(x):
return x

h5_file_path = id("model.h5")
keras_file_path = id("model.keras")

# ruleid: pickles-in-keras-deprecation
m1 = load_model("model.h5")

# ok: pickles-in-keras-deprecation
m2 = load_model("model.keras")

# ruleid: pickles-in-keras-deprecation
m3 = load_model(h5_file_path)

# ruleid: pickles-in-keras-deprecation
m4 = load_model(keras_file_path)
36 changes: 36 additions & 0 deletions python/pickles-in-keras-deprecation.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
rules:
- id: pickles-in-keras-deprecation
message: >-
The usage of pickle and hdf5 formats for model files are deprecated in Keras.
The keras.models.load_model function is deprecated as well. Keras is now
embedded in Tensorflow 2 under tensorflow.keras.
languages: [python]
severity: WARNING
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [keras]
description: "Potential arbitrary code execution from Keras' load_model function"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern-either:
- pattern: keras.models.load_model(...)
- pattern: tensorflow.keras.models.load_model(...)
- pattern: keras.saving.load_model(...)
- pattern: tensorflow.keras.saving.load_model(...)
- pattern-not:
patterns:
- pattern-either:
- pattern: keras.models.load_model($FILE)
- pattern: tensorflow.keras.models.load_model($FILE)
- pattern: keras.saving.load_model($FILE)
- pattern: tensorflow.keras.saving.load_model($FILE)
- metavariable-regex:
metavariable: $FILE
regex: .*\.keras
20 changes: 20 additions & 0 deletions python/pickles-in-keras.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
from tensorflow import keras
from keras.models import load_model

def id(x):
return x

h5_file_path = id("model.h5")
keras_file_path = id("model.keras")

# ok: pickles-in-keras
m1 = load_model("model.h5")

# ok: pickles-in-keras
m2 = load_model("model.keras")

# ruleid: pickles-in-keras
m3 = load_model(h5_file_path)

# ruleid: pickles-in-keras
m4 = load_model(keras_file_path)
36 changes: 36 additions & 0 deletions python/pickles-in-keras.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
rules:
- id: pickles-in-keras
message: >-
Keras' load_model function may result in arbitrary code execution:
- It can load vulnerable pickled models
- It can load an hdf5 model that contains a lambda layer with arbitrary code
that will be executed every time the model is used (loading, training, eval)
Note: Keras loading with the built-in file format should be safe as long as checks are not disabled.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [keras]
description: "Potential arbitrary code execution from Keras' load_model function"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern-either:
- patterns:
- pattern: keras.models.load_model(...)
- pattern-not: keras.models.load_model("...", ...)
- patterns:
- pattern: tensorflow.keras.models.load_model(...)
- pattern-not: tensorflow.keras.models.load_model("...", ...)
- patterns:
- pattern: keras.saving.load_model(...)
- pattern-not: keras.saving.load_model("...", ...)
- patterns:
- pattern: tensorflow.keras.saving.load_model(...)
- pattern-not: tensorflow.keras.saving.load_model("...", ...)
4 changes: 4 additions & 0 deletions python/pickles-in-pytorch.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,3 +29,7 @@ def test(arg):

# ok: pickles-in-pytorch
model.load_state_dict(torch.load(arg))

state_dict = model.state_dict()
# ok: pickles-in-pytorch
torch.save(state_dict, arg)
9 changes: 8 additions & 1 deletion python/pickles-in-pytorch.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -24,4 +24,11 @@ rules:
- pattern-not: torch.load("...")
- pattern-not: torch.save(..., "...")
- pattern-not: torch.save($M.state_dict(), ...)
- pattern-not-inside: $M.load_state_dict(torch.load(...))
- pattern-not-inside: $M.load_state_dict(...)

- pattern-not:
patterns:
- pattern: torch.save($STATE_DICT, ...)
- pattern-inside: |
$STATE_DICT = $M.state_dict()
...
11 changes: 11 additions & 0 deletions python/pickles-in-tensorflow.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
import tensorflow as tf

def id(x):
return x

model_dir = id("model_dir")

# ok: pickles-in-tensorflow
m1 = tf.saved_model.load("model_dir")
# ruleid: pickles-in-tensorflow
m2 = tf.saved_model.load(model_dir)
21 changes: 21 additions & 0 deletions python/pickles-in-tensorflow.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
rules:
- id: pickles-in-tensorflow
message: >-
Tensorflow's low-level load function may result in arbitrary code execution.
languages: [python]
severity: ERROR
metadata:
category: security
cwe: "CWE-502: Deserialization of Untrusted Data"
subcategory: [vuln]
confidence: MEDIUM
likelihood: MEDIUM
impact: HIGH
technology: [keras]
description: "Potential arbitrary code execution from tensorflow's load function"
references:
- https://blog.trailofbits.com/2021/03/15/never-a-dill-moment-exploiting-machine-learning-pickle-files/

patterns:
- pattern: tensorflow.saved_model.load(...)
- pattern-not: tensorflow.saved_model.load("...", ...)
Loading