Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[analyzer] EvalBinOpLL should return Unknown less often #114222

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

steakhal
Copy link
Contributor

SValBuilder::getKnownValue, getMinValue, getMaxValue use SValBuilder::simplifySVal.

simplifySVal does repeated simplification until a fixed-point is reached. A single step is done by SimpleSValBuilder::simplifySValOnce, using a Simplifier visitor. That will basically decompose SymSymExprs, and apply constant folding using the constraints we have in the State. Once it decomposes a SymSymExpr, it simplifies both sides and then uses the SValBuilder::evalBinOp to reconstruct the same - but now simpler - SymSymExpr, while applying some caching to remain performant.

This decomposition, and then the subsequent re-composition poses new challenges to the SValBuilder::evalBinOp, which is built to handle expressions coming from real C/C++ code, thus applying some implicit assumptions.

One previous assumption was that nobody would form an expression like "((int*)0) - q" (where q is an int pointer), because it doesn't really makes sense to write code like that.

However, during simplification, we may end up with a call to evalBinOp similar to this.

To me, simplifying a SymbolRef should never result in Unknown or Undef, unless it was Unknown or Undef initially or, during simplification we realized that it's a division by zero once we did the constant folding, etc.

In the following case the simplified SVal should not become UnknownVal:

void top(char *p, char *q) {
  int diff = p - q; // diff: reg<p> - reg<q>
  if (!p) // p: NULL
    simplify(diff); // diff after simplification should be: 0(loc) - reg<q>
}

Returning Unknown from the simplifySVal can weaken analysis precision in other places too, such as in SValBuilder::getKnownValue, getMinValue, or getMaxValue because we call simplifySVal before doing anything else.

For nonloc::SymbolVals, this loss of precision is critical, because for those the SymbolRef carries an accurate type of the encoded computation, thus we should at least have a conservative upper or lower bound that we could return from getMinValue or getMaxValue - yet we would just return nullptr.

const llvm::APSInt *SimpleSValBuilder::getKnownValue(ProgramStateRef state,
                                                      SVal V) {
  return getConstValue(state, simplifySVal(state, V));
}

const llvm::APSInt *SimpleSValBuilder::getMinValue(ProgramStateRef state,
                                                    SVal V) {
  V = simplifySVal(state, V);

  if (const llvm::APSInt *Res = getConcreteValue(V))
    return Res;

  if (SymbolRef Sym = V.getAsSymbol())
    return state->getConstraintManager().getSymMinVal(state, Sym);

  return nullptr;
}

For now, I don't plan to make the simplification bullet-proof, I'm just explaining why I made this change and what you need to look out for in the future if you see a similar issue.

CPP-5750

SValBuilder::getKnownValue, getMinValue, getMaxValue use
SValBuilder::simplifySVal.

simplifySVal does repeated simplification until a fixed-point is
reached. A single step is done by SimpleSValBuilder::simplifySValOnce,
using a Simplifier visitor. That will basically decompose SymSymExprs,
and apply constant folding using the constraints we have in the State.
Once it decomposes a SymSymExpr, it simplifies both sides and then uses
the SValBuilder::evalBinOp to reconstruct the same - but now simpler -
SymSymExpr, while applying some caching to remain performant.

This decomposition, and then the subsequent re-composition poses new
challenges to the SValBuilder::evalBinOp, which is built to handle
expressions coming from real C/C++ code, thus applying some implicit
assumptions.

One previous assumption was that nobody would form an expression like
"((int*)0) - q" (where q is an int pointer), because it doesn't really
makes sense to write code like that.

However, during simplification, we may end up with a call to evalBinOp
similar to this.

To me, simplifying a SymbolRef should never result in Unknown or Undef,
unless it was Unknown or Undef initially or, during simplification we
realized that it's a division by zero once we did the constant folding,
etc.

In the following case the simplified SVal should not become UnknownVal:
```c++
void top(char *p, char *q) {
  int diff = p - q; // diff: reg<p> - reg<q>
  if (!p) // p: NULL
    simplify(diff); // diff after simplification should be: 0(loc) - reg<q>
}
```

Returning Unknown from the simplifySVal can weaken analysis precision in
other places too, such as in SValBuilder::getKnownValue, getMinValue, or
getMaxValue because we call simplifySVal before doing anything else.

For nonloc::SymbolVals, this loss of precision is critical, because for
those the SymbolRef carries an accurate type of the encoded computation,
thus we should at least have a conservative upper or lower bound that we
could return from getMinValue or getMaxValue - yet we would just return
nullptr.

```c++
const llvm::APSInt *SimpleSValBuilder::getKnownValue(ProgramStateRef state,
                                                      SVal V) {
  return getConstValue(state, simplifySVal(state, V));
}

const llvm::APSInt *SimpleSValBuilder::getMinValue(ProgramStateRef state,
                                                    SVal V) {
  V = simplifySVal(state, V);

  if (const llvm::APSInt *Res = getConcreteValue(V))
    return Res;

  if (SymbolRef Sym = V.getAsSymbol())
    return state->getConstraintManager().getSymMinVal(state, Sym);

  return nullptr;
}
```

For now, I don't plan to make the simplification bullet-proof, I'm just
explaining why I made this change and what you need to look out for in
the future if you see a similar issue.

CPP-5750
@llvmbot llvmbot added the clang Clang issues not falling into any other category label Oct 30, 2024
@steakhal
Copy link
Contributor Author

@necto

@llvmbot
Copy link
Collaborator

llvmbot commented Oct 30, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-static-analyzer-1

Author: Balazs Benics (steakhal)

Changes

SValBuilder::getKnownValue, getMinValue, getMaxValue use SValBuilder::simplifySVal.

simplifySVal does repeated simplification until a fixed-point is reached. A single step is done by SimpleSValBuilder::simplifySValOnce, using a Simplifier visitor. That will basically decompose SymSymExprs, and apply constant folding using the constraints we have in the State. Once it decomposes a SymSymExpr, it simplifies both sides and then uses the SValBuilder::evalBinOp to reconstruct the same - but now simpler - SymSymExpr, while applying some caching to remain performant.

This decomposition, and then the subsequent re-composition poses new challenges to the SValBuilder::evalBinOp, which is built to handle expressions coming from real C/C++ code, thus applying some implicit assumptions.

One previous assumption was that nobody would form an expression like "((int*)0) - q" (where q is an int pointer), because it doesn't really makes sense to write code like that.

However, during simplification, we may end up with a call to evalBinOp similar to this.

To me, simplifying a SymbolRef should never result in Unknown or Undef, unless it was Unknown or Undef initially or, during simplification we realized that it's a division by zero once we did the constant folding, etc.

In the following case the simplified SVal should not become UnknownVal:

void top(char *p, char *q) {
  int diff = p - q; // diff: reg&lt;p&gt; - reg&lt;q&gt;
  if (!p) // p: NULL
    simplify(diff); // diff after simplification should be: 0(loc) - reg&lt;q&gt;
}

Returning Unknown from the simplifySVal can weaken analysis precision in other places too, such as in SValBuilder::getKnownValue, getMinValue, or getMaxValue because we call simplifySVal before doing anything else.

For nonloc::SymbolVals, this loss of precision is critical, because for those the SymbolRef carries an accurate type of the encoded computation, thus we should at least have a conservative upper or lower bound that we could return from getMinValue or getMaxValue - yet we would just return nullptr.

const llvm::APSInt *SimpleSValBuilder::getKnownValue(ProgramStateRef state,
                                                      SVal V) {
  return getConstValue(state, simplifySVal(state, V));
}

const llvm::APSInt *SimpleSValBuilder::getMinValue(ProgramStateRef state,
                                                    SVal V) {
  V = simplifySVal(state, V);

  if (const llvm::APSInt *Res = getConcreteValue(V))
    return Res;

  if (SymbolRef Sym = V.getAsSymbol())
    return state-&gt;getConstraintManager().getSymMinVal(state, Sym);

  return nullptr;
}

For now, I don't plan to make the simplification bullet-proof, I'm just explaining why I made this change and what you need to look out for in the future if you see a similar issue.

CPP-5750


Full diff: https://github.com/llvm/llvm-project/pull/114222.diff

3 Files Affected:

  • (modified) clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp (+4-3)
  • (modified) clang/unittests/StaticAnalyzer/CMakeLists.txt (+1)
  • (added) clang/unittests/StaticAnalyzer/SValSimplifyerTest.cpp (+103)
diff --git a/clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp b/clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
index 45e48d435aca6a..229169f848e228 100644
--- a/clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
+++ b/clang/lib/StaticAnalyzer/Core/SimpleSValBuilder.cpp
@@ -860,11 +860,12 @@ SVal SimpleSValBuilder::evalBinOpLL(ProgramStateRef state,
     // If one of the operands is a symbol and the other is a constant,
     // build an expression for use by the constraint manager.
     if (SymbolRef rSym = rhs.getAsLocSymbol()) {
-      // We can only build expressions with symbols on the left,
-      // so we need a reversible operator.
-      if (!BinaryOperator::isComparisonOp(op) || op == BO_Cmp)
+      if (op == BO_Cmp)
         return UnknownVal();
 
+      if (!BinaryOperator::isComparisonOp(op))
+        return makeNonLoc(L.getValue(), op, rSym, resultTy);
+
       op = BinaryOperator::reverseComparisonOp(op);
       return makeNonLoc(rSym, op, L.getValue(), resultTy);
     }
diff --git a/clang/unittests/StaticAnalyzer/CMakeLists.txt b/clang/unittests/StaticAnalyzer/CMakeLists.txt
index 5ef72cfaea4011..f5da86e5456030 100644
--- a/clang/unittests/StaticAnalyzer/CMakeLists.txt
+++ b/clang/unittests/StaticAnalyzer/CMakeLists.txt
@@ -20,6 +20,7 @@ add_clang_unittest(StaticAnalysisTests
   RegisterCustomCheckersTest.cpp
   StoreTest.cpp
   SymbolReaperTest.cpp
+  SValSimplifyerTest.cpp
   SValTest.cpp
   TestReturnValueUnderConstruction.cpp
   Z3CrosscheckOracleTest.cpp
diff --git a/clang/unittests/StaticAnalyzer/SValSimplifyerTest.cpp b/clang/unittests/StaticAnalyzer/SValSimplifyerTest.cpp
new file mode 100644
index 00000000000000..b3feb0e4cce231
--- /dev/null
+++ b/clang/unittests/StaticAnalyzer/SValSimplifyerTest.cpp
@@ -0,0 +1,103 @@
+//===- unittests/StaticAnalyzer/SValSimplifyerTest.cpp --------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "CheckerRegistration.h"
+#include "clang/StaticAnalyzer/Core/BugReporter/BugReporter.h"
+#include "clang/StaticAnalyzer/Core/BugReporter/BugType.h"
+#include "clang/StaticAnalyzer/Core/Checker.h"
+#include "clang/StaticAnalyzer/Core/PathSensitive/CallDescription.h"
+#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"
+#include "clang/StaticAnalyzer/Frontend/AnalysisConsumer.h"
+#include "clang/StaticAnalyzer/Frontend/CheckerRegistry.h"
+#include "llvm/ADT/Twine.h"
+#include "llvm/Support/raw_ostream.h"
+#include "gtest/gtest.h"
+
+using namespace clang;
+using namespace ento;
+
+static std::string toString(SVal V) {
+  std::string Result;
+  llvm::raw_string_ostream Stream(Result);
+  V.dumpToStream(Stream);
+  return Result;
+}
+
+static void replace(std::string &Content, StringRef Substr,
+                    StringRef Replacement) {
+  std::size_t Pos = 0;
+  while ((Pos = Content.find(Substr, Pos)) != std::string::npos) {
+    Content.replace(Pos, Substr.size(), Replacement);
+    Pos += Replacement.size();
+  }
+}
+
+namespace {
+
+class SimplifyChecker : public Checker<check::PreCall> {
+  const BugType Bug{this, "SimplifyChecker"};
+  const CallDescription SimplifyCall{CDM::SimpleFunc, {"simplify"}, 1};
+
+  void report(CheckerContext &C, const Expr *E, StringRef Description) const {
+    PathDiagnosticLocation Loc(E->getExprLoc(), C.getSourceManager(), E);
+    auto Report = std::make_unique<BasicBugReport>(Bug, Description, Loc);
+    C.emitReport(std::move(Report));
+  }
+
+public:
+  void checkPreCall(const CallEvent &Call, CheckerContext &C) const {
+    if (!SimplifyCall.matches(Call))
+      return;
+    const Expr *Arg = Call.getArgExpr(0);
+    SVal Val = C.getSVal(Arg);
+    SVal SimplifiedVal = C.getSValBuilder().simplifySVal(C.getState(), Val);
+    std::string Subject = toString(Val);
+    std::string Simplified = toString(SimplifiedVal);
+    std::string Message = (llvm::Twine{Subject} + " -> " + Simplified).str();
+    report(C, Arg, Message);
+  }
+};
+} // namespace
+
+static void addSimplifyChecker(AnalysisASTConsumer &AnalysisConsumer,
+                               AnalyzerOptions &AnOpts) {
+  AnOpts.CheckersAndPackages = {{"SimplifyChecker", true}};
+  AnalysisConsumer.AddCheckerRegistrationFn([](CheckerRegistry &Registry) {
+    Registry.addChecker<SimplifyChecker>("SimplifyChecker", "EmptyDescription",
+                                         "EmptyDocsUri");
+  });
+}
+
+static void runThisCheckerOnCode(const std::string &Code, std::string &Diags) {
+  ASSERT_TRUE(runCheckerOnCode<addSimplifyChecker>(Code, Diags,
+                                                   /*OnlyEmitWarnings=*/true));
+  ASSERT_FALSE(Diags.empty());
+  ASSERT_EQ(Diags.back(), '\n');
+  Diags.pop_back();
+}
+
+namespace {
+
+TEST(SValSimplifyerTest, LHSConstrainedNullPtrDiff) {
+  constexpr auto Code = R"cpp(
+template <class T> void simplify(T);
+void LHSConstrainedNullPtrDiff(char *p, char *q) {
+  int diff = p - q;
+  if (!p)
+    simplify(diff);
+})cpp";
+
+  std::string Diags;
+  runThisCheckerOnCode(Code, Diags);
+  replace(Diags, "(reg_$0<char * p>)", "reg_p");
+  replace(Diags, "(reg_$1<char * q>)", "reg_q");
+  // This should not be simplified to "Unknown".
+  EXPECT_EQ(Diags, "SimplifyChecker: reg_p - reg_q -> 0U - reg_q");
+}
+
+} // namespace

Copy link
Collaborator

@Xazax-hun Xazax-hun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense for me. It would be nice if we had a minimal reproducer for a regression test instead of maintaining a large-ish test.

@Xazax-hun
Copy link
Collaborator

To me, simplifying a SymbolRef should never result in Unknown or Undef, unless it was Unknown or Undef initially or, during simplification we realized that it's a division by zero once we did the constant folding, etc.

I understand that we might not be ready for this, but feels like at some point we should have this as a form of an assertion. I am not sure about the Undef part because there are many ways to get to undef during constant folding (signed overflows, division by zero, OOB access of an array, defer of an invalid pointer). Even if some of those cannot happen during the current simplification they might be modeled later. So, I think a simpler "the result can only be Unknown if the input was also Unknown" would make a lot of sense to me.

@steakhal
Copy link
Contributor Author

Makes sense for me. It would be nice if we had a minimal reproducer for a regression test instead of maintaining a large-ish test.

I wish I could add a LIT test instead.
Out of the get min/max val, or getKnownValue APIs, only the latter is used slightly more broadly, where the callsites can't assume it should return a valid result. Still, these APIs are barely used.
In my case I have additional context in my downstream checker, where I can make this assumption and I was surprised to see crashes. So I had no other options but to craft a unittest for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants