Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow due to excessive recursion when using DomainBuilder #174

Open
racko opened this issue Mar 22, 2023 · 6 comments
Open

Stack overflow due to excessive recursion when using DomainBuilder #174

racko opened this issue Mar 22, 2023 · 6 comments

Comments

@racko
Copy link

racko commented Mar 22, 2023

I was trying to define a domain of somewhat simple regexes:

// E -> E '|' E | T
// T -> T T | F
// F -> F* | F? | G
// G -> (E) | \\H | [characters]
// H -> \\ | '|' | * | ( | )
auto ArbitraryRegex()
{
    DomainBuilder builder;
    builder.Set<std::string>("E",
                             OneOf(Map([](const std::string& e1, const std::string& e2) { return e1 + '|' + e2; },
                                       builder.Get<std::string>("E"),
                                       builder.Get<std::string>("E")),
                                   builder.Get<std::string>("T")));
    builder.Set<std::string>("T",
                             OneOf(Map([](const std::string& e1, const std::string& e2) { return e1 + e2; },
                                       builder.Get<std::string>("T"),
                                       builder.Get<std::string>("T")),
                                   builder.Get<std::string>("F")));
    builder.Set<std::string>("F",
                             OneOf(Map([](const std::string& f) { return f + '*'; }, builder.Get<std::string>("F")),
                                   Map([](const std::string& f) { return f + '?'; }, builder.Get<std::string>("F")),
                                   builder.Get<std::string>("G")));
    builder.Set<std::string>(
        "G",
        OneOf(Map([](const std::string& e) { return '(' + e + ')'; }, builder.Get<std::string>("E")),
              Map([](const char h) { return std::string{"\\"} + h; }, builder.Get<char>("H")),
              StringOf(AlphaNumericChar()).WithSize(1)));
    builder.Set<char>("H", ElementOf({'\\', '|', '*', '(', ')'}));
    return std::move(builder).Finalize<std::string>("E");
}

The recursion in this domain blows up even 100MB of stack (ulimit -s 100000) almost immediately after starting fuzzing runs. Here's the bottom of the stack trace:

...
#355197 0x00005555557ed20e in fuzztest::internal::DomainBase<fuzztest::internal::AggregateOfImpl<std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, (fuzztest::internal::RequireCustomCorpusType)0, fuzztest::Domain<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::tuple<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::UntypedInit (
    this=0x6020000030f0, ref=...) at external/com_google_fuzztest/./fuzztest/internal/domains/domain_base.h:133
#355198 0x0000555555866022 in fuzztest::internal::FuzzTestFuzzerImpl::InitializeCorpus (this=0x613000000040, prng=...)
    at external/com_google_fuzztest/fuzztest/internal/runtime.cc:555
#355199 0x0000555555867290 in fuzztest::internal::FuzzTestFuzzerImpl::RunInFuzzingMode(int*, char***)::$_5::operator()() const (this=0x7fffffffd210)
    at external/com_google_fuzztest/fuzztest/internal/runtime.cc:743
#355200 0x0000555555866fe3 in fuzztest::internal::FuzzTestFuzzerImpl::RunInFuzzingMode (this=0x613000000040)
    at external/com_google_fuzztest/fuzztest/internal/runtime.cc:709
#355201 0x0000555555856ee9 in fuzztest::internal::GTest_TestAdaptor::TestBody (this=0x604000000c50)
    at external/com_google_fuzztest/./fuzztest/googletest_adaptor.h:60
#355202 0x0000555555bb394b in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x604000000c50, 
    method=&virtual testing::Test::TestBody(), location=0x555555c0bf20 "the test body") at external/com_google_googletest/googletest/src/gtest.cc:2599
#355203 0x0000555555b96cad in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x604000000c50, 
    method=&virtual testing::Test::TestBody(), location=0x555555c0bf20 "the test body") at external/com_google_googletest/googletest/src/gtest.cc:2635
#355204 0x0000555555b78f63 in testing::Test::Run (this=0x604000000c50) at external/com_google_googletest/googletest/src/gtest.cc:2674
#355205 0x0000555555b79bad in testing::TestInfo::Run (this=0x612000000640) at external/com_google_googletest/googletest/src/gtest.cc:2853
#355206 0x0000555555b7a44a in testing::TestSuite::Run (this=0x6120000007c0) at external/com_google_googletest/googletest/src/gtest.cc:3012
#355207 0x0000555555b8c6ff in testing::internal::UnitTestImpl::RunAllTests (this=0x616000005780)
    at external/com_google_googletest/googletest/src/gtest.cc:5870
#355208 0x0000555555bb422b in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x616000005780, 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x555555b8c280 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x555555c0c884 "auxiliary test code (environments or event listeners)") at external/com_google_googletest/googletest/src/gtest.cc:2599
#355209 0x0000555555b99633 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x616000005780, 
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x555555b8c280 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x555555c0c884 "auxiliary test code (environments or event listeners)") at external/com_google_googletest/googletest/src/gtest.cc:2635
#355210 0x0000555555b8c218 in testing::UnitTest::Run (this=0x5555566bdcb8 <testing::UnitTest::GetInstance()::instance>)
    at external/com_google_googletest/googletest/src/gtest.cc:5444
#355211 0x0000555555854861 in RUN_ALL_TESTS () at external/com_google_googletest/googletest/include/gtest/gtest.h:2293
#355212 0x0000555555851bae in main (argc=2, argv=0x7fffffffdba8) at external/com_google_fuzztest/fuzztest/fuzztest_gtest_main.cc:76

I tried to see if simplifying the domain definition helps, and it does. The following takes a few seconds more to blow up:

auto ArbitraryRecursive()
{
    DomainBuilder builder;
    builder.Set<std::string>("E",
                             OneOf(Map([](const std::string& e1, const std::string& e2) { return e1 + '|' + e2; },
                                       builder.Get<std::string>("E"),
                                       builder.Get<std::string>("E")),
                                   Just(std::string{})));
    return std::move(builder).Finalize<std::string>("E");
}

And the following doesn't run into stack overflows at all:

auto ArbitraryRecursive()
{
    DomainBuilder builder;
    builder.Set<std::string>("E",
                             OneOf(Map([](const std::string& inner) { return inner + '*'; },
                                       builder.Get<std::string>("E")),
                                   Just(std::string{})));
    return std::move(builder).Finalize<std::string>("E");
}

I'm thinking about how I could limit the recursion by changing the domain definition (e.g. introduce a "max depth" parameter. But I can't see how, because I can't pass additional parameters to DomainBuilder::Get.

If Get took additional parameters and Set could take a function (instead of the domain) which in turn takes these parameters and returns a domain, then I could consciously limit recursion, I think 🤔

@racko
Copy link
Author

racko commented Jan 16, 2024

I came across fuzztest grammars and noticed that they support a generation budget to limit recursion:

// Use for random AST generation. To avoid inifite recursive generation (i.e.,
// for grammar rules like `expr: expr '+' expr | literal`), we limit the
// generation with budget. We first assign the budget to be `kMaxGenerationNum`.
// Every time we generate a AST node, we decrement the budget by one. When we
// are running out of the budget (the budget is less or equal than 0), the
// generation will be in FallBack mode. The FallBack Mode ensures the generation
// ends. Specifically, every grammar rule that has more than 1 production rules
// has a fallback index. When every grammar rule chooses the fallback index
// during generation (aka, the FallBack Mode is on), generation will guarantee
// to end. The fallback index is precalculated by the code generator. The idea
// is simple: We define a symbol (terminal or non-terminal) as safe if it
// generates a finite string in the fallback mode. First we mark all terminals
// as safe. If a non-terminal has a production rule that consists of only safe
// symbols, we use the index of production rule as the fallback index and mark
// the non-terminal is safe. We repeat the process until every grammar rule has
// a fallback index.
inline constexpr int kMaxGenerationNum = 200;

Grammars don't seem to be documented as far as I can tell. Here's what I found:

A quick attempt at using this feature for my problem failed because the bazel macro can't be used outside of the fuzztest workspace:

ERROR: no such package 'fuzztest': BUILD file not found in any of the following directories. Add a BUILD file to a directory to mark it as a package.
 - /path/to/my/workspace/fuzztest
ERROR: /path/to/my/workspace/some/package/BUILD:122:28: no such package 'fuzztest': BUILD file not found in any of the following directories. Add a BUILD file to a directory to mark it as a package.
 - /path/to/my/workspace/fuzztest and referenced by '//some/package:regex_grammar'
ERROR: Analysis of target '//some/package:regex_grammar' failed; build aborted: Analysis failed

deps = ["//fuzztest:domain"],

Bazel tries to resolve the //fuzztest reference to a package within my workspace instead of @com_google_fuzztest//fuzztest

I fixed it with

diff --git a/build_defs/cc_fuzztest_grammar_library.bzl b/build_defs/cc_fuzztest_grammar_library.bzl
index ea8de65..657958c 100644
--- a/build_defs/cc_fuzztest_grammar_library.bzl
+++ b/build_defs/cc_fuzztest_grammar_library.bzl
@@ -33,7 +33,7 @@ def cc_fuzztest_grammar_library(name, srcs, top_level_rule = None):
     """
 
     output_file_name = name + ".h"
-    cmd = "$(location //tools:grammar_domain_code_generator)" + \
+    cmd = "$(location @com_google_fuzztest//tools:grammar_domain_code_generator)" + \
           " --output_header_file_path " + "$(@D)/" + output_file_name + \
           " --input_grammar_files " + "`echo $(SRCS) | tr ' ' ','`"
     if top_level_rule:
@@ -45,11 +45,11 @@ def cc_fuzztest_grammar_library(name, srcs, top_level_rule = None):
         outs = [output_file_name],
         cmd = cmd,
         heuristic_label_expansion = False,
-        tools = ["//tools:grammar_domain_code_generator"],
+        tools = ["@com_google_fuzztest//tools:grammar_domain_code_generator"],
     )
 
     native.cc_library(
         name = name,
         hdrs = [output_file_name],
-        deps = ["//fuzztest:domain"],
+        deps = ["@com_google_fuzztest//fuzztest:domain"],
     )

Then I created the following grammar:

// regex.g4
grammar REGEX_GRAMMAR;

E : E '|' E | T ;
T : T T | F ;
F : F '*' | F '?' | G ;
G : '(' E ')' | '\\' H | [a-zA-Z0-9_] ;
H : '\\' | '|' | '*' | '(' | ')' ;

WSPACE : [ \t\n\r]+ -> skip;

Used

# some/package/BUILD
load("@com_google_fuzztest//build_defs:cc_fuzztest_grammar_library.bzl", "cc_fuzztest_grammar_library")


cc_fuzztest_grammar_library(
    name = "regex_grammar",                                                    
    srcs = ["regex.g4"],                                                       
    top_level_rule = "E",                                                      
)

cc_test(                                                                       
    name = "fuzztest",                                             
    size = "small",                                                            
    srcs = ["fuzztest.cpp"],                                              
    deps = [                                                                   
        # ...                                                              
        "@com_google_fuzztest//fuzztest",                                      
        "@com_google_fuzztest//fuzztest:fuzztest_gtest_main",                            
        "@com_google_googletest//:gtest",                                      
        ":regex_grammar",                                                      
    ],                                                                         
)

And finally

/// @file fuzztest.cpp
#include "some/package/regex_grammar.h"

// ...

FUZZ_TEST(Fixture, Test).WithDomains(fuzztest::InEGrammar());

And it works 🙂
At least I got a test run that looked like it did what it's supposed to do.

With --config=fuzztest I got a different kind of stack related error. This time:

*** stack smashing detected ***: terminated

Program received signal SIGABRT, Aborted
0x00007ffff7a8783c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7a8783c in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7a37668 in raise () from /usr/lib/libc.so.6
#2  0x00007ffff7a1f4b8 in abort () from /usr/lib/libc.so.6
#3  0x00007ffff7a20390 in ?? () from /usr/lib/libc.so.6
#4  0x00007ffff7b17b4b in __fortify_fail () from /usr/lib/libc.so.6
#5  0x00007ffff7b18e56 in __stack_chk_fail () from /usr/lib/libc.so.6
#6  0x0000555555b71fb5 in absl::container_internal::raw_hash_set<absl::container_internal::FlatHashMapPolicy<int, int>, absl::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >::raw_hash_set (this=0x7fffffff7e28)
    at external/com_google_absl/absl/container/internal/raw_hash_set.h:3206
#7  0x0000555555b71f15 in absl::container_internal::raw_hash_map<absl::container_internal::FlatHashMapPolicy<int, int>, absl::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >::raw_hash_map (
    this=<error reading variable: Cannot access memory at address 0xfffffffffffffff8>)
    at external/com_google_absl/absl/container/internal/raw_hash_map.h:64
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I need to investigate more when I find the time.

I won't close this issue because using DomainBuilder like this still results in stack overflows.

@racko
Copy link
Author

racko commented Jan 17, 2024

I didn't get to investigating the stack smashing yet, but I found other things:

My understanding of antlr grammars was flawed: The above only has "lexer rules" but no "parser rules". There were some more issues which at least http://lab.antlr.org/ didn't accept. Now I use

grammar REGEX_GRAMMAR;
e : t ( '|' t )* ;
t : f f* ;
f : f '*' | f '?' | g ;
g : '(' e ')' | '\\' h | CHAR ;
h : '\\' | '|' | '*' | '(' | ')';

CHAR : [a-zA-Z0-9_] ;

which is accepted by both http://lab.antlr.org/ and fuzztest - at least after some fixes:

@racko
Copy link
Author

racko commented Jan 19, 2024

Using https://stackoverflow.com/a/51897264/8857097 I was able to narrow down the stack smashing problem.

ABSL_SWISSTABLE_ENABLE_GENERATIONS has some influence on the problem: It is only set in sanitizer builds and determines whether CommonFields inherits from an empty base class (without sanitizers) or CommonFieldsGenerationInfoEnabled (with sanitizers).

CommonFieldsGenerationInfoEnabled has three fields:

  • size_t reserved_growth_
  • size_t reservation_size_
  • GenerationType* generation_

Totalling 24 Bytes.

Ignoring the base class, CommonFields has four fields:

  • ctrl_t* control_
  • void* slots_
  • size_t capacity_
  • size_t size_

Totalling 32 Bytes.

But I see in the assembly that always this code is generated in the raw_hash_set constructor:

   0x00007ffff7b36630 <+0>:	push   rbp
   0x00007ffff7b36631 <+1>:	mov    rbp,rsp
   0x00007ffff7b36634 <+4>:	sub    rsp,0x70
   0x00007ffff7b36638 <+8>:	mov    rax,QWORD PTR fs:0x28
   0x00007ffff7b36641 <+17>:	mov    QWORD PTR [rbp-0x8],rax
...
   0x00007ffff7b3665c <+44>:	lea    rdi,[rbp-0x30]
   0x00007ffff7b36660 <+48>:	call   0x7ffff7b1c080 <absl::container_internal::CommonFields::CommonFields()@plt>

At the start we see the preamble, where the stack canary is stored at rbp-0x8.
Then we see the call to the CommonFields constructor. rdi is the first argument, i.e. this. Its value here is rbp-0x30, which leaves 0x28 = 40 Bytes for the CommonFields object. This is more than enough for the empty base class case but not enough for the other case, when CommonFields is 56 Bytes in size, including the base class.

The CommonFields constructor assembler code looks as expected.

With empty base class:

Dump of assembler code for function _ZN4absl18container_internal12CommonFieldsC2Ev:
=> 0x00005555555c3ea0 <+0>:	push   rbp
   0x00005555555c3ea1 <+1>:	mov    rbp,rsp
   0x00005555555c3ea4 <+4>:	sub    rsp,0x10
   0x00005555555c3ea8 <+8>:	mov    QWORD PTR [rbp-0x8],rdi
   0x00005555555c3eac <+12>:	mov    rax,QWORD PTR [rbp-0x8]
   0x00005555555c3eb0 <+16>:	mov    QWORD PTR [rbp-0x10],rax
   0x00005555555c3eb4 <+20>:	call   0x5555555c3f80 <absl::container_internal::EmptyGroup()>
   0x00005555555c3eb9 <+25>:	mov    rcx,rax
   0x00005555555c3ebc <+28>:	mov    rax,QWORD PTR [rbp-0x10]
   0x00005555555c3ec0 <+32>:	mov    QWORD PTR [rax],rcx
   0x00005555555c3ec3 <+35>:	mov    QWORD PTR [rax+0x8],0x0
   0x00005555555c3ecb <+43>:	mov    QWORD PTR [rax+0x10],0x0
   0x00005555555c3ed3 <+51>:	mov    QWORD PTR [rax+0x18],0x0
   0x00005555555c3edb <+59>:	add    rsp,0x10
   0x00005555555c3edf <+63>:	pop    rbp
   0x00005555555c3ee0 <+64>:	ret

We see that rdi is initialized to the return value of EmptyGroup() and rdi+0x8, rdi+0x10 and rdi+0x18 are initialized to 0. Given that rdi was set to rbp-0x30 in the raw_hash_set constructor, the stack canary remains untouched.

And the other case:

Dump of assembler code for function _ZN4absl18container_internal12CommonFieldsC2Ev:
...
   0x000055555571b51e <+30>:	mov    QWORD PTR [rbp-0x20],rdi
   0x000055555571b522 <+34>:	call   0x55555571b6a0 <absl::container_internal::CommonFieldsGenerationInfoEnabled::CommonFieldsGenerationInfoEnabled()>
   0x000055555571b527 <+39>:	mov    rax,QWORD PTR [rbp-0x20]
   0x000055555571b52b <+43>:	add    rax,0x18
   0x000055555571b52f <+47>:	mov    QWORD PTR [rbp-0x18],rax
   0x000055555571b533 <+51>:	call   0x55555571b760 <absl::container_internal::EmptyGroup()>
   0x000055555571b538 <+56>:	mov    rcx,rax
...
   0x000055555571b53f <+63>:	mov    QWORD PTR [rbp-0x10],rcx
...
   0x000055555571b55d <+93>:	mov    rax,QWORD PTR [rbp-0x20]
   0x000055555571b561 <+97>:	mov    rcx,QWORD PTR [rbp-0x18]
   0x000055555571b565 <+101>:	mov    rdx,QWORD PTR [rbp-0x10]
   0x000055555571b569 <+105>:	mov    QWORD PTR [rcx],rdx
   0x000055555571b56c <+108>:	add    rax,0x20
   0x000055555571b570 <+112>:	mov    QWORD PTR [rbp-0x28],rax
...
   0x000055555571b58e <+142>:	mov    rax,QWORD PTR [rbp-0x20]
   0x000055555571b592 <+146>:	mov    rcx,QWORD PTR [rbp-0x28]
   0x000055555571b596 <+150>:	mov    QWORD PTR [rcx],0x0
   0x000055555571b59d <+157>:	add    rax,0x28
   0x000055555571b5a1 <+161>:	mov    QWORD PTR [rbp-0x30],rax
...
   0x000055555571b5bf <+191>:	mov    rax,QWORD PTR [rbp-0x20]
   0x000055555571b5c3 <+195>:	mov    rcx,QWORD PTR [rbp-0x30]
   0x000055555571b5c7 <+199>:	mov    QWORD PTR [rcx],0x0
   0x000055555571b5ce <+206>:	add    rax,0x30
   0x000055555571b5d2 <+210>:	mov    QWORD PTR [rbp-0x38],rax
...
   0x000055555571b5f0 <+240>:	mov    rax,QWORD PTR [rbp-0x38]
   0x000055555571b5f4 <+244>:	mov    QWORD PTR [rax],0x0

It's very convoluted due to the asan checks which I have omitted here, but we see that rdi+0x18 is initialized to the return value of EmptyGroup() and rdi+0x20, rdi+0x28 and rdi+0x30 are initialized to 0. Given that rdi was set to rbp-0x30 in the raw_hash_set constructor, the last two writes overwrite the stack canary and return address 😕

Interestingly, the bug occurs in

absl::container_internal::raw_hash_map<absl::container_internal::FlatHashMapPolicy<int, int>, absl::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > >::raw_hash_map

but not in

absl::container_internal::raw_hash_map<absl::container_internal::FlatHashMapPolicy<std::basic_string_view<char, std::char_traits<char> >, absl::CommandLineFlag*>, absl::container_internal::StringHash, absl::container_internal::StringEq, std::allocator<std::pair<std::basic_string_view<char, std::char_traits<char> > const, absl::CommandLineFlag*> > >::raw_hash_map

In the second case very different assembly is generated and I'm having a tough time figuring out what is going on . Various asan references are noticable, which are not found in the broken case 😕

Smells like an ODR violation: Somehow a library that includes raw_hash_map but has not been built with asan is pulled in and the linker is free to choose this one when deduplicating the weak symbols.

@racko
Copy link
Author

racko commented Jan 19, 2024

I figured that it could only be related to the missing --copt=-fsanitize=address in fuzztest.bazelrc, which I noticed before, or bazel caching. So I add --copt=-fsanitize=address and disabled the bazel disk cache and lo and behold: It worked.
I removed the --copt=-fsanitize=address and it failed again, so caching was not the issue.

I reran bazel run @com_google_fuzztest//bazel:setup_configs > fuzztest.bazelrc and found that now (since 8bd9f89) build:fuzztest-common --copt=-fsanitize=address is already added by default 🙈

So now I have no problem with grammar fuzzing anymore 🙂
DomainBuilder is still broken, but I have something in mind here as well 🤔

@racko
Copy link
Author

racko commented Jan 20, 2024

I experimented with workarounds for DomainBuilder's deficiency because with grammars you can only generate strings. Thus for other recursive domains a solution would still be helpful. I will describe three alternatives.

Limited Strict Recursion

The basic idea in all alternatives is that we provide separate, recursively defined domains, i.e functions returning domains.

This allows us, for example to pass a size argument to the function that we can decrease in recursive calls. By providing a recursion base, we can avoid infinite recursion. And by choosing the size argument appropriately, we can avoid stack overflows:

namespace strict
{
fuzztest::Domain<std::string> ArbitraryE(std::size_t size);

fuzztest::Domain<char> ArbitraryH() { return ElementOf({'\\', '|', '*', '(', ')'}); }

fuzztest::Domain<std::string> ArbitraryG(const std::size_t size)
{
    if (size == 0)
    {
        return OneOf(Map([](const char h) { return std::string{"\\"} + h; }, ArbitraryH()),
                     StringOf(AlphaNumericChar()).WithSize(1));
    }
    return OneOf(Map([](std::string&& e) { return '(' + std::move(e) + ')'; }, ArbitraryE(size - 1)),
                 Map([](const char h) { return std::string{"\\"} + h; }, ArbitraryH()),
                 StringOf(AlphaNumericChar()).WithSize(1));
}

fuzztest::Domain<std::string> ArbitraryF(const std::size_t size)
{
    if (size == 0)
    {
        return OneOf(ArbitraryG(size));
    }
    return OneOf(Map([](std::string&& f) { return std::move(f) + '*'; }, ArbitraryF(size - 1)),
                 Map([](std::string&& f) { return std::move(f) + '?'; }, ArbitraryF(size - 1)),
                 ArbitraryG(size));
}

fuzztest::Domain<std::string> ArbitraryT(const std::size_t size)
{
    if (size == 0)
    {
        return OneOf(ArbitraryF(size));
    }
    return OneOf(Map([](std::string&& e1, std::string&& e2) { return std::move(e1) + std::move(e2); },
                     ArbitraryT(size - 1),
                     ArbitraryT(size - 1)),
                 ArbitraryF(size));
}

fuzztest::Domain<std::string> ArbitraryE(const std::size_t size)
{
    if (size == 0)
    {
        return OneOf(ArbitraryT(size));
    }
    return OneOf(Map([](std::string&& e1, std::string&& e2) { return std::move(e1) + '|' + std::move(e2); },
                     ArbitraryE(size - 1),
                     ArbitraryE(size - 1)),
                 ArbitraryT(size));
}
} // namespace strict

FUZZ_TEST(Foo, Bar).WithDomains(strict::ArbitraryE(10));

A downside of this version is that the construction of the domain is "strict", not "lazy": Before fuzzing can even begin fuzzing, just the initial call to strict::ArbitraryE(10) has to recurse 10 levels and the constructed domain has to mirror this tree structure, so it's slow.

Limited Lazy Recursion

We can improve by creating the domains lazily. Here FlatMap comes in handy: Calling FlatMap(f, inner_domain) will not immediately invoke f. Instead during sampling, a value is generated from inner_domain, which is then passed to f to construct the domain to sample from. Thus calling lazy::ArbitraryE(10) below will only recurse during sampling, not during construction, as desired:

namespace lazy
{
fuzztest::Domain<std::string> ArbitraryE(std::size_t size);

fuzztest::Domain<char> ArbitraryH() { return ElementOf({'\\', '|', '*', '(', ')'}); }

fuzztest::Domain<std::string> ArbitraryG(const std::size_t s)
{
    return FlatMap(
        [](const std::size_t size) -> fuzztest::Domain<std::string> {
            if (size == 0)
            {
                return OneOf(Map([](const char h) { return std::string{"\\"} + h; }, ArbitraryH()),
                             StringOf(AlphaNumericChar()).WithSize(1));
            }
            return OneOf(Map([](std::string&& e) { return '(' + std::move(e) + ')'; }, ArbitraryE(size - 1)),
                         Map([](const char h) { return std::string{"\\"} + h; }, ArbitraryH()),
                         StringOf(AlphaNumericChar()).WithSize(1));
        },
        Just(s));
}

fuzztest::Domain<std::string> ArbitraryF(const std::size_t s)
{
    return FlatMap(
        [](const std::size_t size) -> fuzztest::Domain<std::string> {
            if (size == 0)
            {
                return OneOf(ArbitraryG(size));
            }
            return OneOf(Map([](std::string&& f) { return std::move(f) + '*'; }, ArbitraryF(size - 1)),
                         Map([](std::string&& f) { return std::move(f) + '?'; }, ArbitraryF(size - 1)),
                         ArbitraryG(size));
        },
        Just(s));
}

fuzztest::Domain<std::string> ArbitraryT(const std::size_t s)
{
    return FlatMap(
        [](const std::size_t size) -> fuzztest::Domain<std::string> {
            if (size == 0)
            {
                return OneOf(ArbitraryF(size));
            }
            return OneOf(Map([](std::string&& e1, std::string&& e2) { return std::move(e1) + std::move(e2); },
                             ArbitraryT(size - 1),
                             ArbitraryT(size - 1)),
                         ArbitraryF(size));
        },
        Just(s));
}

fuzztest::Domain<std::string> ArbitraryE(const std::size_t s)
{
    return FlatMap(
        [](const std::size_t size) -> fuzztest::Domain<std::string> {
            if (size == 0)
            {
                return OneOf(ArbitraryT(size));
            }
            return OneOf(Map([](std::string&& e1, std::string&& e2) { return std::move(e1) + '|' + std::move(e2); },
                             ArbitraryE(size - 1),
                             ArbitraryE(size - 1)),
                         ArbitraryT(size));
        },
        Just(s));
}
} // namespace lazy

This lazy version is actually usable. Looking at the Runs/secs, tt is however slower than the grammar version. By a factor of 9 in the one comparison at -O2 I did. Of course it depends on the size passed to ArbitraryE. For large values it becomes unusable again. I suspect that fuzztest is somehow biased to recurse deeply.
Even setting size to 1, the fuzzer does not reach the same Runs/secs as with the grammar version.
It may however be the best alternative we have for complex recursive non-string domains 🤷‍♂️

Unlimited Lazy Recursion

This is not an actual suggestion, just a theoretical observation. Since we now have a lazy version, we don't actually need the limit anymore just to avoid infinite recursion. By using FlatMap without an inner domain, we can lazily construct a recursive domain without artificial limit, just as we could with DomainBuilder:

namespace lazy_no_limit
{
fuzztest::Domain<std::string> ArbitraryE();

fuzztest::Domain<char> ArbitraryH() { return ElementOf({'\\', '|', '*', '(', ')'}); }

fuzztest::Domain<std::string> ArbitraryG()
{
    return FlatMap([] {
        return OneOf(Map([](std::string&& e) { return '(' + std::move(e) + ')'; }, ArbitraryE()),
                     Map([](const char h) { return std::string{"\\"} + h; }, ArbitraryH()),
                     StringOf(AlphaNumericChar()).WithSize(1));
    });
}

fuzztest::Domain<std::string> ArbitraryF()
{
    return FlatMap([] {
        return OneOf(Map([](std::string&& f) { return std::move(f) + '*'; }, ArbitraryF()),
                     Map([](std::string&& f) { return std::move(f) + '?'; }, ArbitraryF()),
                     ArbitraryG());
    });
}

fuzztest::Domain<std::string> ArbitraryT()
{
    return FlatMap([] {
        return OneOf(Map([](std::string&& e1, std::string&& e2) { return std::move(e1) + std::move(e2); },
                         ArbitraryT(),
                         ArbitraryT()),
                     ArbitraryF());
    });
}

fuzztest::Domain<std::string> ArbitraryE()
{
    return FlatMap([] {
        return OneOf(Map([](std::string&& e1, std::string&& e2) { return std::move(e1) + '|' + std::move(e2); },
                         ArbitraryE(),
                         ArbitraryE()),
                     ArbitraryT());
    });
}
} // namespace lazy_no_limit

Unfortunately and not unexpected, we also get the same result as with DomainBuilder: A stack overflow 🙁

@racko
Copy link
Author

racko commented Feb 21, 2024

A warning about this has been added: #1007

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant