
AddressSanitizer: allocator is out of memory #3952

Closed
tianmai1 opened this issue Aug 21, 2023 · 18 comments · Fixed by #3954

Comments

@tianmai1

version: 2.4.12.0
./bin/iconvert crash_000122 1.png

=================================================================
==121851==ERROR: AddressSanitizer: allocator is out of memory trying to allocate 0x3b7f84804 bytes
    #0 0x4c9e6d in operator new[](unsigned long) (/root/oiio/build/bin/iconvert+0x4c9e6d)
    #1 0x7f22ca2b4b9d in OpenImageIO_v2_4::ImageOutput::copy_image(OpenImageIO_v2_4::ImageInput*) /root/oiio/src/libOpenImageIO/imageoutput.cpp:587:36
    #2 0x4d169b in convert_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /root/oiio/src/iconvert/iconvert.cpp:449:27
    #3 0x4cd218 in main /root/oiio/src/iconvert/iconvert.cpp:523:14
    #4 0x7f22c87f2082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

==121851==HINT: if you don't care about these errors you may set allocator_may_return_null=1
SUMMARY: AddressSanitizer: out-of-memory (/root/oiio/build/bin/iconvert+0x4c9e6d) in operator new[](unsigned long)
==121851==ABORTING

poc:
crash_000122.zip

@ThiagoIze
Collaborator

It's out of memory when trying to allocate 16GB. Could it just be that your machine didn't have enough free RAM?

@lgritz
Collaborator

lgritz commented Aug 21, 2023

But I think the point is that it only looks like a 16GB image because it's a corrupted file with an absurd resolution.

There is an int `limits:imagesize_MB` attribute that was added recently -- a sanity check on image size, where images that purport to be larger than this are presumed to be corrupted or malicious. For apps that actually expect images larger than this, you can, of course, set it as large as you want. The default is 32GB.
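
For reference, a minimal sketch of how an app might raise that limit via OIIO's global `attribute()` call (the 64GB value here is only an example, not a recommendation):

```cpp
#include <OpenImageIO/imageio.h>

int main()
{
    // Raise the image-size sanity limit to 64 GB for an app that
    // legitimately expects enormous images (the value is in MB).
    OIIO::attribute("limits:imagesize_MB", 64 * 1024);
}
```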

Question 1:

Should it be a lower default? It was hard to know the tradeoff between catching allocation errors from corrupt files versus "false positive" cases of rejecting legitimately huge images. But maybe the 32GB default (for a single image, remember) is still drastically too big, considering how commonly people run on VMs with lower memory limits than this. For comparison, a 32k x 32k RGBA uint16 image is 8GB uncompressed.

Question 2:

This limit attribute is informational only; it's up to each format reader to check the prospective allocation size against it. PNG currently does not! So that should be fixed -- but note that even if we fixed it today, it would still happily let this allocation happen if the default is 32GB.

Crazy idea: What if, instead of defaulting this limit to a flat 32GB, we set it to min(32GB, physical_memory/2)? The advantage is that when running (or doing CI, or fuzz-testing) on a low-memory machine, it will still catch what are likely to be errors. The disadvantage is that the default of the attribute might depend on which machine you are running on. But then again, it's still enormous for a single image, and sophisticated apps that might really deal with giga-images can always proactively set the attribute to a size they are comfortable with.
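
A rough sketch of what that default computation might look like, assuming `Sysutil::physical_memory()` is used to query the machine size (where exactly this would live in the codebase is left open):

```cpp
#include <OpenImageIO/sysutil.h>
#include <algorithm>
#include <cstdint>

// Sketch of the proposed default: min(32 GB, physical_memory / 2),
// expressed in MB to match the limits:imagesize_MB attribute.
int default_imagesize_limit_MB()
{
    const int64_t hard_cap_MB = 32 * 1024;  // 32 GB
    int64_t physmem_MB = int64_t(OIIO::Sysutil::physical_memory()) / (1024 * 1024);
    return int(std::min(hard_cap_MB, physmem_MB / 2));
}
```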

@ThiagoIze
Collaborator

A sanity check that gives a clean early out is nice, especially when the machine does have TBs of RAM but we don't want to wait hours for it to try to process a corrupt image.

But I don't think limits:imagesize_MB will solve the fundamental issue raised here, which is that we need to check the result of all of our thousands of new/malloc calls to see if they succeeded and then properly handle those failures. That's a lot of work and clutters the code, so I don't know if it's worth it. The reason I think this won't help with the fundamental issue is that whatever value you pick, it's still possible to fail. Maybe the user has 32GB of RAM, so physical_memory/2 returns 16GB and this 16GB allocation request is allowed to happen. But if free memory is only 32MB at that point, the allocation will of course fail.

Maybe we need to be strategic: find the few places where we are likely to allocate GBs of data (anywhere that allocates all the pixels?) and check whether the allocation succeeded?
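
For those few whole-image allocations, the guard could be as simple as catching bad_alloc at the call site; a minimal sketch of the shape of such a check (not OIIO's actual code, names are illustrative):

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <string>

// Sketch: guard a whole-image buffer allocation instead of letting a
// failed new[] take the process down far from the call site.
std::unique_ptr<unsigned char[]> alloc_pixels(size_t bytes, std::string& err)
{
    try {
        return std::unique_ptr<unsigned char[]>(new unsigned char[bytes]);
    } catch (const std::bad_alloc&) {
        err = "Could not allocate " + std::to_string(bytes) + " bytes for image pixels";
        return nullptr;
    }
}
```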

@lgritz
Collaborator

lgritz commented Aug 21, 2023

May be relevant: https://stackoverflow.com/questions/18684951/how-and-why-an-allocation-memory-can-fail

I think the intent of limits:imagesize_MB is actually not to catch or recover from bad allocations (which is both difficult to catch everywhere in the code base and, as noted in the link above, may not work the way you think on modern systems).

Rather, it's intended to catch corrupted or malicious files that have implausible resolutions. Like you said, even if a fishy allocation "succeeds", it may be so large that it's effectively a DOS attack. Ideally, every reader should do this check. In practice, I implemented it for TIFF (adding limits:imagesize_MB in the process) but only fixed the one error I was facing at the time (very similar to this one -- the result of fuzz testing TIFF files) and never quite got around to revisiting all the other readers.
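
The per-reader check amounts to comparing the uncompressed size implied by the file header against the attribute before allocating anything. A hedged sketch of the general shape (not the actual TIFF reader code):

```cpp
#include <OpenImageIO/imageio.h>
#include <string>

// Sketch: reject implausible resolutions before any pixel allocation.
// 'spec' is the ImageSpec a reader decoded from the file header.
bool plausible_image_size(const OIIO::ImageSpec& spec, std::string& err)
{
    const int limit_MB = OIIO::get_int_attribute("limits:imagesize_MB");
    const OIIO::imagesize_t size_MB = spec.image_bytes() / (1024 * 1024);
    if (limit_MB > 0 && size_MB > OIIO::imagesize_t(limit_MB)) {
        err = "Uncompressed image size exceeds limits:imagesize_MB; presumed corrupt";
        return false;
    }
    return true;
}
```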

@lgritz
Collaborator

lgritz commented Aug 21, 2023

Side note: the POC file here is 21 bytes, so this is not in fact a large image; it is a corrupted/malicious header masquerading as an image file, and the resolution it describes is implausibly large.

@jessey-git
Contributor

Yeah, I've always interpreted that limit as just a safety net for "zip bombs" (a small file, 21 bytes here, expanding out to several GB of memory): https://en.wikipedia.org/wiki/Zip_bomb

Writing software that properly handles new returning null or throwing bad_alloc is... rarely attempted, let alone done correctly. In this particular case I'd be more concerned with whether the operation that triggered the memory allocation eventually fails gracefully when it realizes there's not really 16GB of data to read :)

@lgritz
Collaborator

lgritz commented Aug 21, 2023

On my laptop, with plenty of memory for a 16GB allocation, it still fails very quickly with:

$ iconvert crash_000122 1.png
iconvert ERROR copying "crash_000122" to "1.png" :
	Read error: hit end of file

So at least it does detect in other ways that the file is bad before getting into too much trouble... unless the allocation fails.

It's hard to tell from this error message exactly which reader caused the problem (there is no file extension, and the file isn't valid anyway, so it will try them all). It turns out the bit pattern of the file looks like a Targa file, and that's the reader that hits the EOF while reading the pixels.

@lgritz
Collaborator

lgritz commented Aug 21, 2023

That error message I get is not a crash, by the way. It's a graceful exit with an error message.
The crash only happens if we attempt the unacceptable allocation.

@lgritz
Collaborator

lgritz commented Aug 21, 2023

Notice how the stack trace from the address sanitizer says it failed INSIDE operator new. Thus the application (a) never got back a null pointer, and (b) never had a bad_alloc exception to catch. It just dies inside new, so there is nothing we can do in our code to handle the problem if the allocation fails, because control never returns to us. We have to detect that the allocation size is implausible beforehand and just say no.

Back to my suggestions:

I feel like lowering the default is an independent question -- it's valid to ask whether 32GB is too high a default, but at the same time, if we lowered it to 16GB, there certainly exists a VM somewhere where even that can fail. (We have to ask ourselves, though, whether we care.)

I still kinda like my suggestion of lowering the default to at most some fraction of physical memory size, to try to catch this regardless of how constrained the system is.

@lgritz
Collaborator

lgritz commented Aug 21, 2023

I have a patch to suggest, coming shortly.

@lgritz
Collaborator

lgritz commented Aug 22, 2023

Proposed fix: #3954

@ThiagoIze
Collaborator

That error message is a graceful exit when running under the address sanitizer, because the sanitizer has instrumented the code to look for a nullptr returned from new and added code to exit gracefully. The real OIIO code would likely have crashed once it tried to access that null pointer, unless it's already handling this case.

I still think there's nothing we can really do for this particular issue short of adding null checks after the allocation, which sometimes will help and other times won't, for the reasons lg linked to already. Machines with enough memory, like lg's, will be able to do the allocation. Machines without enough memory will crash. Tweaking limits:imagesize_MB can maybe help in some cases, but it would be trivial to create a new file that still crashes in certain situations. What's more, I'm worried about users with, say, 4GB of RAM who will no longer be able to operate on legitimately sized textures even though they have the swap space to do it.

@lgritz
Collaborator

lgritz commented Aug 22, 2023

As it turns out, it wouldn't have crashed outside the sanitizer (I think???), because I found the exact site of the new call that fails, and it is in fact surrounded by a try/catch block to detect a failed allocation.

You are right, the physical_memory/2 trick is a little hokey and can backfire on people with 4GB of RAM, as you say. But how often is that really the case -- the combination of a huge image, tiny memory, and a user/app that doesn't proactively raise the attribute to reflect the size of images they expect?

If you or others don't think that changing the default based on the physical memory is a good idea, I can back out that aspect of the PR.

But I'm not sure what the best thing to do in this case is.

I hear what you're saying about the crash in new being an artifact of the address sanitizer. The dilemma, though, is that there are people out there running fuzz tests against the software with the sanitizer, and they will report these crashes here (or file CVEs) when these failures occur, unaware that they are seeing a crash that can only happen under the address sanitizer. Yet each one will still require us to track it down and verify that, without the sanitizer in place, we correctly catch the failed allocation.

So I still want a way, even in the sanitizer case, to find a likely-to-be-mistaken image size ahead of the allocation in order to stem the tide of these pseudo-bugs being reported.

@ThiagoIze
Collaborator

I can imagine the user app being oiiotool or maketx. Do those tools proactively raise the attribute? If it requires a special flag, that's extra complexity, and chances are a user who doesn't have huge amounts of RAM is probably not a power user and so will have an extra hard time knowing how to pass the appropriate flags to overcome this. Doubly so if these tools are being wrapped by a script/GUI.

As for user/machine, it's not that rare for non-professional users, such as students, to have a modest machine. Until recently, 8GB of RAM was the default for most Mac laptops, and half of that is 4GB. I'll admit a 4GB image is still pretty big and rare, so perhaps the majority of users would still be fine with this limitation. So I won't fight the divide-by-2 proposal if you think it's needed.

For the fuzz tests, it would not surprise me if eventually we get people running them on VMs that have a few GB of RAM, no swap, and much less than half of their RAM still free. They'll report issues that none of us can repro on our production hardware, and we'll spend extra time trying to figure out what is going on. I do like your suggestion in the PR of checking whether the address sanitizer is running. I suggest, though, getting very aggressive: don't divide by 2, divide by 8. You can detect the sanitizer with __has_feature(address_sanitizer) as explained in https://clang.llvm.org/docs/AddressSanitizer.html#conditional-compilation-with-has-feature-address-sanitizer.
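
For reference, the conditional-compilation pattern from those Clang docs looks roughly like the sketch below (the macro name is just for illustration; GCC spells the equivalent check __SANITIZE_ADDRESS__):

```cpp
// Detect whether this translation unit is built with AddressSanitizer.
#if defined(__SANITIZE_ADDRESS__)        // GCC's predefined macro
#  define BUILT_WITH_ASAN 1              // hypothetical macro name
#elif defined(__has_feature)             // Clang-style feature check
#  if __has_feature(address_sanitizer)
#    define BUILT_WITH_ASAN 1
#  endif
#endif
#ifndef BUILT_WITH_ASAN
#  define BUILT_WITH_ASAN 0
#endif
```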

@lgritz
Collaborator

lgritz commented Aug 22, 2023

I was also thinking "don't divide by 2", but in the other direction: just use the physical memory size itself. The rationale is that it gives more tolerance for users with small-memory computers, and it still marks an important boundary -- if they try to read images larger than this into memory, at the very least they will see swapping and a huge reduction in performance.

The error message itself SAYS why it's rejecting the file and names the attribute to set to make the error go away if they really do intend to have files that large.

I'd like to think (but could obviously be wrong) that the combo of { small machine, not a power user } is fairly unlikely to involve multi-GB images that aren't mistakes. Remember, for reference, that a 16k x 16k x RGBA x half image is still only 2 GB.

The GitHub runners have 8GB VMs, BTW.

So I think the choices we have before us are:

  1. Fixed limit, but lower than the current default of 32 GB. I fear that small VMs or old laptops may still fail these fuzz tests unless we lower it so far that users with legit big images start triggering the warning.

  2. Fixed limit, but go much lower (maybe 4GB?) when the address sanitizer is detected. This will gracefully error/fail for a lot of these corrupted files when being fuzz tested, but we are taking it on faith that when not using the sanitizer, we will catch the corrupt files in another way.

  3. Limit of min(fixed_default, fraction * physical_memory), so that low-memory situations (not just the sanitizer) catch both sure-to-fail allocations and will-succeed-but-surely-will-swap bad performance cases. (The value of fraction is up for debate, but I'd be OK with either 0.5 or 1.0.) Remember that the error message itself says which attribute to alter if this is a false positive.

  4. Like (3), but only cut down the value with a physical memory limit in the sanitizer case.

  5. Some other solution that hasn't occurred to us.

Allocations that "succeed" but lead to terrible performance aren't really desirable anyway. I still think there's merit in tying the default sanity-check size to physical memory somehow, but maybe 0.5 was too aggressive?

@ThiagoIze
Collaborator

Option 2 might prevent useful fuzzing tests where we want to detect bugs when exceeding 32-bit counters; if the machine has enough memory, they should be allowed to test that. So options 3 or 4 are the ones I vote for, since they avoid the bulk of the out-of-memory cases while still letting those with lots of RAM use OIIO without noticeable limitations.

For 4, I would use the 0.5 multiplier or smaller. For 3 (real users) I would maybe suggest 1.0 so that we're a little bit more lenient.

@lgritz
Collaborator

lgritz commented Aug 22, 2023

I'm currently leaning toward 3, but raising the fraction to 1, so that the only cases where we might have a false positive for small-memory users are ones where they will end up swapping anyway. We can always revisit if this is causing problems for people.

@lgritz
Collaborator

lgritz commented Aug 22, 2023

I'm making that change and will update the PR shortly.

Just to reiterate a very subtle point: just because a full uncompressed image is larger than available memory doesn't necessarily mean that it couldn't be processed in certain ways by OIIO (consider the case of carefully reading, processing, and writing one scanline at a time, so the working set can be tiny). This limit is not "how big an image is too big to process," but rather "how big is implausible for a 'real' image rather than a corrupted one." There is no correct answer to that question; it's a heuristic we are using to catch many corrupt images while having very few false positives. And we're assuming that users are unlikely to encounter legit images that exceed the physical memory of the machine they have chosen for the task, unless they know they are doing something unusual and raise the limit.
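
To make the scanline-at-a-time point concrete, a minimal sketch using the public ImageInput/ImageOutput API (scanline-oriented formats assumed, error handling abbreviated):

```cpp
#include <OpenImageIO/imageio.h>
#include <string>
#include <vector>
using namespace OIIO;

// Copy an image one scanline at a time; the working set is a single
// scanline no matter how large the whole image is.
bool copy_scanlines(const std::string& inname, const std::string& outname)
{
    auto in = ImageInput::open(inname);
    if (!in)
        return false;
    const ImageSpec& spec = in->spec();
    auto out = ImageOutput::create(outname);
    if (!out || !out->open(outname, spec))
        return false;
    std::vector<unsigned char> scanline(spec.scanline_bytes());
    for (int y = spec.y; y < spec.y + spec.height; ++y) {
        if (!in->read_scanline(y, 0, spec.format, scanline.data())
            || !out->write_scanline(y, 0, spec.format, scanline.data()))
            return false;
    }
    return in->close() && out->close();
}
```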

lgritz added a commit that referenced this issue Aug 23, 2023
Fixes #3952

Several interrelated fixes to guard against corrupted files:

* Extend the `limits:imagesize_MB` sanity check to Targa files to
address #3952 specifically, and check that failing example file into
the testsuite. (N.B.: Only a couple of readers use this, TIFF and now
Targa. It should be extended to the others. Mackerel!)

* Improve the error message used for the above, both for Targa and TIFF.

* For low-memory situations, perhaps the default of 32GB is too large.
Make the default be the minimum of 32GB and 1/2 the size of physical
memory on the machine. (An app that legit expects humongous files is
expected to raise the limit appropriately.)

* The generic "Read error: hit end of file" we sometimes issued from
`ImageInput::ioread()` made it hard to discern which reader the error
came from, in cases where the file extension was absent or misleading
and a succession of readers was tried. Changing this error message to
say which format reader triggered the error should make such errors
easier to track down in the future. This also changed the reference
output of a few tests involving truncated or corrupted files.

---------

Signed-off-by: Larry Gritz <[email protected]>
lgritz added a commit to lgritz/OpenImageIO that referenced this issue Aug 25, 2023