Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support decompressing fixed huffman blocks #124

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

garymm
Copy link
Owner

@garymm garymm commented Jan 6, 2024

support decompressing fixed huffman blocks

Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe

huffman/src/bit_span.hpp Outdated Show resolved Hide resolved
huffman/src/bit_span.hpp Outdated Show resolved Hide resolved
huffman/src/decode.hpp Outdated Show resolved Hide resolved
huffman/src/decode.hpp Outdated Show resolved Hide resolved
huffman/src/decode.hpp Show resolved Hide resolved
src/decompress.hpp Outdated Show resolved Hide resolved
src/decompress.hpp Outdated Show resolved Hide resolved
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 0386453 to ddaa091 Compare February 11, 2024 23:53
@garymm garymm changed the base branch from Ie061dff47a33fdc8995770c7a5f98129b8788556 to master February 11, 2024 23:53
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch 3 times, most recently from ea4ce7c to 206623b Compare February 12, 2024 15:33
@garymm garymm changed the base branch from master to I745f447fb22a843257d1ae211a130cd39dad4ccc February 12, 2024 15:33
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 206623b to 85dd9ef Compare March 9, 2024 15:10
@garymm garymm changed the title WIP fixed huffman block type support decompressing fixed huffman blocks Mar 9, 2024
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 85dd9ef to ed089c3 Compare March 9, 2024 15:22
@garymm garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from 08cec2e to ae388d3 Compare March 9, 2024 15:22
Copy link

codecov bot commented Mar 9, 2024

Codecov Report

Attention: Patch coverage is 65.54622% with 41 lines in your changes are missing coverage. Please review.

Project coverage is 81.56%. Comparing base (1cd6846) to head (b29aaf3).

Files Patch % Lines
src/decompress.cpp 73.33% 18 Missing and 10 partials ⚠️
huffman/src/decode.hpp 7.14% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
- Coverage   82.40%   81.56%   -0.84%     
==========================================
  Files          16       17       +1     
  Lines         625      754     +129     
  Branches       39       59      +20     
==========================================
+ Hits          515      615     +100     
- Misses         92      113      +21     
- Partials       18       26       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

huffman/src/bit_span.hpp Outdated Show resolved Hide resolved
huffman/src/bit_span.hpp Outdated Show resolved Hide resolved
huffman/src/bit_span.hpp Outdated Show resolved Hide resolved
huffman/src/decode.hpp Show resolved Hide resolved
code_table_pos = code_table.begin();
current_code = code{};
continue;
return {(*found)->symbol, bits_read};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return {(*found)->symbol, bits_read};
return {(*found)->symbol, (*found)->bitsize()};

although I think it's better to change the return type. decode_result is imbuing "error" semantics in values with encoded_size = 0.

I think it's easier to understand when error semantics are explicitly conveyed with a return type such as optional<encoding> or expected<encoding, error_type?>. Although the end-iterator pattern is also common enough that could also work.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah the suggested change is cleaner.
As for returning an optional or expected: I agree generally it's cleaner to signal errors explicitly, but:

  1. I want this code to be fast, so having an extra byte of the return type might actually matter.
  2. The type system allows encoded_size to be zero, but that's meaningless. So the type system is sort of forcing us to consider the possibility. May as well use that possibility to signal errors.

Copy link
Collaborator

@oliverlee oliverlee Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to use optional or expected for better semantics as a first pass.

As a second pass (or possibly in parallel with main development), we can replace std::optional with a compressed optional type that uses the "invalid" bit patterns to track if a value exists or does not.

There's a pretty good talk on this pattern here:
https://www.youtube.com/watch?v=MWBfmmg8-Yo

This appears to be a drop-in replacement:
https://github.com/Sedeniono/tiny-optional

I haven't looked at it much and I'm sure there's others as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But feel free to close if that's something you want to change later.

Comment on lines +78 to +134
std::span<std::byte> dst,
std::ptrdiff_t& dst_written,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::span<std::byte> dst,
std::ptrdiff_t& dst_written,
std::span<std::byte>& dst,

src/decompress.cpp Show resolved Hide resolved
Comment on lines +86 to +143
const auto lit_or_len_decoded = huffman::decode_one(len_table, src_bits);
if (not lit_or_len_decoded.encoded_size) {
return DecompressStatus::InvalidLitOrLen;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if src_bits doesn't have enough bits to decode a symbol - that would be "success" and this function should exit?

If decode_one is changed to return expected<>, maybe something like this?

const auto lit_or_len = huffman::decode_one(len_table, src_bits);
if (not lit_or_len) {
  return status_for(lit_or_len.error());
}

If src_bits doesn't have enough bits, you do have have a partially constructed code and iterator in the code table. You could potentially keep those in order to skip some computation when more bits arrive - although it's certainly simpler to just restart.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking of a chunked / streaming input?
If not, I don't understand why not having enough bits would be success?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm thinking of the chunked/streaming case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding a TODO to handle later.

src/decompress.cpp Outdated Show resolved Hide resolved
src/decompress.cpp Show resolved Hide resolved
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from ed089c3 to 9b8d407 Compare March 10, 2024 00:10
@garymm garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from ae388d3 to cb99492 Compare March 10, 2024 00:10
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 9b8d407 to 75b4fc2 Compare March 10, 2024 00:18
@garymm garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from cb99492 to 8783255 Compare March 10, 2024 00:18
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 75b4fc2 to 1b7d011 Compare March 10, 2024 00:27
@garymm garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch 2 times, most recently from cdd7b97 to 33fc1b4 Compare March 10, 2024 00:35
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 1b7d011 to 2488e0e Compare March 10, 2024 00:35
@garymm garymm marked this pull request as ready for review March 10, 2024 00:36
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from 2488e0e to d3c35c5 Compare March 10, 2024 00:45
@garymm garymm force-pushed the I745f447fb22a843257d1ae211a130cd39dad4ccc branch from 33fc1b4 to e18ab12 Compare March 10, 2024 00:45
Base automatically changed from I745f447fb22a843257d1ae211a130cd39dad4ccc to master March 10, 2024 00:50
Change-Id: I5a30394b46e113595336e89c954e03d3acb120fe
Change-Id: I9322af2674ab583f3cdb286ea06385587bacf670
Change-Id: Ib2a9cccc5df211363b2083b01d46043b329cd025
Change-Id: I5ec6e616b5ec5d33fb221e196fd9976702175bf8
@garymm garymm force-pushed the I5a30394b46e113595336e89c954e03d3acb120fe branch from d3c35c5 to 58f3d2e Compare March 10, 2024 20:03
Change-Id: Ib720fb2b200a72d54bae4246a42524ec9b1444b7
Copy link
Owner Author

@garymm garymm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Let's talk on the phone or in person if there's still confusion or disagreement.

src/decompress.hpp Outdated Show resolved Hide resolved
/// X and Y, a string reference with <length = 5, distance = 2>
/// adds X,Y,X,Y,X to the output stream."
void copy_n(
std::span<const std::byte>::iterator src,
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the signature and I documented the precondition.
I used these types becuse:

  • This function is called in only one place and these types are convenient. I would inline this function but I broke it out so I can test it separately.
  • Isn't iterator arithmetic preferred over pointer arithmetic?

Do you still think there's a reason to use byte*?

src/decompress.cpp Show resolved Hide resolved
return DecompressStatus::DstTooSmall;
}

std::copy_n(src_bits.byte_data(), len, dst.begin());
std::copy_n(src_bits.byte_data(), len, dst.begin() + dst_written);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we advance dst, my guess would be it's UB to do dst.begin() - distance. I need to go backwards. If you have a better suggestion LMK.

huffman/src/bit_span.hpp Outdated Show resolved Hide resolved
huffman/src/decode.hpp Show resolved Hide resolved
code_table_pos = code_table.begin();
current_code = code{};
continue;
return {(*found)->symbol, bits_read};
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah the suggested change is cleaner.
As for returning an optional or expected: I agree generally it's cleaner to signal errors explicitly, but:

  1. I want this code to be fast, so having an extra byte of the return type might actually matter.
  2. The type system allows encoded_size to be zero, but that's meaningless. So the type system is sort of forcing us to consider the possibility. May as well use that possibility to signal errors.

Comment on lines +86 to +143
const auto lit_or_len_decoded = huffman::decode_one(len_table, src_bits);
if (not lit_or_len_decoded.encoded_size) {
return DecompressStatus::InvalidLitOrLen;
}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking of a chunked / streaming input?
If not, I don't understand why not having enough bits would be success?

src/decompress.cpp Show resolved Hide resolved
@@ -58,22 +179,27 @@ auto decompress(std::span<const std::byte> src, std::span<std::byte> dst)
return DecompressStatus::SrcTooSmall;
}

if (dst.size() < len) {
if (dst.size() - static_cast<std::size_t>(dst_written) < len) {
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this comment is irrelevant unless we drpo the prefix of dst every time as suggested below.

Change-Id: I5c8a4f4c56f25ebe37a21caea97312309e19729c
Comment on lines +95 to +98
std::uint16_t len{};
if (lit_or_len == detail::lit_or_len_max) {
len = detail::lit_or_len_max_decoded;
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you can just return instead of construction + assignment of len?

if (lit_or_len == detail::lit_or_len_max) {
    return detail::lit_or_len_max_decoded;
}

break;
}
if (lit_or_len > detail::lit_or_len_max) {
const auto maybe_lit_or_len =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better to me.

I think this could still benefit from breaking out smaller functions - although if you think that's too much work, feel free to ignore.

  const auto done = 
    parse_lit_or_len(lit_or_len_decoded.symbol, src_bits)
      .and_then(decode_lit_or_len_into(dst))
      .or_else(convert_error_status)
      
   if (done) {
     return *done;
   }

   // done is empty, so the the loop continues

where convert_error_status: ParseLitOrLenStatus -> optional<DecompressStatus>

return {in_place, e == EndOfBlock ? Success : InvalidLitOrLen};

and decode_lit_or_len_into uses the poor version of pattern matching available in C++:

auto decode_lit_or_len_into(std::span<std::byte>& dst)
{
  return [&dst](variant<std::byte, std::uint16_t> lit_or_len) {
    return visit(
      overloaded({
        [&dst](byte lit) { return decode_into(lit, dst); },
        [&dst](uint16_t len) { return decode_into(lit, dst); }
      }),
      lit_or_len
    );
  };
}

// https://en.cppreference.com/w/cpp/utility/variant/visit
template <class... Ts>
struct overloaded : Ts... { using Ts::operator()...; };

and

auto decode_into(byte lit, span& dst)
  -> optional<DecompressStatus>
{
  // lines 133-137
}
auto decode_into(uint16_t len, span& dst)
  -> optional<DecompressStatus>
{
 // lines 139-161
}

although that's a bit of a simplification - you'll also need to capture dst_written.

The different cases are broken out into different functions, although there's some logic inversion with variant::visit.

Copy link
Collaborator

@oliverlee oliverlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions on refactoring the parsing function - I think a destination_buffer class could simplify that a bit as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants