GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41821

rok · 2024-05-24T21:44:56Z

Rationale for this change

Metadata written into _metadata file appears to not be encrypted.

What changes are included in this PR?

This adds a code path to encrypt _metadata file and a test.

Are these changes tested?

Yes

Are there any user-facing changes?

This adds user facing encryption_properties parameter to pyarrow.parquet.write_metadata.

GitHub Issue: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41719

github-actions · 2024-05-24T21:45:22Z

⚠️ GitHub issue #41719 has been automatically assigned in GitHub to PR creator.

rok · 2024-06-02T14:16:20Z

@AudriusButkevicius could you check if the read metadata contains expected properties?

AudriusButkevicius · 2024-06-02T15:57:43Z

Hey, thanks for working on this, looks great, however seems that something is missing.

You can read the metadata file, but seems no files are actually read when reading the dataset via the metadata file, so the table ends up empty (but with the right schema).

(arrow) root@base ~/code/arrow # python testx.py
/tmp/tmp8qr3xv_t
Writing
pyarrow.Table
col1: int64
col2: int64
year: int64
----
col1: [[1,2,3]]
col2: [[1,2,3]]
year: [[2020,2020,2021]]
write done


Reading
pyarrow.Table
col1: int64
col2: int64
year: int16
----
col1: []
col2: []
year: []

Code that reproduces the issue

Test code

import os
import tempfile

import pyarrow.parquet.encryption as pe
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow as pa
import base64
import polars as pl


class KmsClient(pe.KmsClient):
    def unwrap_key(self, wrapped_key, master_key_identifier):
        return base64.b64decode(wrapped_key)

    def wrap_key(self, key_bytes, master_key_identifier):
        return base64.b64encode(key_bytes)


def write(location):
    cf = pe.CryptoFactory(lambda *a, **k: KmsClient())
    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": [1, 2, 3],
        "year": [2020, 2020, 2021]
    })
    ecfg = pe.EncryptionConfiguration(
        footer_key="TEST",
        column_keys={
            "TEST": ["col2"]
        },
        double_wrapping=False,
        plaintext_footer=False,
    )
    table = df.to_arrow()
    print("Writing")
    print(table)

    parquet_encryption_cfg = ds.ParquetEncryptionConfig(
        cf, pe.KmsConnectionConfig(), ecfg
    )

    metadata_collector = []

    pq.write_to_dataset(
        table,
        location,
        partitioning=ds.partitioning(
            schema=pa.schema([
                pa.field("year", pa.int16())
            ]),
            flavor="hive"
        ),
        encryption_config=parquet_encryption_cfg,
        metadata_collector=metadata_collector
    )

    encryption_properties = cf.file_encryption_properties(pe.KmsConnectionConfig(), ecfg)

    pq.write_metadata(
        pa.schema(
            field
            for field in table.schema
            if field.name != "year"
        ),
        os.path.join(location, "_metadata"),
        metadata_collector,
        encryption_properties=encryption_properties,
    )
    print("write done")


def read(location):
    decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
    kms_connection_config = pe.KmsConnectionConfig()
    cf = pe.CryptoFactory(lambda *a, **k: KmsClient())
    parquet_decryption_cfg = ds.ParquetDecryptionConfig(
        cf, kms_connection_config, decryption_config
    )

    decryption_properties = cf.file_decryption_properties(
        kms_connection_config, decryption_config)
    pq_scan_opts = ds.ParquetFragmentScanOptions(
        decryption_config=parquet_decryption_cfg,
        # If using build from master
        decryption_properties=decryption_properties
    )
    pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)

    dataset = ds.parquet_dataset(
        os.path.join(location, "_metadata"),
        format=pformat,
        partitioning=ds.partitioning(
            schema=pa.schema([
                pa.field("year", pa.int16())
            ]),
            flavor="hive"
        )
    )
    print("Reading")
    print(dataset.to_table())

if __name__ == '__main__':
    location = tempfile.mkdtemp(suffix=None, prefix=None, dir=None)
    print(location)
    os.makedirs(location, exist_ok=True)
    write(location)
    print("\n")
    read(location)

AudriusButkevicius · 2024-06-02T16:04:22Z

Seems that dataset.get_fragments() doesn't return anything.

rok · 2024-06-02T16:11:57Z

Seems that dataset.get_fragments() doesn't return anything.

I think c++ logic as-is doesn't store row groups, let me take a look and get back.

AudriusButkevicius · 2024-06-02T16:15:38Z

I think the whole premise of _metadata files (not _common_metadata) is to store row group details as well as paths, so when you perform a read via the _metadata file, it knows exactly which files and row groups to read without having to open every file in the dataset. At least this is what happens when encryption is disabled.

wgtmac · 2024-06-03T02:16:04Z

Although I understand the intention of this issue and the corresponding fix, I don't think the design of parquet encryption has included the _metadata summary file because it may point to several different encrypted parquet files. It would be great if @ggershinsky could advise to see if there is any defection in this use case.

wgtmac · 2024-06-03T02:20:14Z

BTW, _metadata summary file is something that we strive to deprecate in the new parquet v3 design. If you have any concrete use case, please let us know to see if we can improve. @AudriusButkevicius

rok · 2024-06-03T03:23:35Z

I don't think the design of parquet encryption has included the _metadata summary file because it may point to several different encrypted parquet files.

Why would this be an issue? Because these files might be encrypted with different keys?

wgtmac · 2024-06-03T03:34:57Z

Yes, these files may have different master keys, which are unable to be referenced by a single footer IMHO.

rok · 2024-06-03T03:41:09Z

But the files can have the same key and decryption would error if not? That sounds ok to me.

rok · 2024-06-03T03:45:32Z

Python error is thrown here, but I'm not sure why.

AudriusButkevicius · 2024-06-03T06:55:06Z

My intended use of this is to reduce strain on the filesystem when reading large (many files) datasets from a network attached filesystem, by reading the metadata file instead of many separate files.

I also have a hard requirement for encryption sadly as the data is sensitive.

It would be amazing if this worked with encrypted datasets assuming the key is the same.

I would also be ok storing the metadata in plaintext, perform fragment filtering based on row-group stats, and then re-read and decrypt footers of the chosen files. Obviously this is ok for my usecase but generally might not be ok.

AudriusButkevicius · 2024-06-03T07:04:50Z

I didn't get any error when reading, seems that it just returns no data.

wgtmac · 2024-06-03T07:20:05Z

For the record: the community is inclined to deprecate ColumnChunk.file_path: https://lists.apache.org/thread/qprmj710yq92ybyt89m5xgtqyz3o3st2 and https://github.com/apache/parquet-format/pull/242/files#r1603234502

I'm not sure if we want to support and maintain this. cc @emkornfield @pitrou @mapleFU

AudriusButkevicius · 2024-06-03T07:55:58Z

Left my 2c there. Explained why I would be sad if that happened, and would probably have to re-implement the same feature.

rok · 2024-06-03T10:49:47Z

@wgtmac I'm not sure I follow. We already have WriteMetaDataFile to produce _metadata file and if we just add WriteEncryptedMetadataFile to produce encrypted _metadata files we're not really adding additional additional complexity (outside of WriteEncryptedMetadataFile) or am I missing something? :)

ggershinsky · 2024-06-03T12:43:05Z

Although I understand the intention of this issue and the corresponding fix, I don't think the design of parquet encryption has included the _metadata summary file because it may point to several different encrypted parquet files. It would be great if @ggershinsky could advise to see if there is any defection in this use case.

Yep, we haven't worked on supporting this (basically, there was no requirement; seemed heading towards deprecation).
In general, using different encryption keys for different data files is considered to be a good security practice (mainly because there is a limit on number of crypto operations with one key; also, the key leak scope is smaller) - that's why we generate a fresh key for each parquet file in most of the APIs (Parquet, Arrow, Spark, Iceberg, etc). However, there are obviously some low-level parquet APIs that will allow to pass the same key to many files - if used carefully (making sure, somehow, not to exceed the limit), this might be ok in some cases. The limit is hight (~1 billion pages, somethings like 10TB-1PB of data), but if exceeded, the cipher breaks and the data can be decrypted.
Another option could be to create a separate key for the _metadata summary file, and manage it separately from the data file keys.

pitrou · 2024-06-03T14:51:27Z

mainly because there is a limit on number of crypto operations with one key

What is the theoretical limit, assuming a 256-bit AES key?

Also, if column key encryption is used, wouldn't the limit basically become irrelevant?

rok · 2024-08-06T22:01:41Z

cpp/src/parquet/file_writer.cc

+  if (file_encryption_properties->encrypted_footer()) {
+    PARQUET_THROW_NOT_OK(sink->Write(kParquetEMagic, 4));
+
+    PARQUET_ASSIGN_OR_THROW(int64_t position, sink->Tell());
+    auto metadata_start = static_cast<uint64_t>(position);
+
+    auto writer_props = parquet::WriterProperties::Builder()
+                            .encryption(file_encryption_properties)
+                            ->build();
+    auto builder = FileMetaDataBuilder::Make(metadata.schema(), writer_props);
+
+    auto footer_metadata = builder->Finish(metadata.key_value_metadata());
+    auto crypto_metadata = builder->GetCryptoMetaData();
+    WriteFileCryptoMetaData(*crypto_metadata, sink.get());
+
+    auto footer_encryptor = file_encryptor->GetFooterEncryptor();
+    WriteEncryptedFileMetadata(metadata, sink.get(), footer_encryptor, true);
+    PARQUET_ASSIGN_OR_THROW(position, sink->Tell());
+    auto footer_and_crypto_len = static_cast<uint32_t>(position - metadata_start);
+    PARQUET_THROW_NOT_OK(
+        sink->Write(reinterpret_cast<uint8_t*>(&footer_and_crypto_len), 4));
+    PARQUET_THROW_NOT_OK(sink->Write(kParquetEMagic, 4));
+  } else {


I'm doing metadata encryption similarly to approach here:

arrow/cpp/src/parquet/file_writer.cc

Lines 416 to 432 in a4d58e0

if (file_encryption_properties->encrypted_footer()) {

// encrypted footer

file_metadata_ = metadata_->Finish(key_value_metadata_);

PARQUET_ASSIGN_OR_THROW(int64_t position, sink_->Tell());

uint64_t metadata_start = static_cast<uint64_t>(position);

auto crypto_metadata = metadata_->GetCryptoMetaData();

WriteFileCryptoMetaData(*crypto_metadata, sink_.get());

auto footer_encryptor = file_encryptor_->GetFooterEncryptor();

WriteEncryptedFileMetadata(*file_metadata_, sink_.get(), footer_encryptor, true);

PARQUET_ASSIGN_OR_THROW(position, sink_->Tell());

uint32_t footer_and_crypto_len = static_cast<uint32_t>(position - metadata_start);

PARQUET_THROW_NOT_OK(

sink_->Write(reinterpret_cast<uint8_t*>(&footer_and_crypto_len), 4));

PARQUET_THROW_NOT_OK(sink_->Write(kParquetEMagic, 4));

} else { // Encrypted file with plaintext footer

However it seems decryption fails (see below) when using RowGroup metadata (after deserialization and decryption).

arrow/cpp/src/arrow/dataset/file_parquet.cc

Lines 1084 to 1086 in 4d5041a

ARROW_ASSIGN_OR_RAISE(auto path,

FileFromRowGroup(filesystem.get(), base_path, *row_group,

options.validate_column_chunk_paths));

This makes me think I'm either not serializing correctly or there's an issue with encryption/decryption properties I'm supplying.

@wgtmac @pitrou does anything obviously wrong stands out here?

cc @mapleFU

What error did you get? I suspect it has some issues with row group ordinal. Please search "row group ordinal" from https://github.com/apache/parquet-format/blob/master/Encryption.md

The actual error is thrown here:

arrow/cpp/src/arrow/dataset/file_parquet.cc

Line 1009 in 4d5041a

auto path = row_group.ColumnChunk(0)->file_path();

in debugger I see row_group.ColumnChunk(0) as null and row_group.ColumnChunk(0)->file_path() fails.

The error message I'm getting is:

unknown file: Failure C++ exception with description "Failed decryption finalization" thrown in the test body.

For the test FileMetadata object this reads several files, extracts their metadata and merges them (metadata->AppendRowGroups(*metadata_vector[1])).

We could solve this by storing file aad prefixes into key_value_metadata so they can be used at read time to decode row group metadata. This seems to work (see cpp/src/parquet/metadata.cc), but reading actual data with scanner->ToTable() is somewhat less straightforward. but seems doable. Before proceeding, I'd like to ask here:

does this approach make sense?

what would a good location be for injecting aad prefixes be for reading column data?

cc @wgtmac @ggershinsky

The format has a field for aad prefixes, https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1126 .

It is file-wide, of course. So if parquet files had different aad prefixes, a new one has to be used for the merged metadata.

Per my comment above, I think the only way to make this work is to fully decrypt the footers of the parquet files (using their encryption params, inc keys and aad_prefixes), merge them, and then encrypt the result with a new key/aad_prefix

rok · 2024-10-08T23:29:07Z

cpp/src/parquet/metadata.cc

+  for (size_t i = 0; i < metadata_list.size(); i++) {
+    const auto& file_metadata = metadata_list[i];
+    keys.push_back("row_group_aad_" + std::to_string(i));
+    values.push_back(file_metadata->file_aad());
+    if (i > 0) {
+      metadata_list[0]->AppendRowGroups(*file_metadata);
+    }
+  }
+
+  // Create a new FileMetadata object with the created AADs as key_value_metadata.
+  auto fmd_builder =
+      parquet::FileMetaDataBuilder::Make(metadata_list[0]->schema(), writer_props);
+  const std::shared_ptr<const KeyValueMetadata> file_aad_metadata =
+      ::arrow::key_value_metadata(keys, values);
+  auto metadata = fmd_builder->Finish(file_aad_metadata);
+  metadata->AppendRowGroups(*metadata_list[0]);


@ggershinsky As proposed I now decrypt all footers and then coalesce them into a single footer. As decrypting data files at read times requires AADs I also store those into key_value_metadata with row_group_aad_{i} keys. Does this seem reasonable design?

@rok sorry for the delay, I've been away for a while.

store those into key_value_metadata with row_group_aad_{i} keys.

Do we need to store the aad_prefixes ? Once a footer of a parquet file is decrypted, the file key and aad_prefix can be dropped. Aad_prefix is a user-provided unique ID of a file. So we can(*) generate a new one for the new _metadata file that keeps the coalesced footer. Also, it'd be good to generate a new key. Then the coalesced footer can be encrypted in the _metadata file.

(*) it's possible to write/encrypt the _metadata file without a new aad_prefix, if the user app level doesn't check the file id. You can simply pass a null pointer.

rok · 2024-10-09T21:43:51Z

cpp/src/parquet/metadata.cc

+    if (key_value_metadata_ && key_value_metadata_->Contains("row_group_aad_0")) {
+      PARQUET_ASSIGN_OR_THROW(
+          auto aad,
+          key_value_metadata_->Get("row_group_aad_" + std::to_string(row_groups[0])));
+      out->set_file_decryptor_aad(aad);
+    }


@pitrou regarding your suggestion to avoid storing AADs as key_value_metadata entries and instead read them from files at dataset scan time - do you have a suggestion on how/where to detect we'll no longer be reading _metadata file but rather data files? If we can do that we can change the decryptor's AAD there and read the desired data file.

github-actions bot added Component: Parquet Component: C++ Component: Python awaiting committer review Awaiting committer review labels May 24, 2024

rok force-pushed the gh-41719_fix__metadata_file branch from 7be8f5e to 26cfd63 Compare June 2, 2024 14:12

rok marked this pull request as ready for review June 2, 2024 14:13

rok requested a review from wgtmac as a code owner June 2, 2024 14:13

rok requested review from pitrou and removed request for wgtmac June 2, 2024 14:14

rok added this to the 17.0.0 milestone Jun 2, 2024

rok force-pushed the gh-41719_fix__metadata_file branch from 26cfd63 to e0c9418 Compare June 3, 2024 03:19

rok force-pushed the gh-41719_fix__metadata_file branch from cd5bf41 to a4d58e0 Compare August 6, 2024 21:31

rok commented Aug 6, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 6, 2024

rok force-pushed the gh-41719_fix__metadata_file branch from a4d58e0 to fa448cf Compare August 16, 2024 13:58

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 16, 2024

rok force-pushed the gh-41719_fix__metadata_file branch from fa448cf to 4348123 Compare August 16, 2024 19:04

rok force-pushed the gh-41719_fix__metadata_file branch from 4348123 to f70a009 Compare August 23, 2024 20:00

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 11, 2024

rok force-pushed the gh-41719_fix__metadata_file branch from d784b1e to d764b57 Compare September 26, 2024 22:20

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 26, 2024

rok force-pushed the gh-41719_fix__metadata_file branch 5 times, most recently from 7c4b5bb to ec833ef Compare September 30, 2024 13:59

rok force-pushed the gh-41719_fix__metadata_file branch from ec833ef to 35b22d6 Compare October 8, 2024 19:27

github-actions bot removed the Component: Python label Oct 8, 2024

rok force-pushed the gh-41719_fix__metadata_file branch 3 times, most recently from ec79220 to 366ffc9 Compare October 8, 2024 21:10

Initial commit

c015985

rok force-pushed the gh-41719_fix__metadata_file branch from 366ffc9 to c015985 Compare October 8, 2024 21:55

rok commented Oct 8, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 8, 2024

rok commented Oct 9, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41821

GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41821

rok commented May 24, 2024 •

edited by github-actions bot

Loading

github-actions bot commented May 24, 2024

rok commented Jun 2, 2024

AudriusButkevicius commented Jun 2, 2024 •

edited

Loading

AudriusButkevicius commented Jun 2, 2024

rok commented Jun 2, 2024

AudriusButkevicius commented Jun 2, 2024 •

edited

Loading

wgtmac commented Jun 3, 2024

wgtmac commented Jun 3, 2024

rok commented Jun 3, 2024

wgtmac commented Jun 3, 2024

rok commented Jun 3, 2024

rok commented Jun 3, 2024

AudriusButkevicius commented Jun 3, 2024 •

edited

Loading

AudriusButkevicius commented Jun 3, 2024

wgtmac commented Jun 3, 2024

AudriusButkevicius commented Jun 3, 2024

rok commented Jun 3, 2024 •

edited

Loading

ggershinsky commented Jun 3, 2024

pitrou commented Jun 3, 2024 •

edited

Loading

rok Aug 6, 2024

rok Aug 7, 2024

wgtmac Aug 7, 2024

rok Aug 7, 2024 •

edited

Loading

rok Aug 7, 2024

rok Sep 11, 2024

ggershinsky Sep 11, 2024

rok Oct 8, 2024

ggershinsky Oct 28, 2024

rok Oct 9, 2024

	if (file_encryption_properties->encrypted_footer()) {
	// encrypted footer
	file_metadata_ = metadata_->Finish(key_value_metadata_);

	PARQUET_ASSIGN_OR_THROW(int64_t position, sink_->Tell());
	uint64_t metadata_start = static_cast<uint64_t>(position);
	auto crypto_metadata = metadata_->GetCryptoMetaData();
	WriteFileCryptoMetaData(*crypto_metadata, sink_.get());

	auto footer_encryptor = file_encryptor_->GetFooterEncryptor();
	WriteEncryptedFileMetadata(*file_metadata_, sink_.get(), footer_encryptor, true);
	PARQUET_ASSIGN_OR_THROW(position, sink_->Tell());
	uint32_t footer_and_crypto_len = static_cast<uint32_t>(position - metadata_start);
	PARQUET_THROW_NOT_OK(
	sink_->Write(reinterpret_cast<uint8_t*>(&footer_and_crypto_len), 4));
	PARQUET_THROW_NOT_OK(sink_->Write(kParquetEMagic, 4));
	} else { // Encrypted file with plaintext footer

	ARROW_ASSIGN_OR_RAISE(auto path,
	FileFromRowGroup(filesystem.get(), base_path, *row_group,
	options.validate_column_chunk_paths));

GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41821

Are you sure you want to change the base?

GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41821

Conversation

rok commented May 24, 2024 • edited by github-actions bot Loading

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot commented May 24, 2024

rok commented Jun 2, 2024

AudriusButkevicius commented Jun 2, 2024 • edited Loading

AudriusButkevicius commented Jun 2, 2024

rok commented Jun 2, 2024

AudriusButkevicius commented Jun 2, 2024 • edited Loading

wgtmac commented Jun 3, 2024

wgtmac commented Jun 3, 2024

rok commented Jun 3, 2024

wgtmac commented Jun 3, 2024

rok commented Jun 3, 2024

rok commented Jun 3, 2024

AudriusButkevicius commented Jun 3, 2024 • edited Loading

AudriusButkevicius commented Jun 3, 2024

wgtmac commented Jun 3, 2024

AudriusButkevicius commented Jun 3, 2024

rok commented Jun 3, 2024 • edited Loading

ggershinsky commented Jun 3, 2024

pitrou commented Jun 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rok Aug 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rok commented May 24, 2024 •

edited by github-actions bot

Loading

AudriusButkevicius commented Jun 2, 2024 •

edited

Loading

AudriusButkevicius commented Jun 2, 2024 •

edited

Loading

AudriusButkevicius commented Jun 3, 2024 •

edited

Loading

rok commented Jun 3, 2024 •

edited

Loading

pitrou commented Jun 3, 2024 •

edited

Loading

rok Aug 7, 2024 •

edited

Loading