[BUG] Empty or null list(s) results in scrambled data #120

Open
chris-branch opened this issue Sep 30, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@chris-branch

Parquet Viewer Version
2.10.1.1

Where was the parquet file created?
Parquet.NET

Description
ParquetViewer appears to parse list/array columns incorrectly. If a column is a list/array type and any row's value for that column is either empty (i.e., 0 elements) or null, ParquetViewer shows the data mixed up across rows. Examples:

In all examples below, assume the following schema:

    internal class TestRow
    {
        public string Column1 { get; set; }
        public List<double> Column2 { get; set; }

        public TestRow(string column1, List<double> column2)
        {
            Column1 = column1;
            Column2 = column2;
        }
    }

Example 1: This has no nulls or empty values and works as expected:

    List<TestRow> data1 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
        new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data1, @"sample1.parquet").Wait();

[Image: sample1]

Example 2: This has an empty list in row 1 and results in scrambled data in rows 1-3:

    List<TestRow> data2 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double>()),
        new TestRow("Row 2", new List<double> { 6, 7, 8, 9, 10 }),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data2, @"sample2.parquet").Wait();

[Image: sample2]

Example 3: This has an empty list in row 2 and results in scrambled data in rows 2-3:

    List<TestRow> data3 = new List<TestRow>
    {
        new TestRow("Row 1", new List<double> { 1, 2, 3, 4, 5 }),
        new TestRow("Row 2", new List<double>()),
        new TestRow("Row 3", new List<double> { 11, 12, 13, 14, 15 })
    };
    ParquetSerializer.SerializeAsync(data3, @"sample3.parquet").Wait();

[Image: sample3]
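
For anyone digging into this, here is a minimal sketch of how list rows are normally rebuilt from Parquet repetition and definition levels. It is not ParquetViewer's code, and I have not confirmed where the actual bug is. The level meanings assume an optional list of required doubles (max definition level 2), and `assemble_rows` / `assemble_rows_naive` are hypothetical names made up for this illustration. The naive variant shows one way a reader could go wrong: an empty or null list occupies a level entry but stores no value, so counting it as a value shifts everything after it.

```python
from typing import List, Optional

# Assumed level meanings for an optional list of required doubles
# (max repetition level 1, max definition level 2). This layout is an
# assumption for illustration, not read from the sample files.
DEF_NULL_LIST = 0   # the list itself is null
DEF_EMPTY_LIST = 1  # the list exists but has no elements
DEF_VALUE = 2       # an element is present


def assemble_rows(rep_levels: List[int], def_levels: List[int],
                  values: List[float]) -> List[Optional[List[float]]]:
    """Rebuild one list (or None) per row from levels plus the packed values."""
    rows: List[Optional[List[float]]] = []
    next_value = 0  # index into `values`, which holds only present elements
    for rep, d in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 always starts a new row.
            rows.append(None if d == DEF_NULL_LIST else [])
        if d == DEF_VALUE:
            # Only a fully defined entry consumes a stored value.
            rows[-1].append(values[next_value])
            next_value += 1
    return rows


def assemble_rows_naive(rep_levels: List[int], def_levels: List[int],
                        values: List[float]) -> List[List[float]]:
    """Hypothetical buggy variant: treats every level entry as one stored value."""
    rows: List[List[float]] = []
    for i, rep in enumerate(rep_levels):
        if rep == 0:
            rows.append([])
        if i < len(values):
            rows[-1].append(values[i])
    return rows


# Levels and values for Example 3: row 2 is an empty list.
rep = [0, 1, 1, 1, 1] + [0] + [0, 1, 1, 1, 1]
dfn = [2, 2, 2, 2, 2] + [1] + [2, 2, 2, 2, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 11.0, 12.0, 13.0, 14.0, 15.0]

print(assemble_rows(rep, dfn, vals))        # three rows, the middle one empty
print(assemble_rows_naive(rep, dfn, vals))  # row 1 intact, later rows misaligned
```

With the naive variant, row 1 survives and every row from the empty list onward is misaligned, which is the same shape of failure as Example 3. It is not meant to reproduce the exact values in the screenshot.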

Sample files
sample_parquets.zip

chris-branch added the bug label on Sep 30, 2024
@AndreiYachmeneu

Here is another example:

    import pyarrow as pa
    import pyarrow.parquet as pq

    arr = pa.array([["dog", "cat"], [], None], type=pa.list_(pa.string()))
    tbl = pa.table([arr], names=['animals'])

    pq.write_table(tbl, "animals.parquet")
    print(pq.read_table("animals.parquet").to_pandas())

[Image]

None displays as [] in ParquetViewer:
[Image]
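
To show why None and [] are easy to confuse on the reading side, here is a rough pure-Python sketch of how the three rows above could be flattened into repetition and definition levels. The level meanings assume an optional list of optional strings (max definition level 3), and `shred` is a hypothetical helper written for this comment, not a pyarrow or ParquetViewer function.

```python
from typing import List, Optional, Tuple

# Assumed definition levels for an optional list of optional strings
# (an assumption for illustration, not read from animals.parquet):
#   0 = list is null, 1 = list is empty, 2 = element is null, 3 = element present
Row = Optional[List[Optional[str]]]


def shred(rows: List[Row]) -> Tuple[List[int], List[int], List[str]]:
    """Flatten rows into repetition levels, definition levels, and stored values."""
    rep_levels: List[int] = []
    def_levels: List[int] = []
    values: List[str] = []
    for row in rows:
        if row is None:
            # Null list: one level entry, no stored value.
            rep_levels.append(0)
            def_levels.append(0)
        elif not row:
            # Empty list: also one level entry and no stored value.
            rep_levels.append(0)
            def_levels.append(1)
        else:
            for i, item in enumerate(row):
                # The first element starts the row; the rest continue the list.
                rep_levels.append(0 if i == 0 else 1)
                if item is None:
                    def_levels.append(2)
                else:
                    def_levels.append(3)
                    values.append(item)
    return rep_levels, def_levels, values


rep, dfn, vals = shred([["dog", "cat"], [], None])
print(rep, dfn, vals)
```

Under these assumptions the empty row and the null row each produce a single entry with no value, and they differ only in definition level (1 versus 0). A reader that checks only whether a value is present cannot tell them apart, which would be consistent with None showing up as [].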
