Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avsc_to_pydantic issue with nullable nested schemas #112

Open
myjniak opened this issue Jun 27, 2024 · 3 comments
Open

avsc_to_pydantic issue with nullable nested schemas #112

myjniak opened this issue Jun 27, 2024 · 3 comments

Comments

@myjniak
Copy link

myjniak commented Jun 27, 2024

When encountering nested schemas, (type "record"), avsc_to_pydantic gives pydantic class attributes and the class name for that attribute exactly the same name.
According to this comment, pydantic might not support that:
pydantic/pydantic#7871 (comment)
you can't name a field to have the same name as a model

Here's a code to reproduce this:

from pydantic_avro.avro_to_pydantic import avsc_to_pydantic

schema = {
    "type": "record",
    "name": "my_schema",
    "namespace": "my_namespace",
    "fields": [
        {
            "name": "nested",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "nested",
                    "namespace": "my_namespace.nested",
                    "fields": [
                        {
                            "name": "my_field",
                            "type": ["null", "string"],
                            "default": None
                        }
                    ]
                }
            ],
            "default": None
        }
    ]
}

pydantic_class: str = avsc_to_pydantic(schema)
print(pydantic_class)

Then we get this in return:

from datetime import date, datetime, time
from decimal import Decimal
from enum import Enum
from typing import List, Optional, Dict, Union
from uuid import UUID

from pydantic import BaseModel, Field


class nested(BaseModel):
    my_field: Optional[str] = None


class my_schema(BaseModel):
    nested: Optional[nested] = None

which will result in an exception, when instantiated:

my_schema(nested={"my_field": "value"})
Traceback (most recent call last):
  File "C:\Users\myjni\PycharmProjects\brudnopis\pydanticavrobug.py", line 52, in <module>
    pydantic_instance = my_schema(nested={"my_field": "value"})
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\myjni\PycharmProjects\brudnopis\.venv\Lib\site-packages\pydantic\main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for my_schema
nested
  Input should be None [type=none_required, input_value={'my_field': 'value'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/none_required

The submodel class names should probably be transparent for the avsc_to_pydantic user, so my proposition to solve this problem would be to add a "Model" postfix to pydantic submodel class names:

class nestedModel(BaseModel):
    my_field: Optional[str] = None


class my_schema(BaseModel):
    nested: Optional[nestedModel] = None

If the class names are important for some reason though, then maybe field aliasing could help? Field(alias="something")
Let me know what do you think :)

@myjniak
Copy link
Author

myjniak commented Jul 11, 2024

I am open to make a Pull request for this as long as I have your opinion and guidelines :)

@myjniak
Copy link
Author

myjniak commented Jul 11, 2024

After some more investigation I think the problem can be more generalized to repetition of field names and class names.
Here's an example of class name overlap:

{
    "type": "record",
    "name": "possesions",
    "namespace": "my_namespace",
    "fields": [
        {
            "name": "car",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "car",
                    "fields": [
                        {
                            "name": "id",
                            "type": [
                                "null",
                                {
                                    "type": "record",
                                    "name": "id",
                                    "fields": [
                                        {
                                            "name": "vin",
                                            "type": ["null", "string"],
                                            "default": null
                                        },
                                        {
                                            "name": "vrn",
                                            "type": ["null", "string"],
                                            "default": null
                                        }
                                    ]
                                }
                            ],
                            "default": null
                        }
                    ]
                }
            ],
            "default": null
        },
        {
            "name": "house",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "house",
                    "fields": [
                        {
                            "name": "id",
                            "type": [
                                "null",
                                {
                                    "type": "record",
                                    "name": "id",
                                    "fields": [
                                        {
                                            "name": "land_and_mortgage_register_number",
                                            "type": ["null", "string"],
                                            "default": null
                                        }
                                    ]
                                }
                            ],
                            "default": null
                        }
                    ]
                }
            ],
            "default": null
        }
    ]
}

My idea to fix this would be to change class naming scheme to contain parent names as well.
This should assure name uniqueness. In the above example we could generate this:

class possesions_car_id(BaseModel):
    vin: Optional[str] = None
    vrn: Optional[str] = None


class possesions_car(BaseModel):
    id: Optional[possesions_car_id] = None


class possesions_house_id(BaseModel):
    land_and_mortgage_register_number: Optional[str] = None


class possesions_house(BaseModel):
    id: Optional[possesions_house_id] = None


class possesions(BaseModel):
    car: Optional[possesions_car] = None
    house: Optional[possesions_house] = None

while currently the 2nd section named "id" overwrites the 1st one:

class id(BaseModel):
    land_and_mortgage_register_number: Optional[str] = None


class car(BaseModel):
    id: Optional[id] = None


class house(BaseModel):
    id: Optional[id] = None


class possesions(BaseModel):
    car: Optional[car] = None
    house: Optional[house] = None

@dan1elt0m
Copy link
Collaborator

Hey @myjniak, Thanks for opening this issue and the clear description! Models currently have to have a unique name, because they are referred to by their name. This is necessary as the models are exported to a single file. In addition, as you mentioned, Pydantic doesn't support having two different submodels with the same name in a single model.

So what to do about it? I agree, we shouldn't just ignore the bug.

I don't see how aliasing would solve the issue. So a fix indeed could be to rename non-unique model names. Problem is that we don't know upfront which names collide. In addition, current implementation cycles through fields from top level to bottom and fields don't store information about higher level fields. I got it locally to work with just comparing current model with existing model in the class registry and adding a hash to class name based on its children if it's different.

As I see it, To add parent info to the model name would require a preprocess iteration over the Avro schema. In this iterations the colliding classes are identified along with their parents. I welcome a PR to implement this. I can provide my WIP if you'd like.

Before this is finished, I think pydantic-avro should probably raise an error on name collision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants