
Pydantic for data validation? #502

Open
juanitorduz opened this issue Jan 26, 2024 · 5 comments


@juanitorduz
Collaborator

At the end of #498, we touched on a point that has been on my mind for a while now.

Shall we use pydantic for data validation?
I have worked with Pydantic on many projects, and I love it! It is super fast and actively maintained! See, for example, the data generation process in https://juanitorduz.github.io/multilevel_elasticities_single_sku/
This would provide a modern and elegant way to validate data (input data and parameters). If we agree on doing it, I would be happy to kick off this initiative 😄 .

@ColtAllen
Collaborator

Is it common for model parameters to be incorrectly specified? If not, I think pydantic is overkill. It's great for data pipelines, but although pydantic can validate that an input is a pandas DataFrame, it can't validate the contents of that DataFrame. The same goes for a dict for model_config.
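For illustration, the type-vs-contents distinction raised above can be sketched in plain Python (no pydantic; the list-of-dicts stand-in for a DataFrame and the `frequency` column name are hypothetical):

```python
# A "type-level" check only confirms the container's shape; validating
# the *contents* requires a separate custom check.

def is_table(obj) -> bool:
    # Type-level check: a list of dicts as a stand-in for a DataFrame.
    return isinstance(obj, list) and all(isinstance(r, dict) for r in obj)

def check_contents(rows) -> None:
    # Content-level check: every row needs a non-negative "frequency".
    for i, row in enumerate(rows):
        if row.get("frequency", -1) < 0:
            raise ValueError(f"row {i}: 'frequency' must be >= 0")

data = [{"frequency": 3}, {"frequency": 0}]
assert is_table(data)   # the type check passes...
check_contents(data)    # ...and the contents still need their own check
```

The point being that the second kind of check has to be written by hand either way; the question is only which library hosts it.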

@juanitorduz
Collaborator Author

juanitorduz commented Jan 26, 2024

Actually, I think you can add custom checks across fields (e.g. on the data frame). Look at the example I shared above; I have something like:

from __future__ import annotations

import pandas as pd
from pydantic import BaseModel, Field, field_validator


class Store(BaseModel):
    # Minimal stand-in; the full Store model is in the linked notebook.
    id: int = Field(..., ge=0)

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame({"store_id": [self.id]})


class Region(BaseModel):
    id: int = Field(..., ge=0)
    # `min_items` is deprecated in Pydantic v2; use `min_length`.
    stores: list[Store] = Field(..., min_length=1)
    median_income: float = Field(..., gt=0)

    @field_validator("stores")
    @classmethod
    def validate_store_ids(cls, value: list[Store]) -> list[Store]:
        if len({store.id for store in value}) != len(value):
            raise ValueError("stores must have unique ids")
        return value

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.concat([store.to_dataframe() for store in self.stores], axis=0)
        df["region_id"] = self.id
        df["median_income"] = self.median_income
        return df.reset_index(drop=True)

Which is a custom check :)

@ColtAllen
Collaborator

ColtAllen commented Jan 26, 2024

I did look at it, and abandoned editing my previous post when you replied haha.

I've used pandera in the past for validating dataframes, but feel it's too specialized to add as a library requirement. In general, I'm in favor of keeping requirements to a minimum and not adding development overhead, unless this is a significant problem that we should go ahead and address.

On a related note, I created an issue to add a data validation utility method to the CLV module for users who provide their own RFM data, but I have other priorities at the moment.
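A minimal sketch of the kind of RFM validation utility mentioned above. The column names follow the usual CLV convention (`frequency`, `recency`, `T`, `monetary_value`) but are an assumption here, and the data is a plain dict-of-lists to keep the sketch dependency-free:

```python
# Hypothetical RFM validation helper, not the actual CLV module API.
REQUIRED_RFM_COLUMNS = ("frequency", "recency", "T", "monetary_value")

def validate_rfm(data: dict) -> None:
    # All required columns must be present.
    missing = [c for c in REQUIRED_RFM_COLUMNS if c not in data]
    if missing:
        raise ValueError(f"missing RFM columns: {missing}")
    # All columns must have the same length.
    if len({len(v) for v in data.values()}) != 1:
        raise ValueError("all columns must have the same length")
    # Recency cannot exceed the customer's observation period T.
    for rec, t in zip(data["recency"], data["T"]):
        if rec > t:
            raise ValueError("recency cannot exceed customer age T")

rfm = {
    "frequency": [2, 5],
    "recency": [10.0, 30.0],
    "T": [40.0, 52.0],
    "monetary_value": [25.0, 40.0],
}
validate_rfm(rfm)  # passes silently
```

Whether such checks live in a hand-rolled utility like this, in pydantic validators, or in a pandera schema is exactly the design question of this thread.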

@juanitorduz
Collaborator Author

juanitorduz commented Jan 26, 2024

> In general I'm in favor of keeping requirements to a minimum.

I also agree with this in general.

I think pydantic is a widely popular library, so depending on it is not as bad as depending on a very niche one. Still, it is a fair point.

I think the problem we want to solve is having a unified way to validate data and parameters. There is nothing wrong with how we are doing it now; it is more about a nicer API. Still, I do not have a very strong opinion. I will investigate more and see whether pydantic can bring us more benefits. I will also look into the data validation issue you mentioned.

Thanks for the feedback :)

@ferrine
Contributor

ferrine commented Jan 28, 2024

Pandera is great; it is as actively developed as Pydantic.
