
Pydantic for data validation? #502

Open
juanitorduz opened this issue Jan 26, 2024 · 5 comments


@juanitorduz
Collaborator

At the end of #498, we touched on a point that has been on my mind for a while now.

Shall we use pydantic for data validation?
I have worked with Pydantic on many projects, and I love it! It is super fast and actively maintained! See, for example, the data generation process in https://juanitorduz.github.io/multilevel_elasticities_single_sku/
This would provide a modern and elegant way to validate data (input data and parameters). If we agree on doing it, I would be happy to kick off this initiative 😄 .

@ColtAllen
Collaborator

Is it common for model parameters to be incorrectly specified? If not, I think pydantic is overkill. It's great for data pipelines, but although pydantic can validate that an input is a pandas DataFrame, it can't validate the contents of that DataFrame. The same goes for a dict for model_config.
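For illustration, the type-vs-contents distinction raised above can be sketched in plain Python (no pydantic; the list-of-dicts stand-in for a DataFrame and the `frequency` column name are hypothetical):

```python
# A "type-level" check only confirms the container's shape; validating
# the *contents* requires a separate custom check.

def is_table(obj) -> bool:
    # Type-level check: a list of dicts as a stand-in for a DataFrame.
    return isinstance(obj, list) and all(isinstance(r, dict) for r in obj)

def check_contents(rows) -> None:
    # Content-level check: every row needs a non-negative "frequency".
    for i, row in enumerate(rows):
        if row.get("frequency", -1) < 0:
            raise ValueError(f"row {i}: 'frequency' must be >= 0")

data = [{"frequency": 3}, {"frequency": 0}]
assert is_table(data)   # the type check passes...
check_contents(data)    # ...and the contents still need their own check
```

The point being that the second kind of check has to be written by hand either way; the question is only which library hosts it.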

@juanitorduz
Collaborator Author

juanitorduz commented Jan 26, 2024

Actually, I think you can add custom checks across fields (e.g. on the data frame). Look at the example I shared above; I have something like:

from __future__ import annotations

import pandas as pd
from pydantic import BaseModel, Field, field_validator


class Store(BaseModel):
    # Minimal stand-in; the full Store model is in the linked notebook.
    id: int = Field(..., ge=0)

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame({"store_id": [self.id]})


class Region(BaseModel):
    id: int = Field(..., ge=0)
    # `min_items` is deprecated in Pydantic v2; use `min_length`.
    stores: list[Store] = Field(..., min_length=1)
    median_income: float = Field(..., gt=0)

    @field_validator("stores")
    @classmethod
    def validate_store_ids(cls, value: list[Store]) -> list[Store]:
        if len({store.id for store in value}) != len(value):
            raise ValueError("stores must have unique ids")
        return value

    def to_dataframe(self) -> pd.DataFrame:
        df = pd.concat([store.to_dataframe() for store in self.stores], axis=0)
        df["region_id"] = self.id
        df["median_income"] = self.median_income
        return df.reset_index(drop=True)

Which is a custom check :)

@ColtAllen
Collaborator

ColtAllen commented Jan 26, 2024

I did look at it, and abandoned editing my previous post when you replied haha.

I've used pandera in the past for validating dataframes, but feel it's too specialized to add as a library requirement. In general, I'm in favor of keeping requirements to a minimum and not adding development overhead, unless this is a significant problem that we should go ahead and address.

On a related note, I created an issue to add a data validation utility method to the CLV module for users who provide their own RFM data, but I have other priorities at the moment.
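A minimal sketch of the kind of RFM validation utility mentioned above. The column names follow the usual CLV convention (`frequency`, `recency`, `T`, `monetary_value`) but are an assumption here, and the data is a plain dict-of-lists to keep the sketch dependency-free:

```python
# Hypothetical RFM validation helper, not the actual CLV module API.
REQUIRED_RFM_COLUMNS = ("frequency", "recency", "T", "monetary_value")

def validate_rfm(data: dict) -> None:
    # All required columns must be present.
    missing = [c for c in REQUIRED_RFM_COLUMNS if c not in data]
    if missing:
        raise ValueError(f"missing RFM columns: {missing}")
    # All columns must have the same length.
    if len({len(v) for v in data.values()}) != 1:
        raise ValueError("all columns must have the same length")
    # Recency cannot exceed the customer's observation period T.
    for rec, t in zip(data["recency"], data["T"]):
        if rec > t:
            raise ValueError("recency cannot exceed customer age T")

rfm = {
    "frequency": [2, 5],
    "recency": [10.0, 30.0],
    "T": [40.0, 52.0],
    "monetary_value": [25.0, 40.0],
}
validate_rfm(rfm)  # passes silently
```

Whether such checks live in a hand-rolled utility like this, in pydantic validators, or in a pandera schema is exactly the design question of this thread.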

@juanitorduz
Collaborator Author

juanitorduz commented Jan 26, 2024

> In general I'm in favor of keeping requirements to a minimum.

I also agree with this in general.

I think pydantic is a widely popular library, so depending on it is not as bad as depending on a very niche one. Still, it is a fair point.

I think the problem we want to solve is having a unified way to validate data and parameters. There is nothing wrong with how we are doing it now; it is more about a nicer API. Still, I do not have a very strong opinion. I will investigate more and see whether pydantic can bring us more benefits. I will also look into the data validation issue you mentioned.

Thanks for the feedback :)

@ferrine
Contributor

ferrine commented Jan 28, 2024

Pandera is great; it is as actively developed as Pydantic.
