I’ve been looking for robust type hints for a pandas DataFrame, but cannot seem to find anything useful. The question "Pythonic type hints with pandas?" barely scratches the surface.
Normally, if I want to hint the type of a function that takes a DataFrame as an input argument, I would do:
import pandas as pd

def func(arg: pd.DataFrame) -> int:
    return 1
What I cannot seem to find is how to type hint a DataFrame with mixed dtypes. The DataFrame constructor only accepts a single dtype for the whole DataFrame, so to my knowledge per-column dtypes can only be set afterwards with the pd.DataFrame().astype(dtype={}) method.
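To make the limitation concrete, here is a minimal sketch. The column names and values are made up for illustration. The constructor's `dtype` argument applies one dtype to every column, while `astype` accepts a per-column mapping.

```python
import pandas as pd

# The constructor's dtype argument applies a single dtype to all columns.
uniform = pd.DataFrame({"integer": [1, 2], "ratio": [3, 4]}, dtype="float64")

# Per-column dtypes are set afterwards by passing a mapping to astype.
mixed = pd.DataFrame({"integer": ["1", "2"], "ratio": ["0.5", "1.5"]}).astype(
    {"integer": "int64", "ratio": "float64"}
)

# Collect the resulting dtypes as plain strings for inspection.
uniform_dtypes = {col: str(dt) for col, dt in uniform.dtypes.items()}
mixed_dtypes = {col: str(dt) for col, dt in mixed.dtypes.items()}
print(uniform_dtypes)
print(mixed_dtypes)
```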
This works, but doesn’t seem very Pythonic to me:
import datetime
import pandas as pd

def func(
    arg: pd.DataFrame(columns=['integer', 'date']).astype(
        dtype={'integer': int, 'date': datetime.date}
    )
) -> int:
    return 1
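One way to see why this pattern feels off is to inspect what the annotation actually holds. The sketch below uses a simplified single-column variant of the snippet above (an assumption made to keep it short). It shows that the annotation is an empty DataFrame instance rather than a type, and that nothing checks the argument when the function is called.

```python
import pandas as pd

def func(arg: pd.DataFrame(columns=["integer"]).astype({"integer": "int64"})) -> int:
    return 1

# The annotation is an ordinary expression; its value is an empty
# DataFrame instance, not a type that a checker could reason about.
annotation = func.__annotations__["arg"]
is_instance = isinstance(annotation, pd.DataFrame)

# Nothing validates the argument at call time: a frame with completely
# different columns is accepted without complaint.
result = func(pd.DataFrame({"unrelated": ["x"]}))
```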
I came across this package: https://pypi.org/project/dataenforce/ with examples such as this one:
from dataenforce import Dataset

def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
    pass
This looks somewhat promising, but sadly the project is old and buggy.
As a data scientist building a machine learning application with long ETL processes, I believe that type hints are important.
What do you use and does anybody type hint their dataframes in pandas?
Answer
I have now found the pandera library that seems very promising:
https://github.com/pandera-dev/pandera
It allows users to create schemas and use those schemas to create verbose checks. From their docs:
https://pandera.readthedocs.io/en/stable/schema_models.html
import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series


class InputSchema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)


class OutputSchema(InputSchema):
    revenue: Series[float]


@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)


df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
transform(invalid_df)
Also a note from them:
Due to current limitations in the pandas library (see discussion here), pandera annotations are only used for run-time validation and cannot be leveraged by static-type checkers like mypy. See the discussion here for more details.
Still, even though there is no static type checking, I think this is going in a very good direction.
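To illustrate what run-time validation means, as opposed to static checking, here is a minimal hand-rolled sketch in plain pandas. It is not pandera's implementation or API. The names `check_dtypes`, `EXPECTED` and `total_revenue` are hypothetical and exist only for this example.

```python
import functools

import pandas as pd

# Hypothetical expected schema: column name -> dtype string.
EXPECTED = {"year": "int64", "revenue": "float64"}


def check_dtypes(expected):
    """Decorator that checks the first argument's column dtypes on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            for column, dtype in expected.items():
                if column not in df.columns:
                    raise TypeError(f"missing column {column!r}")
                actual = str(df[column].dtype)
                if actual != dtype:
                    raise TypeError(
                        f"column {column!r} has dtype {actual}, expected {dtype}"
                    )
            return func(df, *args, **kwargs)
        return wrapper
    return decorator


@check_dtypes(EXPECTED)
def total_revenue(df: pd.DataFrame) -> float:
    return float(df["revenue"].sum())


# A frame matching the expected dtypes passes the check.
good = pd.DataFrame({"year": [2001, 2002], "revenue": [100.0, 50.0]})
total = total_revenue(good)

# A frame whose "year" column holds strings is rejected when called.
bad = pd.DataFrame({"year": ["2001", "2002"], "revenue": [100.0, 50.0]})
try:
    total_revenue(bad)
    error = None
except TypeError as exc:
    error = str(exc)
```

The mismatch only surfaces when `total_revenue(bad)` actually runs. That is the trade-off the note above describes: a static checker such as mypy would not flag it beforehand.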