
Type hints for a pandas DataFrame with mixed dtypes

I’ve been looking for robust type hints for a pandas DataFrame, but cannot seem to find anything useful. The question “Pythonic type hints with pandas?” barely scratches the surface.

Normally, if I want to type hint a function that takes a DataFrame as an input argument, I would do:

import pandas as pd

def func(arg: pd.DataFrame) -> int:
    return 1

What I cannot seem to find is how to type hint a DataFrame with mixed dtypes. The DataFrame constructor only accepts a single dtype for the complete DataFrame. So, to my knowledge, per-column dtypes can only be set afterwards with pd.DataFrame().astype(dtype={}).
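To make the per-column case concrete, here is a minimal sketch of setting mixed dtypes after construction with astype and a column-to-dtype mapping. The column names and sample values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: both columns start out as strings.
raw = pd.DataFrame({
    "integer": ["1", "2"],
    "date": ["2021-01-01", "2021-01-02"],
})

# astype accepts a mapping of column name -> dtype, so each column
# can be converted to a different dtype in a single call.
typed = raw.astype({"integer": "int64", "date": "datetime64[ns]"})

print(typed.dtypes)
```

This only converts the data. It does not give a static type checker any information about the columns, which is the gap the question is about.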

This works, but doesn’t seem very Pythonic to me:

import datetime
def func(arg: pd.DataFrame(columns=['integer', 'date']).astype(dtype={'integer': int, 'date': datetime.date})) -> int:
    return 1

I came across this package: https://pypi.org/project/dataenforce/ with examples such as this one:

from dataenforce import Dataset

def process_data(data: Dataset["id": int, "name": object, "latitude": float, "longitude": float]):
    pass

This looks somewhat promising, but sadly the project is old and buggy.

As a data scientist building a machine learning application with long ETL processes, I believe that type hints are important.

What do you use, and does anybody type hint their DataFrames in pandas?


Answer

I have now found the pandera library that seems very promising:

https://github.com/pandera-dev/pandera

It allows users to create schemas and use those schemas to create verbose checks. From their docs:

https://pandera.readthedocs.io/en/stable/schema_models.html

import pandas as pd
import pandera as pa
from pandera.typing import Index, DataFrame, Series


# Note: newer pandera releases rename SchemaModel to DataFrameModel.
class InputSchema(pa.SchemaModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)

class OutputSchema(InputSchema):
    revenue: Series[float]

@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=100.0)


df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})

transform(df)

invalid_df = pd.DataFrame({
    "year": ["2001", "2002", "1999"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
})
# Fails validation with a pandera SchemaError: the year 1999 violates gt=2000.
transform(invalid_df)

Also a note from them:

Due to current limitations in the pandas library (see discussion here), pandera annotations are only used for run-time validation and cannot be leveraged by static-type checkers like mypy. See the discussion here for more details.

Still, even though there is no static type checking, I think this is going in a very good direction.
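If adding a dependency is not an option, the same run-time idea can be approximated by hand. The following is only a rough sketch, not pandera's implementation: the decorator name expect_dtypes and the column names are hypothetical, and it compares dtype names as plain strings.

```python
import functools

import pandas as pd


def expect_dtypes(**expected):
    """Hypothetical decorator: check column dtypes of the first argument at call time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            for column, dtype in expected.items():
                if column not in df.columns:
                    raise TypeError(f"missing column {column!r}")
                actual = str(df[column].dtype)
                if actual != dtype:
                    raise TypeError(
                        f"column {column!r} has dtype {actual}, expected {dtype}"
                    )
            return func(df, *args, **kwargs)
        return wrapper
    return decorator


@expect_dtypes(integer="int64", date="datetime64[ns]")
def func(arg: pd.DataFrame) -> int:
    return len(arg)


frame = pd.DataFrame({
    "integer": pd.Series([1, 2], dtype="int64"),
    "date": pd.Series(["2021-01-01", "2021-01-02"], dtype="datetime64[ns]"),
})
print(func(frame))
```

Like pandera, this only checks at call time. The annotation seen by a static checker is still the plain pd.DataFrame.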

User contributions licensed under: CC BY-SA