Subclassing Pandas classes seems a common need, but I could not find references on the subject. (It seems that Pandas developers are still working on it: Easier subclassing #60.)
There are some SO questions on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:
- calling standard DataFrame methods on instances of MyDF should produce instances of MyDF
- calling standard DataFrame methods on instances of MyDF should leave all attributes still attached to the output
(And are there any significant differences for subclassing pandas.Series?)
Code for subclassing pd.DataFrame:

    import numpy as np
    import pandas as pd

    class MyDF(pd.DataFrame):
        # how to subclass pandas DataFrame?
        pass

    mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
    print(type(mydf))  # <class '__main__.MyDF'>

    # Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
    # should produce instances of MyDF.
    mydf_sub = mydf[['A','C']]
    print(type(mydf_sub))  # <class 'pandas.core.frame.DataFrame'>

    # Requirement 2: Attributes attached to instances of MyDF, when calling standard
    # methods of DataFrame, should still attach to the output.
    mydf.myattr = 1
    mydf_cp1 = MyDF(mydf)
    mydf_cp2 = mydf.copy()
    print(hasattr(mydf_cp1, 'myattr'))  # False
    print(hasattr(mydf_cp2, 'myattr'))  # False
Answer
There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.
The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas
The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py
As in HYRY’s answer, it seems there are two things you’re trying to accomplish:
- When calling methods on an instance of your class, return instances of the correct type (your type). For this, you can just add the _constructor property, which should return your type.
- Adding attributes which will be attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.
Here’s an example:
    from pandas import DataFrame

    class SubclassedDataFrame(DataFrame):
        # Attribute names listed in _metadata are propagated to results
        _metadata = ['added_property']
        added_property = 1  # This will be passed to copies

        @property
        def _constructor(self):
            return SubclassedDataFrame
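To tie this back to the two requirements in the question, here is a minimal sketch that applies the same pattern to the question's MyDF and also touches on the Series side. The names MySeries and myattr are only illustrative, and the _constructor_sliced / _constructor_expanddim properties follow the pattern described in the official guide as I understand it; exact behaviour may differ between pandas versions, so treat this as a starting point rather than a definitive implementation.

```python
import numpy as np
import pandas as pd


class MySeries(pd.Series):
    # Hypothetical companion Series subclass (name chosen for illustration).
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series is expanded into a frame, e.g. via to_frame().
        return MyDF


class MyDF(pd.DataFrame):
    # Attribute names listed here are copied onto results by pandas.
    _metadata = ["myattr"]

    @property
    def _constructor(self):
        return MyDF

    @property
    def _constructor_sliced(self):
        # Used when a single column (or row) is extracted as a Series.
        return MySeries


mydf = MyDF(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
mydf.myattr = 1

# Requirement 1: standard methods should return the subclass.
mydf_sub = mydf[["A", "C"]]
print(type(mydf_sub).__name__)

# Requirement 2: attributes named in _metadata should survive.
mydf_cp = mydf.copy()
print(hasattr(mydf_cp, "myattr"))

# Selecting one column goes through _constructor_sliced.
col = mydf["A"]
print(type(col).__name__)
```

The main difference for Series is which extra constructor you override: a DataFrame subclass points _constructor_sliced at the Series subclass, while a Series subclass points _constructor_expanddim back at the DataFrame subclass. Attributes that are not listed in _metadata are not expected to be carried over.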