Skip to content
Advertisement

Simple example of Pandas ExtensionArray

It seems to me that Pandas ExtensionArrays would be one of the cases where a simple example to get one started would really help. However, I have not found a simple enough example anywhere.

Creating an ExtensionArray

To create an ExtensionArray, you need to

There is also a section in the Pandas documentation with a brief overview.

Example implementations

There are many examples of implementations:

Question

Despite having studied all of the above, I still find extension arrays difficult to understand. All of the examples have a lot of specifics and custom functionality that makes it difficult to work out what is actually necessary. I suspect many have faced a similar problem.

I am thus asking for a simple and minimal example of a working ExtensionArray. The class should pass all the tests Pandas have provided to check that the ExtensionArray behaves as expected. I’ve provided an example implementation of the tests below.

To have a concrete example, let’s say I want to extend ExtensionArray to obtain an integer array that is able to hold NA values. That is essentially IntegerArray, but stripped of any actual functionality beyond the basics of ExtensionArray.


Testing the solution

I have used the following fixtures & tests to test the validity of the solution. These are based on the directions in the Pandas documentation

JavaScript

Advertisement

Answer

Update 2021-09-19

There were too many issues trying to get NullableIntArray to pass the test suite, so I’ve created a new example (AngleDtype + AngleArray) that currently passes 398 tests (fails 2).


0. Usage

(pandas 1.3.2, numpy 1.20.2, python 3.9.2)

AngleArray stores either radians or degrees depending on its unit (represented by AngleDtype):

JavaScript
JavaScript

AngleArray can be stored in a Series or DataFrame:

JavaScript
JavaScript
JavaScript
JavaScript

AngleArray computations are unit-aware:

JavaScript
JavaScript

1. AngleDtype

For every ExtensionDtype, 3 methods must be implemented concretely:

  • type
  • name
  • construct_array_type

For a parameterized ExtensionDtype (e.g., AngleDtype.unit or PeriodDtype.freq):

For the test suite:

  • __hash__
  • __eq__
  • __setstate__
JavaScript

2. AngleArray

For every ExtensionArray, 11 methods must be implemented concretely:

  • _from_sequence
  • _from_factorized
  • __getitem__
  • __len__
  • __eq__
  • dtype
  • nbytes
  • isna
  • take
  • copy
  • _concat_same_type

For the test suite:

  • Many more concrete methods are needed
  • Whenever a test prompted me to add a new method, I marked it with a comment (though this is not a comprehensive mapping since most methods are required by multiple tests)
JavaScript

3. pytest

JavaScript

There are two remaining test failures:

  1. TestMethodsTests.test_combine_le

    Currently this returns an AngleDtype Series of boolean values, but pandas wants the Series itself to be boolean (not sure how to resolve this without breaking other tests):

    JavaScript
  2. TestSetitemTests.test_setitem_scalar_key_sequence_raise

    Currently this puts a[[0, 1]] into index 0, but pandas expects an error:

    JavaScript

    Several of the pandas extension arrays use convoluted validation methods to catch these edge cases, e.g.:

JavaScript
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement