best way to iterate through elements of pandas Series

All of the following seem to work for iterating through the elements of a pandas Series, and I’m sure there are more ways of doing it. What are the differences, and which is the best way?

Answer

TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.

However, if Series iteration is absolutely necessary, performance depends on the dtype and index:


For numpy-based Series, use s.to_numpy()

  1. If the Series is a python or numpy dtype, it’s usually fastest to iterate the underlying numpy ndarray:

    [Figures: iteration timings for datetime, int, float, float + nan, str, and string Series (no index)]
  2. To access the index, it’s actually fastest to enumerate() or zip() the numpy ndarray:

    Both are faster than the idiomatic s.items() / s.iteritems() (note that s.iteritems() was deprecated in pandas 1.5 and removed in 2.0):

    [Figure: iteration timings for datetime Series (with index)]
  3. To micro-optimize, switch to s.tolist() for shorter int/float/str Series:

    Warning: Do not use list(s). It does not use compiled code, which makes it slower.
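The three options above can be sketched as follows. This is a minimal illustration, not the answer's benchmark code; the Series `s` and its values are made up for the example.

```python
import pandas as pd

# Hypothetical example data (not the benchmark Series from the answer)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# 1. Values only: iterate the underlying numpy ndarray
values = [x for x in s.to_numpy()]

# 2. Index label + value: zip the index with the ndarray
pairs = [(i, x) for i, x in zip(s.index, s.to_numpy())]

#    Position + value: enumerate the ndarray
positions = [(i, x) for i, x in enumerate(s.to_numpy())]

# 3. Micro-optimization for shorter int/float/str Series:
#    tolist() converts to a plain list of python scalars
as_list = s.tolist()
```

Note that zip() pairs each value with its index label, while enumerate() yields positions, so the two only agree for a default RangeIndex.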


For pandas-based Series, use s.array or s.items()

Pandas extension dtypes contain extra (meta)data, e.g.:

pandas dtype    contents
Categorical     2 arrays
DatetimeTZ      array + timezone metadata
Interval        2 arrays
Period          array + frequency metadata
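To make the table concrete, here is a small sketch that inspects those extra pieces through `s.array`. The sample values are invented for illustration; the attributes used (`codes`, `categories`, `tz`, `freq`, `left`, `right`) are the public ones on the corresponding pandas arrays.

```python
import pandas as pd

# Categorical: integer codes + the categories they point to (2 arrays)
cat = pd.Series(["x", "y", "x"], dtype="category").array
codes, categories = cat.codes, cat.categories

# DatetimeTZ: datetime array + timezone metadata
tz = pd.Series(pd.date_range("2021-01-01", periods=2, tz="UTC")).array
timezone = tz.tz

# Period: ordinal array + frequency metadata
per = pd.Series(pd.period_range("2021-01", periods=2, freq="M")).array
frequency = per.freq

# Interval: left bounds + right bounds (2 arrays)
iv = pd.Series(pd.interval_range(0, 2)).array
left, right = iv.left, iv.right
```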

Converting these extension arrays to numpy “may be expensive” since it could involve copying/coercing the data, so:

  1. If the Series is a pandas extension dtype, it’s generally fastest to iterate the underlying pandas array:

    For example, with ~100 unique Categorical values:

    [Figures: iteration timings for Categorical, DatetimeTZ, Period, and Interval Series (no index)]
  2. To access the index, the idiomatic s.items() is very fast for pandas dtypes:

    [Figures: iteration timings for DatetimeTZ, Interval, and Period Series (with index)]
  3. To micro-optimize, switch to the slightly faster enumerate() for default-indexed Categorical arrays:

    [Figure: iteration timings for Categorical Series (with index)]
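A minimal sketch of the three options for extension dtypes follows. The Series `cat_s` and `tz_s` and their values are hypothetical and only serve to show the calls.

```python
import pandas as pd

# Hypothetical example data (not the benchmark Series from the answer)
cat_s = pd.Series(["x", "y", "x"], dtype="category")
tz_s = pd.Series(
    pd.date_range("2021-01-01", periods=3, tz="UTC"),
    index=["a", "b", "c"],
)

# 1. Values only: iterate the underlying pandas array
values = [x for x in cat_s.array]

# 2. Index label + value: the idiomatic s.items()
pairs = [(i, x) for i, x in tz_s.items()]

# 3. Default-indexed Categorical: enumerate() the pandas array
enumerated = [(i, x) for i, x in enumerate(cat_s.array)]
```

Because `tz_s` is iterated without converting to numpy, each value stays a timezone-aware `Timestamp`.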

Caveats

  1. Avoid using s.values:

    • Use s.to_numpy() to get the underlying numpy ndarray
    • Use s.array to get the underlying pandas array
  2. Avoid modifying the iterated Series:

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

  3. Avoid iterating manually whenever possible by instead:

    1. Vectorizing, (boolean) indexing, etc.

    2. Applying functions, e.g. with s.apply(), s.agg(), or s.transform():

      Note: These are not vectorizations despite the common misconception.

    3. Offloading to cython/numba
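As a rough illustration of caveats 1 and 3, the sketch below contrasts the explicit accessors with `s.values`, and a vectorized expression with `apply` and a manual loop. The data and the doubling operation are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
s = pd.Series([1.0, 2.0, 3.0])

# Caveat 1: be explicit about which underlying container you want
ndarray = s.to_numpy()   # always a numpy ndarray
pandas_array = s.array   # always a pandas ExtensionArray

# Caveat 3.1: vectorized arithmetic and boolean indexing, no python loop
vectorized = s * 2
masked = s[s > 1]

# Caveat 3.2: apply() calls the python function once per element,
# so it is a convenience, not a vectorization
applied = s.apply(lambda x: x * 2)

# The manual loop that the options above replace
looped = pd.Series([x * 2 for x in s.to_numpy()], index=s.index)
```

All three approaches produce the same result here; they differ in where the per-element work happens.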


Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code in snippet
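
As a rough sketch only, test Series of these dtypes could be generated and timed along the following lines. The size `n`, the seed, the value ranges, and the helper `time_iteration` are all assumptions made for illustration and are not the settings behind the timings above.

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000                       # assumed Series length
rng = np.random.default_rng(0)  # assumed seed

# Assumed test data, one Series per dtype
series = {
    "int": pd.Series(rng.integers(0, 100, n)),
    "float": pd.Series(rng.random(n)),
    "str": pd.Series(rng.integers(0, 100, n).astype(str)),
    "datetime": pd.Series(pd.date_range("2021-01-01", periods=n, freq="D")),
    # at most 100 unique categories
    "Categorical": pd.Series(rng.integers(0, 100, n)).astype("category"),
    "DatetimeTZ": pd.Series(
        pd.date_range("2021-01-01", periods=n, freq="D", tz="UTC")
    ),
    "Period": pd.Series(pd.period_range("2021-01-01", periods=n, freq="D")),
}


def time_iteration(make_iterable, number=10):
    """Hypothetical helper: total seconds for `number` full passes.

    The conversion (e.g. to_numpy()) happens inside the timed call,
    so its cost is included in the result.
    """
    return timeit.timeit(lambda: [x for x in make_iterable()], number=number)


timings = {
    name: {
        "to_numpy": time_iteration(s.to_numpy),
        "array": time_iteration(lambda s=s: s.array),
    }
    for name, s in series.items()
}
```

Absolute numbers from such a run depend on the machine and library versions, so only the relative ordering per dtype is meaningful.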

User contributions licensed under: CC BY-SA