
best way to iterate through elements of pandas Series

All of the following seem to be working for iterating through the elements of a pandas Series. I’m sure there are more ways of doing it. What are the differences and which is the best way?

import pandas


arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

# 1
for el in arr:
    print(el)

# 2
for _, el in arr.iteritems():
    print(el)

# 3
for el in arr.array:
    print(el)

# 4
for el in arr.values:
    print(el)

# 5
for i in range(len(arr)):
    print(arr.iloc[i])


Answer

TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.

However, if Series iteration is absolutely necessary, performance will depend on the dtype and the index:


For numpy-based Series, use s.to_numpy()

  1. If the Series is a python or numpy dtype, it’s usually fastest to iterate the underlying numpy ndarray:

    for el in s.to_numpy(): # if dtype is datetime, int, float, str, string
    
    [Benchmark plots: iteration timings for datetime, int, float, float + nan, str, and string Series (no index)]
  2. To access the index, it’s actually fastest to enumerate() or zip() the numpy ndarray:

    for i, el in enumerate(s.to_numpy()): # if default range index
    
    for i, el in zip(s.index, s.to_numpy()): # if custom index
    

    Both are faster than the idiomatic s.items() / s.iteritems():

    [Benchmark plot: iteration timings for datetime Series (with index)]
  3. To micro-optimize, switch to s.tolist() for shorter int/float/str Series:

    for el in s.to_numpy(): # if >100K elements
    
    for el in s.tolist(): # to micro-optimize if <100K elements
    

    Warning: Do not use list(s): unlike s.tolist(), it does not go through compiled code, which makes it slower (see the timing sketch below).
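
A minimal timing sketch of options 1–3 above (not the original benchmark code; the Series length, dtype, and repeat count here are arbitrary choices):

import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(1_000_000_000, size=100_000))

stmts = {
    's.to_numpy()': 'for el in s.to_numpy(): pass',
    's.tolist()': 'for el in s.tolist(): pass',
    'enumerate(s.to_numpy())': 'for i, el in enumerate(s.to_numpy()): pass',
    's.items()': 'for i, el in s.items(): pass',
    'list(s)  # avoid': 'for el in list(s): pass',
}

for label, stmt in stmts.items():
    t = timeit.timeit(stmt, number=20, globals={'s': s})
    print(f'{label:>24}: {t / 20 * 1000:.2f} ms per loop')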


For pandas-based Series, use s.array or s.items()

Pandas extension dtypes contain extra (meta)data, e.g.:

pandas dtype   contents
------------   -------------------------
Categorical    2 arrays
DatetimeTZ     array + timezone metadata
Interval       2 arrays
Period         array + frequency metadata
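
As a quick illustration, these extra pieces are exposed as standard attributes on the extension array:

import pandas as pd

cat = pd.Series(list('abac')).astype('category')
print(cat.array.codes)       # array 1: the integer codes, e.g. [0 1 0 2]
print(cat.array.categories)  # array 2: the unique categories ['a', 'b', 'c']

tz = pd.Series(pd.date_range('2021-01-01', periods=3, tz='CET'))
print(tz.array.tz)           # the timezone metadata: CET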

Converting these extension arrays to numpy “may be expensive” since it could involve copying/coercing the data, so:

  1. If the Series is a pandas extension dtype, it’s generally fastest to iterate the underlying pandas array:

    for el in s.array: # if dtype is pandas-only extension
    

    For example, with ~100 unique Categorical values:

    [Benchmark plots: iteration timings for Categorical, DatetimeTZ, Period, and Interval Series (no index)]
  2. To access the index, the idiomatic s.items() is very fast for pandas dtypes:

    for i, el in s.items(): # if need index for pandas-only dtype
    
    [Benchmark plots: iteration timings for DatetimeTZ, Interval, and Period Series (with index)]
  3. To micro-optimize, switch to the slightly faster enumerate() for default-indexed Categorical arrays:

    for i, el in enumerate(s.array): # to micro-optimize Categorical dtype if need default range index
    
    [Benchmark plot: iteration timings for Categorical Series (with index)]
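
Putting these together, a short sketch of the pandas-dtype recommendations (using a Categorical Series similar to the testing data below):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(100, size=100_000)).astype('category')

# 1. Iterate the extension array directly; s.to_numpy() would first
#    materialize a new ndarray, which "may be expensive" for these dtypes.
for el in s.array:
    pass

# 2. Idiomatic, fast index-aware iteration for pandas extension dtypes:
for i, el in s.items():
    pass

# 3. Micro-optimization when a default range index is enough:
for i, el in enumerate(s.array):
    pass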

Caveats

  1. Avoid using s.values:

    • Use s.to_numpy() to get the underlying numpy ndarray
    • Use s.array to get the underlying pandas array
  2. Avoid modifying the iterated Series:

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

  3. Avoid iterating manually whenever possible by instead:

    1. Vectorizing, (boolean) indexing, etc.

    2. Applying functions, e.g. apply() / agg() / transform():

      Note: These are not vectorizations, despite the common misconception.

    3. Offloading to cython/numba (see the sketch below)
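
A hedged sketch of those alternatives, using a toy task (squaring every element) purely for illustration:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# 1. Vectorizing / (boolean) indexing: runs in compiled code, no python-level loop.
squared = s ** 2
evens = s[s % 2 == 0]

# 2. Applying functions: convenient, but still a python-level loop under
#    the hood, i.e. not a vectorization despite the common misconception.
applied = s.apply(lambda x: x ** 2)

# 3. Offloading to numba (assumes numba is installed):
# from numba import njit
#
# @njit
# def square(arr):
#     out = np.empty_like(arr)
#     for i in range(arr.shape[0]):
#         out[i] = arr[i] * arr[i]
#     return out
#
# squared = pd.Series(square(s.to_numpy()))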


Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: python==3.9.2, pandas==1.3.1, numpy==1.20.2
Testing data: Series generation code in the snippet below

'''
Note: This is python code in a js snippet, so "run code snippet" will not work.
The snippet is just to avoid cluttering the main post with supplemental code.
'''

import pandas as pd
import numpy as np

n = 100_000  # series length; adjust as needed

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))