All of the following seem to be working for iterating through the elements of a pandas Series. I’m sure there’s more ways of doing it. What are the differences and which is the best way?
```python
import pandas

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

# 1
for el in arr:
    print(el)

# 2
for _, el in arr.iteritems():
    print(el)

# 3
for el in arr.array:
    print(el)

# 4
for el in arr.values:
    print(el)

# 5
for i in range(len(arr)):
    print(arr.iloc[i])
```
Answer
TL;DR
Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
However, if Series iteration is absolutely necessary, performance will depend on the dtype and index:
| Index | Fastest if numpy dtype | Fastest if pandas dtype | Idiomatic |
|---|---|---|---|
| Unneeded | `in s.to_numpy()` | `in s.array` | `in s` |
| Default | `in enumerate(s.to_numpy())` | `in enumerate(s.array)` | `in s.items()` |
| Custom | `in zip(s.index, s.to_numpy())` | `in s.items()` | `in s.items()` |
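Before reaching for any row in this table, it is worth seeing why iteration is called an antipattern. A minimal sketch, using the question's Series (the squaring operation is just an illustrative stand-in for real element-wise work):

```python
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

# Manual iteration (antipattern): a Python-level loop over every element
squared_loop = pd.Series([el ** 2 for el in s.to_numpy()])

# Vectorized: the same operation dispatched once to compiled numpy code
squared_vec = s ** 2

assert squared_loop.equals(squared_vec)
```

On Series of any real size, the vectorized form is typically orders of magnitude faster because the loop happens in C rather than in the Python interpreter.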
For numpy-based Series, use `s.to_numpy()`
If the Series is a python or numpy dtype, it’s usually fastest to iterate the underlying numpy ndarray:
```python
for el in s.to_numpy():  # if dtype is datetime, int, float, str, string
```
To access the index, it's actually fastest to `enumerate()` or `zip()` the numpy ndarray:

```python
for i, el in enumerate(s.to_numpy()):  # if default range index
for i, el in zip(s.index, s.to_numpy()):  # if custom index
```
Both are faster than the idiomatic `s.items()` / `s.iteritems()`.

To micro-optimize, switch to `s.tolist()` for shorter `int`/`float`/`str` Series:

```python
for el in s.to_numpy():  # if >100K elements
for el in s.tolist():  # to micro-optimize if <100K elements
```

Warning: Do not use `list(s)`, as it doesn't use compiled code, which makes it slower.
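The trade-off above can be checked directly with `timeit`. A sketch (the 100K threshold is approximate and machine-dependent, and the Series size here is illustrative):

```python
import timeit

import numpy as np
import pandas as pd

# A "short" int Series, below the ~100K threshold
s = pd.Series(np.random.randint(1000, size=10_000))

# tolist() converts to native Python ints up front, so the loop avoids
# per-element numpy-scalar boxing; to_numpy() yields numpy scalars.
t_numpy = timeit.timeit(lambda: [el for el in s.to_numpy()], number=100)
t_list = timeit.timeit(lambda: [el for el in s.tolist()], number=100)
print(f"to_numpy: {t_numpy:.3f}s  tolist: {t_list:.3f}s")
```

Both loops visit the same values; only the per-element object type differs.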
For pandas-based Series, use `s.array` or `s.items()`
Pandas extension dtypes contain extra (meta)data, e.g.:
| pandas dtype | contents |
|---|---|
| `Categorical` | 2 arrays |
| `DatetimeTZ` | array + timezone metadata |
| `Interval` | 2 arrays |
| `Period` | array + frequency metadata |
| … | … |
Converting these extension arrays to numpy “may be expensive” since it could involve copying/coercing the data, so:
If the Series is a pandas extension dtype, it’s generally fastest to iterate the underlying pandas array:
```python
for el in s.array:  # if dtype is pandas-only extension
```
For example, with ~100 unique `Categorical` values:

To access the index, the idiomatic `s.items()` is very fast for pandas dtypes:

```python
for i, el in s.items():  # if need index for pandas-only dtype
```
To micro-optimize, switch to the slightly faster `enumerate()` for default-indexed `Categorical` arrays:

```python
for i, el in enumerate(s.array):  # to micro-optimize Categorical dtype if need default range index
```
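Putting the pandas-dtype advice together, a small sketch with a `Categorical` Series (sizes here are illustrative, loosely mirroring the ~100-unique-values benchmark):

```python
import numpy as np
import pandas as pd

# Categorical Series with ~100 unique values
s = pd.Series(np.random.randint(100, size=1000)).astype('category')

# Iterate the pandas extension array directly: no numpy conversion,
# so the Categorical's codes/categories structure is used as-is
first_via_array = next(iter(s.array))

# items() yields (index, value) pairs, idiomatic for pandas dtypes
first_idx, first_val = next(iter(s.items()))

assert first_via_array == first_val and first_idx == 0
```

By contrast, `s.to_numpy()` on a `Categorical` materializes a full object/dense array first, which is the "may be expensive" conversion mentioned above.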
Caveats
- Use `s.to_numpy()` to get the underlying numpy ndarray
- Use `s.array` to get the underlying pandas array
Avoid modifying the iterated Series:
You should never modify something you are iterating over. This is not guaranteed to work in all cases: depending on the data type, the iterator returns a copy rather than a view, and writing to it will have no effect.
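A small demonstration of the copy behavior: with `items()`, each yielded value is an independent scalar, so rebinding it never writes back into the Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Each `el` is a scalar copy; reassigning it cannot mutate the Series
for i, el in s.items():
    el = el * 10  # has no effect on s

assert s.tolist() == [1, 2, 3]  # unchanged
```

If you actually need to transform values, build a new Series (vectorized, or via `apply`) instead of mutating during iteration.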
Avoid iterating manually whenever possible by instead:

- Vectorizing, (boolean) indexing, etc.
- Applying functions, e.g.:

  Note: These are not vectorizations, despite the common misconception.
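As a sketch of function application (the functions and mapping here are illustrative): `apply` and `map` call a Python function or lookup per element, so they loop internally rather than vectorize, but they still beat a hand-written loop in readability.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# apply() runs a Python callable per element (an internal loop, not
# a vectorized operation)
doubled = s.apply(lambda x: x * 2)

# map() substitutes each element via a dict lookup
labeled = s.map({1: 'a', 2: 'b', 3: 'c', 4: 'd'})

assert doubled.tolist() == [2, 4, 6, 8]
assert labeled.tolist() == ['a', 'b', 'c', 'd']
```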
Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: `python==3.9.2`, `pandas==1.3.1`, `numpy==1.20.2`
Testing data: Series generation code in snippet
```python
'''
Note: This is python code in a js snippet, so "run code snippet" will not work.
The snippet is just to avoid cluttering the main post with supplemental code.
'''
import numpy as np
import pandas as pd

n = 100_000  # sample size (assumed here; define as needed)

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))  # randn takes a positional size, not size=
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
```