All of the following seem to be working for iterating through the elements of a pandas Series. I’m sure there’s more ways of doing it. What are the differences and which is the best way?
```python
import pandas

arr = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3])

# 1
for el in arr:
    print(el)

# 2
for _, el in arr.iteritems():
    print(el)

# 3
for el in arr.array:
    print(el)

# 4
for el in arr.values:
    print(el)

# 5
for i in range(len(arr)):
    print(arr.iloc[i])
```
Answer
TL;DR
Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
However, if Series iteration is absolutely necessary, performance will depend on the dtype and index:
| Index | Fastest if numpy dtype | Fastest if pandas dtype | Idiomatic |
|---|---|---|---|
| Unneeded | `in s.to_numpy()` | `in s.array` | `in s` |
| Default | `in enumerate(s.to_numpy())` | `in enumerate(s.array)` | `in s.items()` |
| Custom | `in zip(s.index, s.to_numpy())` | `in s.items()` | `in s.items()` |
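Before reaching for any row in this table, it is worth seeing why iteration is called an antipattern. A minimal sketch, using the question's Series (the squaring operation is just an illustrative stand-in for real element-wise work):

```python
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3])

# Manual iteration (antipattern): a Python-level loop over every element
squared_loop = pd.Series([el ** 2 for el in s.to_numpy()])

# Vectorized: the same operation dispatched once to compiled numpy code
squared_vec = s ** 2

assert squared_loop.equals(squared_vec)
```

On Series of any real size, the vectorized form is typically orders of magnitude faster because the loop happens in C rather than in the Python interpreter.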
For numpy-based Series, use `s.to_numpy()`
If the Series is a python or numpy dtype, it’s usually fastest to iterate the underlying numpy ndarray:
```python
for el in s.to_numpy():  # if dtype is datetime, int, float, str, string
```
To access the index, it's actually fastest to `enumerate()` or `zip()` the numpy ndarray:

```python
for i, el in enumerate(s.to_numpy()):  # if default range index
for i, el in zip(s.index, s.to_numpy()):  # if custom index
```
Both are faster than the idiomatic `s.items()` / `s.iteritems()`.

To micro-optimize, switch to `s.tolist()` for shorter `int`/`float`/`str` Series:

```python
for el in s.to_numpy():  # if >100K elements
for el in s.tolist():  # to micro-optimize if <100K elements
```

Warning: Do not use `list(s)`, as it doesn't use compiled code, which makes it slower.
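The trade-off above can be checked directly with `timeit`. A sketch (the 100K threshold is approximate and machine-dependent, and the Series size here is illustrative):

```python
import timeit

import numpy as np
import pandas as pd

# A "short" int Series, below the ~100K threshold
s = pd.Series(np.random.randint(1000, size=10_000))

# tolist() converts to native Python ints up front, so the loop avoids
# per-element numpy-scalar boxing; to_numpy() yields numpy scalars.
t_numpy = timeit.timeit(lambda: [el for el in s.to_numpy()], number=100)
t_list = timeit.timeit(lambda: [el for el in s.tolist()], number=100)
print(f"to_numpy: {t_numpy:.3f}s  tolist: {t_list:.3f}s")
```

Both loops visit the same values; only the per-element object type differs.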
For pandas-based Series, use `s.array` or `s.items()`
Pandas extension dtypes contain extra (meta)data, e.g.:
| pandas dtype | contents |
|---|---|
| `Categorical` | 2 arrays |
| `DatetimeTZ` | array + timezone metadata |
| `Interval` | 2 arrays |
| `Period` | array + frequency metadata |
| … | … |
Converting these extension arrays to numpy “may be expensive” since it could involve copying/coercing the data, so:
If the Series is a pandas extension dtype, it’s generally fastest to iterate the underlying pandas array:
```python
for el in s.array:  # if dtype is pandas-only extension
```
For example, with ~100 unique `Categorical` values:

To access the index, the idiomatic `s.items()` is very fast for pandas dtypes:

```python
for i, el in s.items():  # if need index for pandas-only dtype
```
To micro-optimize, switch to the slightly faster `enumerate()` for default-indexed `Categorical` arrays:

```python
for i, el in enumerate(s.array):  # to micro-optimize Categorical dtype if need default range index
```
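Putting the pandas-dtype advice together, a small sketch with a `Categorical` Series (sizes here are illustrative, loosely mirroring the ~100-unique-values benchmark):

```python
import numpy as np
import pandas as pd

# Categorical Series with ~100 unique values
s = pd.Series(np.random.randint(100, size=1000)).astype('category')

# Iterate the pandas extension array directly: no numpy conversion,
# so the Categorical's codes/categories structure is used as-is
first_via_array = next(iter(s.array))

# items() yields (index, value) pairs, idiomatic for pandas dtypes
first_idx, first_val = next(iter(s.items()))

assert first_via_array == first_val and first_idx == 0
```

By contrast, `s.to_numpy()` on a `Categorical` materializes a full object/dense array first, which is the "may be expensive" conversion mentioned above.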
Caveats
- Use `s.to_numpy()` to get the underlying numpy ndarray
- Use `s.array` to get the underlying pandas array
Avoid modifying the iterated Series:
You should never modify something you are iterating over. This is not guaranteed to work in all cases: depending on the data type, the iterator returns a copy rather than a view, and writing to it will have no effect.
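A small demonstration of the copy behavior: with `items()`, each yielded value is an independent scalar, so rebinding it never writes back into the Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Each `el` is a scalar copy; reassigning it cannot mutate the Series
for i, el in s.items():
    el = el * 10  # has no effect on s

assert s.tolist() == [1, 2, 3]  # unchanged
```

If you actually need to transform values, build a new Series (vectorized, or via `apply`) instead of mutating during iteration.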
Avoid iterating manually whenever possible by instead:

- Vectorizing, (boolean) indexing, etc.
- Applying functions, e.g.:

  Note: These are not vectorizations, despite the common misconception.
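As a sketch of function application (the functions and mapping here are illustrative): `apply` and `map` call a Python function or lookup per element, so they loop internally rather than vectorize, but they still beat a hand-written loop in readability.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# apply() runs a Python callable per element (an internal loop, not
# a vectorized operation)
doubled = s.apply(lambda x: x * 2)

# map() substitutes each element via a dict lookup
labeled = s.map({1: 'a', 2: 'b', 3: 'c', 4: 'd'})

assert doubled.tolist() == [2, 4, 6, 8]
assert labeled.tolist() == ['a', 'b', 'c', 'd']
```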
Specs: ThinkPad X1 Extreme Gen 3 (Core i7-10850H 2.70GHz, 32GB DDR4 2933MHz)
Versions: `python==3.9.2`, `pandas==1.3.1`, `numpy==1.20.2`
Testing data: Series generation code in snippet
```python
'''
Note: This is python code in a js snippet, so "run code snippet" will not work.
The snippet is just to avoid cluttering the main post with supplemental code.
'''
import numpy as np
import pandas as pd

n = 100_000  # sample size (assumed here; define as needed)

int_series = pd.Series(np.random.randint(1000000000, size=n))
float_series = pd.Series(np.random.randn(n))  # randn takes a positional size, not size=
floatnan_series = pd.Series(np.random.choice([np.nan, np.inf]*n + np.random.randn(n).tolist(), size=n))
str_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype(str)
string_series = pd.Series(np.random.randint(10000000000000000, size=n)).astype('string')
datetime_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01'), size=n))
datetimetz_series = pd.Series(np.random.choice(pd.date_range('2000-01-01', '2021-01-01', tz='CET'), size=n))
categorical_series = pd.Series(np.random.randint(100, size=n)).astype('category')
interval_series = pd.Series(pd.arrays.IntervalArray.from_arrays(-np.random.random(size=n), np.random.random(size=n)))
period_series = pd.Series(pd.period_range(end='2021-01-01', periods=n, freq='s'))
```