Assume we have quite a few .xls or .xlsx files stored in a directory, and there are two ways of feeding them into pd.concat to get one big table: yield vs. append.
```python
import pandas as pd
from pathlib import Path

def append_method(path):
    container = []
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        container.append(df)
    return container

def yield_method(path):
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        yield df

path = Path("...")

## append
data = append_method(path)
tables = pd.concat(data)

## yield
data = yield_method(path)
tables = pd.concat(data)
```
Judging by the %%timeit magic, both are pretty much the same (tested on 100 xls/xlsx files):
```
append: 11.8 s ± 667 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
yield:  12.7 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
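One way to see that the container choice barely matters here is to time both patterns with in-memory frames, so the Excel I/O is taken out of the picture (a sketch; the frames below are hypothetical stand-ins for what pd.read_excel() would return):

```python
import timeit

import pandas as pd

# stand-ins for 100 workbooks that have already been read
frames = [pd.DataFrame({"x": range(100)}) for _ in range(100)]

def via_list():
    # materialize a list, then concatenate once
    return pd.concat(list(frames))

def via_generator():
    # feed a generator straight into pd.concat
    return pd.concat(f for f in frames)

t_list = timeit.timeit(via_list, number=20)
t_gen = timeit.timeit(via_generator, number=20)
# both timings are dominated by concat itself; the list-vs-generator
# overhead is noise compared to reading the files from disk
```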
If there’s a difference between these two, which one should be used?
Answer
The pandas docs note that:
> It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

https://pandas.pydata.org/docs/user_guide/merging.html
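The pattern the docs warn about is growing the result inside the loop, not which container you pass to a single pd.concat() call. A minimal sketch with in-memory frames (stand-ins for the Excel reads) contrasting the two:

```python
import pandas as pd

# stand-ins for the frames that pd.read_excel() would return
frames = [pd.DataFrame({"x": [i]}) for i in range(5)]

# recommended: collect first, then concatenate exactly once
tables = pd.concat(frames, ignore_index=True)

# anti-pattern: concatenating inside the loop re-copies all
# previously accumulated rows on every iteration
grown = pd.DataFrame()
for f in frames:
    grown = pd.concat([grown, f], ignore_index=True)
```

Both produce the same table; the difference is that the loop version does O(n²) copying as the result grows.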
Based on your timing results, it looks like read_excel()
is the slowest part.
UPDATE: I would use the yield method.
- yield_method() returns a generator, which is exhausted once pd.concat() has consumed it. It doesn't hold on to the data frames, and it communicates that it has served its purpose.
- append_method() returns a list of data frames, which keeps consuming memory after the call to pd.concat() until the list goes out of scope and the garbage collector reclaims it.
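The "served its purpose" point can be seen with any generator: once consumed, iterating it again yields nothing, whereas a list keeps all its elements alive until it is dropped. A minimal sketch:

```python
def numbers():
    # stand-in for yield_method(): yields items one at a time
    for i in range(3):
        yield i

g = numbers()
consumed = list(g)   # first pass drains the generator
leftover = list(g)   # second pass yields nothing: it is exhausted

as_list = [i for i in range(3)]
# the list, by contrast, can be iterated repeatedly and keeps
# every element in memory until it is garbage collected
```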