
merging multiple tables – pd.concat: append vs yield

Assume we have quite a few .xls or .xlsx files stored in a directory, and there are two ways of feeding them into pd.concat to get one big table: building a list with append, or producing the data frames lazily with yield.
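The question's own listings are not shown here, so the following is only a minimal sketch of what the two approaches could look like. The function names yield_method and append_method come from the answer below; the excel_paths helper and the reader parameter are assumptions added for illustration (reader defaults to pd.read_excel, which needs an Excel engine such as openpyxl installed).

```python
from pathlib import Path

import pandas as pd


def excel_paths(directory):
    """Hypothetical helper: sorted .xls/.xlsx paths found in a directory."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.suffix.lower() in {".xls", ".xlsx"}
    )


def yield_method(paths, reader=pd.read_excel):
    """Generator version: produce one data frame at a time."""
    for path in paths:
        yield reader(path)


def append_method(paths, reader=pd.read_excel):
    """List version: collect every data frame, then return the list."""
    frames = []
    for path in paths:
        frames.append(reader(path))
    return frames
```

Either result can be passed straight to pd.concat, for example pd.concat(yield_method(excel_paths("data")), ignore_index=True), where "data" is a placeholder directory name.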


Judging by the %%timeit magic, both perform pretty much the same.

(Tested on 100 .xls/.xlsx files.)


If there’s a difference between these two, which one should be used?


Answer

The pandas docs note that:

"It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension."

Source: https://pandas.pydata.org/docs/user_guide/merging.html
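To make the docs' warning concrete, here is a small sketch that contrasts calling pd.concat repeatedly inside a loop with calling it once on the whole collection. The tiny in-memory frames are stand-ins for the Excel files and are an assumption for illustration only.

```python
import pandas as pd

# Three small frames standing in for the files read from disk.
chunks = [pd.DataFrame({"x": range(start, start + 3)}) for start in (0, 3, 6)]

# Repeated concat: each iteration copies everything accumulated so far.
result_loop = chunks[0]
for chunk in chunks[1:]:
    result_loop = pd.concat([result_loop, chunk], ignore_index=True)

# Single concat: all frames are combined in one call, with one copy.
result_once = pd.concat(chunks, ignore_index=True)
```

Both variants build the same table; the difference is how many times the data gets copied. Neither yield_method() nor append_method() falls into the repeated-concat pattern, since each hands the full collection to a single pd.concat call.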

Based on your timing results, it looks like read_excel() is the slowest part.

UPDATE: I would use the yield method.

  • yield_method() returns a generator, which is exhausted once pd.concat() has consumed it. It holds no references to the data frames afterwards, and it communicates the intention that it has served its purpose.
  • append_method() returns a list of data frames. If you keep that list in a variable, the frames continue to occupy memory after the call to pd.concat(), until the list is deleted or goes out of scope and its contents can be reclaimed.
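The difference described in the two points above can be observed directly. This is a minimal sketch using a hypothetical make_frames() generator with tiny in-memory frames in place of real Excel reads.

```python
import pandas as pd


def make_frames():
    """Stand-in for reading files: yields three one-row frames."""
    for i in range(3):
        yield pd.DataFrame({"file_id": [i]})


# Generator: pd.concat consumes it, so nothing is left afterwards.
gen = make_frames()
combined_from_gen = pd.concat(gen, ignore_index=True)
leftover = list(gen)  # the generator is already exhausted

# List: the variable still references every frame after the concat.
frames = list(make_frames())
combined_from_list = pd.concat(frames, ignore_index=True)
still_held = len(frames)
```

Note that this only concerns what stays alive after pd.concat() returns. While the concatenation itself runs, all input frames have to exist in memory in either case, which is consistent with the two methods timing about the same.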
User contributions licensed under: CC BY-SA