merging multiple tables – pd.concat: append vs yield

Question

Assume we have quite a few .xls or .xlsx files stored in a directory, and there are two ways of feeding them into pd.concat to get one big table: yield vs append. Judging by the %%timeit magic, both are pretty much the same (tested on 100 xls/xlsx files). If there's a difference between these two, which on…

Accepted Answer

The pandas docs note that:

"It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension."

https://pandas.pydata.org/docs/user_guide/merging.html

Based on your timing results, it looks like read_excel() is the slowest part.

UPDATE: I would use the yield method.

The function yield_method() returns a generator, which is empty (exhausted) after invoking pd.concat(). It doesn't take space/resources, and communicates the intention that it has served its purpose.

The function append_method() returns a list of data frames, which will continue to consume space from the call to pd.concat() until the garbage collector runs.
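For reference, here is a minimal sketch of what the two approaches might look like. The function names yield_method and append_method come from the discussion above, but their bodies, the combine helper, and the pattern and reader parameters are assumptions made for illustration, not the asker's actual code.

```python
from pathlib import Path

import pandas as pd


def append_method(directory, pattern="*.xls*", reader=pd.read_excel):
    """Read every matching file up front and return the frames as a list."""
    frames = []
    for path in sorted(Path(directory).glob(pattern)):
        frames.append(reader(path))
    return frames  # the list keeps every frame alive while it is referenced


def yield_method(directory, pattern="*.xls*", reader=pd.read_excel):
    """Return a generator that reads one file per iteration step."""
    for path in sorted(Path(directory).glob(pattern)):
        yield reader(path)  # nothing is read until the generator is consumed


def combine(frames):
    """Hypothetical helper: one concat call over all frames, not one per file."""
    return pd.concat(frames, ignore_index=True)
```

Either result can be passed straight to the helper, for example combine(yield_method("some_dir")) or combine(append_method("some_dir")), where "some_dir" is a placeholder path. In both cases pd.concat is called once, so the full-copy cost mentioned in the docs is paid a single time, and the file reads are likely to dominate the run time either way.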

Answer