
merging multiple tables – pd.concat: append vs yield

Assume we have quite a few .xls or .xlsx files stored in a directory. There are two ways of feeding them into pd.concat to get one big table: yield vs append.

import pandas as pd
from pathlib import Path


def append_method(path):
    container = []
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        container.append(df)
    return container


def yield_method(path):
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        yield df

        
path = Path("...")
        
## append
data = append_method(path)
tables = pd.concat(data)

## yield 
data = yield_method(path)
tables = pd.concat(data)

Judging by the %%timeit magic, both are pretty much the same?

tested on 100 xls/xlsx files

append: 11.8 s ± 667 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 yield: 12.7 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
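The same comparison can be reproduced outside a notebook with the stdlib timeit module, and without any Excel files on disk: the sketch below (synthetic frames stand in for read_excel results, so the numbers isolate the concat step) times pd.concat over a list vs. a generator.

```python
import timeit

import pandas as pd


def make_frames():
    # Synthetic stand-in for reading 100 Excel files.
    for i in range(100):
        yield pd.DataFrame({"a": range(1000), "src": i})


# pd.concat accepts any iterable of DataFrames, including generators.
list_time = timeit.timeit(lambda: pd.concat(list(make_frames())), number=5)
gen_time = timeit.timeit(lambda: pd.concat(make_frames()), number=5)
print(f"list: {list_time:.3f}s  generator: {gen_time:.3f}s")
```

With read_excel removed, the concat step itself should time about the same either way, which matches the results above.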

If there’s a difference between these two, which one should be used?


Answer

The pandas docs note that:

It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension. (https://pandas.pydata.org/docs/user_guide/merging.html)
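The pattern the docs recommend is to build all the frames first and call concat() exactly once, e.g. via a list comprehension. A minimal sketch (the path is hypothetical, and the empty-directory guard is an addition so the snippet runs even with no matching files):

```python
import pandas as pd
from pathlib import Path

path = Path(".")  # hypothetical directory containing .xls/.xlsx files

# Build every frame first, then concatenate exactly once.
frames = [pd.read_excel(f) for f in path.rglob("*.xls*")]
tables = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

The warning in the docs is aimed at calling concat()/append() repeatedly inside the loop; both methods in the question already avoid that by passing the whole collection to a single pd.concat() call.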

Based on your timing results, it looks like read_excel() is the slowest part.

UPDATE: I would use the yield method.

  • the function yield_method() returns a generator, which is exhausted once pd.concat() consumes it. It doesn’t keep the frames alive afterwards, and communicates the intention that it has served its purpose.
  • append_method() returns a list of DataFrames, which keeps every frame referenced after the call to pd.concat() until the list goes out of scope and the garbage collector reclaims it.
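The exhaustion point in the bullets above is easy to demonstrate with small synthetic frames (frames() here is a toy stand-in for yield_method()):

```python
import pandas as pd


def frames():
    # Toy stand-in for yield_method(): yields three one-row frames.
    for i in range(3):
        yield pd.DataFrame({"x": [i]})


gen = frames()
combined = pd.concat(gen)  # consumes the generator
print(list(gen))           # the generator is now empty: prints []
```

A list built by the append approach, by contrast, still holds all three frames after the concat and would let you concatenate them again.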
User contributions licensed under: CC BY-SA