How to extract Excel PivotCache into Pandas Data Frame?

Question

First time posting here, I apologize if this question has been asked before &#8211; I can&#8217;t find anything that applies. Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I&#8217;ve had an Excel Auto_Open macro that downloads several Excel fil…

Accepted Answer

This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.In essence, the steps one can take to solve this with openpyxl are:Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let’s call it my_pivot_table)Get cached fields and their values from my_pivot_table.cache.cacheFieldsGet rows data as a dict in two sub-steps:3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes from my_pivot_table.cache.cacheFields3.2) Replace cache fields from each record by their actual values, by “joining” cache.records.r and cache.cacheFieldsConvert dict with rows into a pandas DataFrameBelow you will find a copy of the code that implements such solution. Since the structure of these Excel objects are somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I’m adding further below minimal examples of the main objects being manipulated, so people can get a better sense of what is going on, what are the objects being returned, etc.This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.“Bare” codeimport numpy as npimport pandas as pdfrom openpyxl import load_workbookfrom openpyxl.pivot.fields import Missingfile_path = 'path/to/your/file.xlsx'workbook = load_workbook(file_path)worksheet = workbook['Plan1']# Name of desired pivot table (the same name that appears within Excel)pivot_name = 'Tabela dinâmica1'# Extract the pivot table object from the worksheetpivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]# Extract a dict of all cache fields and their respective valuesfields_map = {}for field in pivot_table.cache.cacheFields: if field.sharedItems.count > 0: fields_map[field.name] = [f.v for f in field.sharedItems._fields]# Extract all rows from cache records. Each row is initially parsed as a dictcolumn_names = [field.name for field in pivot_table.cache.cacheFields]rows = []for record in pivot_table.cache.records.r: # If some field in the record in missing, we replace it by NaN record_values = [ field.v if not isinstance(field, Missing) else np.nan for field in record._fields ] row_dict = {k: v for k, v in zip(column_names, record_values)} # Shared fields are mapped as an Index, so we replace the field index by its value for key in fields_map: row_dict[key] = fields_map[key][row_dict[key]] rows.append(row_dict)df = pd.DataFrame.from_dict(rows)Results:>>> df.head(2) FUEL YEAR REGION STATE UNIT Jan Feb (...)0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)Some of the objects detailsObject pivot_tableThis is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:Parameters:name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)Parameters:ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[Parameters: (A LOT OF OMITTED STUFF...)Object fields_map (from cache.cacheFields)This is a dict with column name and their available values:{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0, 2018.0, 2019.0, 2020.0], 'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'], 'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)], 'REGION': ['S', 'NE', 'N', (...)]}Object row_dict (before mapping)Each row is a dict with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields (see above){'YEAR': 0, # <<<--- 0 stands for index in fields_map 'Jan': 10719.983, 'Feb': 12482.281, 'FUEL': 0, # <<<--- index in fields_map 'Dec': 10818.094, 'STATE': 0, # <<<--- index in fields_map (...) 'UNIT': 'm3'}Object row_dict (after mapping)After extracting raw values for cache fields from their indexes, we have a dict that represent all values of a row:{'YEAR': 2000.0, # extracted column value from index in fields_map 'Jan': 10719.983, 'Feb': 12482.281, 'FUEL': 'GASOLINE (m3)', # extracted from fields_map 'Dec': 10818.094, 'STATE': 'TEXAS', # extracted from fields_map (...) 'UNIT': 'm3'}

How to extract Excel PivotCache into Pandas Data Frame?

Advertisement

Answer

“Bare” code

Some of the objects details

Object `pivot_table`

Object `fields_map` (from `cache.cacheFields`)

Object `row_dict` (before mapping)

Object `row_dict` (after mapping)