First time posting here, I apologize if this question has been asked before – I can’t find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a Pandas Data Frame? For several years I’ve had an Excel Auto_Open macro that downloads several Excel files and double clicks on the “Grand Total” row in order to extract all of the data, which ultimate gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default – the pivotcache must be extracted to a new sheet first. What I would like to do, if possible, is read from the Excel PivotCache into a Pandas Data Frame and either save that output as a CSV or load it directly into our database. It seems that this is not capable with OpenPyXl and that I’ll probably need to use win32com.client.
Does anybody have any experience with this, and know if it’s even possible? Any pointers for where I might get started? I’ve tried several items from the Excel Object model (PivotCache, GetData, etc etc) but either I don’t know how to use them or they don’t return what I need.
Any help would be much appreciated. Thanks!
Advertisement
Answer
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl
are:
- Use
openpyxl
to get theopenpyxl.pivot.table.TableDefinition
object from the desired pivot table (let’s call itmy_pivot_table
) - Get cached fields and their values from
my_pivot_table.cache.cacheFields
- Get rows data as a
dict
in two sub-steps:- 3.1) Get all cached rows and their values from
my_pivot_table.cache.records.r
. Cache fields in these records are stored as indexes frommy_pivot_table.cache.cacheFields
- 3.2) Replace cache fields from each record by their actual values, by “joining”
cache.records.r
andcache.cacheFields
- 3.1) Get all cached rows and their values from
- Convert
dict
with rows into apandas
DataFrame
Below you will find a copy of the code that implements such solution. Since the structure of these Excel objects are somewhat complex, the code will probably look very cryptic (sorry about that). To address this, I’m adding further below minimal examples of the main objects being manipulated, so people can get a better sense of what is going on, what are the objects being returned, etc.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.
“Bare” code
import numpy as np import pandas as pd from openpyxl import load_workbook from openpyxl.pivot.fields import Missing file_path = 'path/to/your/file.xlsx' workbook = load_workbook(file_path) worksheet = workbook['Plan1'] # Name of desired pivot table (the same name that appears within Excel) pivot_name = 'Tabela dinĂ¢mica1' # Extract the pivot table object from the worksheet pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0] # Extract a dict of all cache fields and their respective values fields_map = {} for field in pivot_table.cache.cacheFields: if field.sharedItems.count > 0: fields_map[field.name] = [f.v for f in field.sharedItems._fields] # Extract all rows from cache records. Each row is initially parsed as a dict column_names = [field.name for field in pivot_table.cache.cacheFields] rows = [] for record in pivot_table.cache.records.r: # If some field in the record in missing, we replace it by NaN record_values = [ field.v if not isinstance(field, Missing) else np.nan for field in record._fields ] row_dict = {k: v for k, v in zip(column_names, record_values)} # Shared fields are mapped as an Index, so we replace the field index by its value for key in fields_map: row_dict[key] = fields_map[key][row_dict[key]] rows.append(row_dict) df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2) FUEL YEAR REGION STATE UNIT Jan Feb (...) 0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...) 1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
Some of the objects details
Object pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition
. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object> Parameters: name='Tabela dinĂ¢mica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...) Parameters: ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object> Parameters: (A LOT OF OMITTED STUFF...)
Object fields_map
(from cache.cacheFields
)
This is a dict
with column name and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0, 2018.0, 2019.0, 2020.0], 'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'], 'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)], 'REGION': ['S', 'NE', 'N', (...)]}
Object row_dict
(before mapping)
Each row is a dict
with column names and their values. Raw values for cache fields are not stored here. Here they are represented by their indexes in cache.cacheFields
(see above)
{'YEAR': 0, # <<<--- 0 stands for index in fields_map 'Jan': 10719.983, 'Feb': 12482.281, 'FUEL': 0, # <<<--- index in fields_map 'Dec': 10818.094, 'STATE': 0, # <<<--- index in fields_map (...) 'UNIT': 'm3'}
Object row_dict
(after mapping)
After extracting raw values for cache fields from their indexes, we have a dict
that represent all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map 'Jan': 10719.983, 'Feb': 12482.281, 'FUEL': 'GASOLINE (m3)', # extracted from fields_map 'Dec': 10818.094, 'STATE': 'TEXAS', # extracted from fields_map (...) 'UNIT': 'm3'}