I didn’t get any proper answers to my previous question:
How to optimize time while converting list to dataframe?
Let me explain the example more:
Let’s consider the data frame more precisely as
First Name Last Name Country Address Age Age-Group Photo1 Photo2 Phototype Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg PP Andy Carnot GE Freiburg 16 Teen 1.jpg PP
I want the output dataframe when converted to csv as
N,Mark,Shelby,US AG,43,Adult AD,Petersburg PH,1.jpg,PP PH,2.jpg,PP N,Andy,Carnot,GE AG,16,Teen AD,Freiburg PH,1.jpg,PP
The character PH,AG, AD,N should not be mapped. It can be any character.
This works fine while looping over this list and mapping and converting to a data frame. But, this took a lot of time for large datasets. The exact code for this process is in the previous question
[['N','First Name','Last Name', 'Country'], ['AG','Age','Age-Group'], ['AD','Address'], ['PH','Photo1','Phototype'], ['PH','Photo2','Phototype'], ]
Advertisement
Answer
Use:
print (df) First Name Last Name Country Address Age Age-Group Photo1 Photo2 0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg 1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN Phototype 0 PP 1 PP
First is defined dictionary for keys by first value in final list with all columns strating with strings:
d = {'N':['First Name','Last Name', 'Country'], 'AG':['Age','Age-Group'], 'AD':['Address'], 'PH':['Photo','Phototype']}
Then filter DataFrames by lists from dictionary:
out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}
For PH
is necessary melting for change format:
out['PH'] = (out['PH'].melt('Phototype', value_name='Photo', ignore_index=False) .drop('variable',1)[['Photo','Phototype']] .dropna(subset=['Photo']))
Last create same columns and join by concat
with sorting for correct ordering:
out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()} df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('') print (df) level_0 0 1 2 0 N Mark Shelby US 0 AG 42 Adult 0 AD Petersburg 0 PH 1.jpg PP 0 PH 2.jpg PP 1 N Andy Carnot GE 1 AG 16 Teen 1 AD Freiburg 1 PH 1.jpg PP
Last create lists with different lengths by remove empty strings:
fin = [x[x!= ''].tolist() for x in df.to_numpy() ] print (fin) [['N', 'Mark', 'Shelby', 'US'], ['AG', 42, 'Adult'], ['AD', 'Petersburg'], ['PH', '1.jpg', 'PP'], ['PH', '2.jpg', 'PP'], ['N', 'Andy', 'Carnot', 'GE'], ['AG', 16, 'Teen'], ['AD', 'Freiburg'], ['PH', '1.jpg', 'PP']]
EDIT: For match Photo
with digits is use regex, so instead startswith
is used contains
with joined values of lists by |
for regex OR
:
d = {'N':['First Name','Last Name', 'Country'], 'AG':['Age','Age-Group'], 'AD':['Address'], 'PH':['Photod+','Phototype']} out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()} print (out) {'N': First Name Last Name Country 0 Mark Shelby US 1 Andy Carnot GE, 'AG': Age Age-Group 0 42 Adult 1 16 Teen, 'AD': Address 0 Petersburg 1 Freiburg, 'PH': Photo1 Photo2 Phototype 0 1.jpg 2.jpg PP 1 1.jpg NaN PP}
EDIT: Trick is add ^
to start of strings and $
to end of string for exact match values, then is necessary for correct working Photo
+ ‘digit’:
print (df) First Name Last Name Country Address Age Age-Group Photo1 Photo2 0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg 1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN Phototype Age Detail Address Detail 0 PP Young Far 1 PP Too Young Near d = {'N':['First Name','Last Name', 'Country'], 'AG':['Age','Age-Group'], 'AD':['Address'], 'PH':['Photod+','Phototype']} d = {k: [rf'^{x}$' for x in v] for k, v in d.items()} print (d) {'N': ['^First Name$', '^Last Name$', '^Country$'], 'AG': ['^Age$', '^Age-Group$'], 'AD': ['^Address$'], 'PH': ['^Photo\d+$', '^Phototype$']} out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()} print (out['AG']) Age Age-Group 0 42 Adult 1 16 Teen print (out['AD']) Address 0 Petersburg 1 Freiburg