I didn’t get any proper answers to my previous question:
How to optimize time while converting list to dataframe?
Let me explain the example more:
Let’s consider the data frame more precisely as
First Name Last Name Country Address Age Age-Group Photo1 Photo2 Phototype Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg PP Andy Carnot GE Freiburg 16 Teen 1.jpg PP
I want the output dataframe when converted to csv as
N,Mark,Shelby,US AG,43,Adult AD,Petersburg PH,1.jpg,PP PH,2.jpg,PP N,Andy,Carnot,GE AG,16,Teen AD,Freiburg PH,1.jpg,PP
The character PH,AG, AD,N should not be mapped. It can be any character.
This works fine while looping over this list and mapping and converting to a data frame. But, this took a lot of time for large datasets. The exact code for this process is in the previous question
[['N','First Name','Last Name', 'Country'], ['AG','Age','Age-Group'], ['AD','Address'], ['PH','Photo1','Phototype'], ['PH','Photo2','Phototype'], ]
Advertisement
Answer
Use:
print (df) First Name Last Name Country Address Age Age-Group Photo1 Photo2 0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg 1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN Phototype 0 PP 1 PP
First is defined dictionary for keys by first value in final list with all columns strating with strings:
d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photo','Phototype']}
Then filter DataFrames by lists from dictionary:
out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}
For PH is necessary melting for change format:
out['PH'] = (out['PH'].melt('Phototype', 
                           value_name='Photo',
                           ignore_index=False)
                      .drop('variable',1)[['Photo','Phototype']]
                      .dropna(subset=['Photo']))
Last create same columns and join by concat with sorting for correct ordering:
out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}
df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')
print (df)
  level_0           0       1   2
0       N        Mark  Shelby  US
0      AG          42   Adult    
0      AD  Petersburg            
0      PH       1.jpg      PP    
0      PH       2.jpg      PP    
1       N        Andy  Carnot  GE
1      AG          16    Teen    
1      AD    Freiburg            
1      PH       1.jpg      PP   
Last create lists with different lengths by remove empty strings:
fin = [x[x!= ''].tolist() for x in df.to_numpy() ] print (fin) [['N', 'Mark', 'Shelby', 'US'], ['AG', 42, 'Adult'], ['AD', 'Petersburg'], ['PH', '1.jpg', 'PP'], ['PH', '2.jpg', 'PP'], ['N', 'Andy', 'Carnot', 'GE'], ['AG', 16, 'Teen'], ['AD', 'Freiburg'], ['PH', '1.jpg', 'PP']]
EDIT: For match Photo with digits is use regex, so instead startswith is used contains with joined values of lists by | for regex OR:
d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photod+','Phototype']}
out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out)
{'N':   First Name Last Name Country
0       Mark    Shelby      US
1       Andy    Carnot      GE, 'AG':    Age Age-Group
0   42     Adult
1   16      Teen, 'AD':       Address
0  Petersburg
1    Freiburg, 'PH':   Photo1 Photo2 Phototype
0  1.jpg  2.jpg        PP
1  1.jpg    NaN        PP}
EDIT: Trick is add ^ to start of strings and $ to end of string for exact match values, then is necessary for correct working Photo+ ‘digit’:
print (df)
  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  
0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   
1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN   
  Phototype Age Detail Address Detail  
0        PP      Young            Far  
1        PP  Too Young           Near  
d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photod+','Phototype']}
d = {k: [rf'^{x}$' for x in v] for k, v in d.items()}
print (d)
{'N': ['^First Name$', '^Last Name$', '^Country$'], 
 'AG': ['^Age$', '^Age-Group$'], 
 'AD': ['^Address$'], 
 'PH': ['^Photo\d+$', '^Phototype$']}
out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out['AG'])
   Age Age-Group
0   42     Adult
1   16      Teen
print (out['AD'])
      Address
0  Petersburg
1    Freiburg
