Skip to content
Advertisement

How to optimize time while converting list to dataframe?(Part II)

I didn’t get any proper answers to my previous question:

How to optimize time while converting list to dataframe?

Let me explain the example more:

Let’s consider the data frame more precisely as

First Name   Last Name    Country  Address      Age  Age-Group  Photo1 Photo2 Phototype
Mark         Shelby        US      Petersburg    42   Adult     1.jpg  2.jpg    PP
Andy         Carnot        GE      Freiburg      16    Teen     1.jpg           PP

I want the output dataframe when converted to csv as

N,Mark,Shelby,US
AG,43,Adult
AD,Petersburg
PH,1.jpg,PP
PH,2.jpg,PP
N,Andy,Carnot,GE
AG,16,Teen
AD,Freiburg
PH,1.jpg,PP

The character PH,AG, AD,N should not be mapped. It can be any character.

This works fine while looping over this list and mapping and converting to a data frame. But, this took a lot of time for large datasets. The exact code for this process is in the previous question

[['N','First Name','Last Name', 'Country'],
 ['AG','Age','Age-Group'],
 ['AD','Address'],
 ['PH','Photo1','Phototype'],
 ['PH','Photo2','Phototype'], 
 ]

Advertisement

Answer

Use:

print (df)
  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  
0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   
1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN   

  Phototype  
0        PP  
1        PP

First is defined dictionary for keys by first value in final list with all columns strating with strings:

d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photo','Phototype']}

Then filter DataFrames by lists from dictionary:

out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}

For PH is necessary melting for change format:

out['PH'] = (out['PH'].melt('Phototype', 
                           value_name='Photo',
                           ignore_index=False)
                      .drop('variable',1)[['Photo','Phototype']]
                      .dropna(subset=['Photo']))

Last create same columns and join by concat with sorting for correct ordering:

out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}

df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')
print (df)
  level_0           0       1   2
0       N        Mark  Shelby  US
0      AG          42   Adult    
0      AD  Petersburg            
0      PH       1.jpg      PP    
0      PH       2.jpg      PP    
1       N        Andy  Carnot  GE
1      AG          16    Teen    
1      AD    Freiburg            
1      PH       1.jpg      PP   

Last create lists with different lengths by remove empty strings:

fin = [x[x!= ''].tolist() for x in df.to_numpy() ]
print (fin)
[['N', 'Mark', 'Shelby', 'US'],
 ['AG', 42, 'Adult'],
 ['AD', 'Petersburg'], 
 ['PH', '1.jpg', 'PP'], 
 ['PH', '2.jpg', 'PP'], 
 ['N', 'Andy', 'Carnot', 'GE'], 
 ['AG', 16, 'Teen'], 
 ['AD', 'Freiburg'], 
 ['PH', '1.jpg', 'PP']]

EDIT: For match Photo with digits is use regex, so instead startswith is used contains with joined values of lists by | for regex OR:

d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photod+','Phototype']}

out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out)
{'N':   First Name Last Name Country
0       Mark    Shelby      US
1       Andy    Carnot      GE, 'AG':    Age Age-Group
0   42     Adult
1   16      Teen, 'AD':       Address
0  Petersburg
1    Freiburg, 'PH':   Photo1 Photo2 Phototype
0  1.jpg  2.jpg        PP
1  1.jpg    NaN        PP}

EDIT: Trick is add ^ to start of strings and $ to end of string for exact match values, then is necessary for correct working Photo+ ‘digit’:

print (df)
  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  
0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   
1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN   

  Phototype Age Detail Address Detail  
0        PP      Young            Far  
1        PP  Too Young           Near  


d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photod+','Phototype']}

d = {k: [rf'^{x}$' for x in v] for k, v in d.items()}
print (d)
{'N': ['^First Name$', '^Last Name$', '^Country$'], 
 'AG': ['^Age$', '^Age-Group$'], 
 'AD': ['^Address$'], 
 'PH': ['^Photo\d+$', '^Phototype$']}

out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}

print (out['AG'])
   Age Age-Group
0   42     Adult
1   16      Teen

print (out['AD'])
      Address
0  Petersburg
1    Freiburg
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement