How to optimize time while converting list to dataframe?(Part II)

Question

I didn't get any proper answers to my previous question: How to optimize time while converting list to dataframe? Let me explain the example more: Let's consider the data frame more precisely as I want the output dataframe when converted to csv as The character PH,AG, AD,N should not be mapped. It can be any character. This works fine while

Accepted Answer

Use:print (df)  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN     Phototype  0        PP  1        PPFirst is defined dictionary for keys by first value in final list with all columns strating with strings:d = {'N':['First Name','Last Name', 'Country'],     'AG':['Age','Age-Group'],     'AD':['Address'],     'PH':['Photo','Phototype']}Then filter DataFrames by lists from dictionary:out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}For PH is necessary melting for change format:out['PH'] = (out['PH'].melt('Phototype',                            value_name='Photo',                           ignore_index=False)                      .drop('variable',1)[['Photo','Phototype']]                      .dropna(subset=['Photo']))Last create same columns and join by concat with sorting for correct ordering:out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')print (df)  level_0           0       1   20       N        Mark  Shelby  US0      AG          42   Adult    0      AD  Petersburg            0      PH       1.jpg      PP    0      PH       2.jpg      PP    1       N        Andy  Carnot  GE1      AG          16    Teen    1      AD    Freiburg            1      PH       1.jpg      PP   Last create lists with different lengths by remove empty strings:fin = [x[x!= ''].tolist() for x in df.to_numpy() ]print (fin)[['N', 'Mark', 'Shelby', 'US'], ['AG', 42, 'Adult'], ['AD', 'Petersburg'],  ['PH', '1.jpg', 'PP'],  ['PH', '2.jpg', 'PP'],  ['N', 'Andy', 'Carnot', 'GE'],  ['AG', 16, 'Teen'],  ['AD', 'Freiburg'],  ['PH', '1.jpg', 'PP']]EDIT: For match Photo with digits is use regex, so instead startswith is used contains with joined values of lists by | for regex OR:d = {'N':['First Name','Last Name', 'Country'],     'AG':['Age','Age-Group'],     'AD':['Address'],     'PH':['Photod+','Phototype']}out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}print (out){'N':   First Name Last Name Country0       Mark    Shelby      US1       Andy    Carnot      GE, 'AG':    Age Age-Group0   42     Adult1   16      Teen, 'AD':       Address0  Petersburg1    Freiburg, 'PH':   Photo1 Photo2 Phototype0  1.jpg  2.jpg        PP1  1.jpg    NaN        PP}EDIT: Trick is add ^ to start of strings and $ to end of string for exact match values, then is necessary for correct working Photo+ &#8216;digit&#8217;:print (df)  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN     Phototype Age Detail Address Detail  0        PP      Young            Far  1        PP  Too Young           Near  d = {'N':['First Name','Last Name', 'Country'],     'AG':['Age','Age-Group'],     'AD':['Address'],     'PH':['Photod+','Phototype']}d = {k: [rf'^{x}$' for x in v] for k, v in d.items()}print (d){'N': ['^First Name$', '^Last Name$', '^Country$'],  'AG': ['^Age$', '^Age-Group$'],  'AD': ['^Address$'],  'PH': ['^Photo\d+$', '^Phototype$']}out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}print (out['AG'])   Age Age-Group0   42     Adult1   16      Teenprint (out['AD'])      Address0  Petersburg1    Freiburg

Advertisement

Answer