How to extract elements from a filename and move them to different columns?

Tags: , , , ,



I have a filenames which I converted into a list. The list has the following elements:

list = ['15253_Variation.JPG',
 '15253_Variation_Tainted.JPG',
 '15253_Variation_O2_Saxophone.PNG',
 '15253_Variation_O2_Saxophone.jpg',
 '15253_Variation_O2_Saxophone_reference.png',
 '15253_Variation_Side1.JPG',
 '15253_Variation_Side2.JPG']

My goal is to extract elements from this list and fill out a dataframe, which should look like this:

enter image description here

LINK TO THE GOOGLE SHEETS CONTAINING THE IMAGE ABOVE: https://docs.google.com/spreadsheets/d/1kuX3M4RFCNWtNoE7Hm1ejxWMwF-Cs4p8SsjA3JzdidA/edit?usp=sharing

WHAT I’VE DONE SO FAR is the following code:

Obj = pd.DataFrame(data = list, index = None, columns = ['file'])
new_list = []
for i in Obj['file']:
       new_list.append(i.split('_'))

But, this one does not leave empty spaces thus not doing what I needed.

Thank you very much in advanced.

Answer

As per number of the comments. It’s a pain because the tokens in the filename are not fully fixed format. Quite a lot of conditional logic

  1. have defined two additional lists mi instruments and oxygen whatever it is.
  2. first pass is in building dict that is pandas standard format
  3. then work through conditional logic after have base data frame
# don't name it list - it override python list()!
l = ['15253_Variation.JPG',
 '15253_Variation_Tainted.JPG',
 '15253_Variation_O2_Saxophone.PNG',
 '15253_Variation_O2_Saxophone.jpg',
 '15253_Variation_O2_Saxophone_reference.png',
 '15253_Variation_Side1.JPG',
 '15253_Variation_Side2.JPG']

issues = ["Tainted","Perfect"]
mi = ["Saxophone"]
oxygen = ["O2"]

# first pass using dict/list comprehensions
df = pd.DataFrame({"filename":{i:f.split(".")[0] for i,f in enumerate(l)},
             "Number":{i:f.split("_")[0] for i,f in enumerate(l)}, 
              "Name":{i:f.split("_")[1].split(".")[0] for i,f in enumerate(l)}, 
              "Location2":{},
              "Issues":{}, "Oxygen":{},"Location":{}, "Musical Instrument":{},
             "Ref":{}, 
              "Extension":{i:f.split(".")[1] for i,f in enumerate(l)}})

df = df.assign(**{
    # list of tokens for checking fixed lists against
    "Tokens":lambda dfa: dfa.apply(lambda s: s["filename"].split("_")[2:], axis=1),
    "Issues":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s, issues))[0][0]] 
                                           if np.isin(s, issues).any() else np.nan),
    "Musical Instrument":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s, mi))[0][0]] 
                                           if np.isin(s, mi).any() else np.nan),
    "Oxygen":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s, oxygen))[0][0]] 
                                           if np.isin(s, oxygen).any() else np.nan),
}).assign(**{
    # let's do tokens again minus ones already placed
    "Tokens":lambda dfa: dfa.apply(lambda s: [t for t in s["filename"].split("_")[2:] 
                                             if not(t==s["Issues"] 
                                                    or t==s["Musical Instrument"]
                                                   or t==s["Oxygen"])], axis=1),
    "Location2":lambda dfa: dfa.apply(lambda s: s["Tokens"][0] if len(s["Tokens"])>0 
                                      and "Side" in s["Tokens"][0] else np.nan, axis=1),
    "Ref":lambda dfa: dfa.apply(lambda s: s["Tokens"][0] if len(s["Tokens"])>0 
                                      and "Side" not in s["Tokens"][0] else np.nan, axis=1)

}).drop(columns=["Tokens","filename"])

print(df.to_string(index=False))

output

Number       Name Location2   Issues Oxygen  Location Musical Instrument        Ref Extension
 15253  Variation       NaN      NaN    NaN       NaN                NaN        NaN       JPG
 15253  Variation       NaN  Tainted    NaN       NaN                NaN        NaN       JPG
 15253  Variation       NaN      NaN     O2       NaN          Saxophone        NaN       PNG
 15253  Variation       NaN      NaN     O2       NaN          Saxophone        NaN       jpg
 15253  Variation       NaN      NaN     O2       NaN          Saxophone  reference       png
 15253  Variation     Side1      NaN    NaN       NaN                NaN        NaN       JPG
 15253  Variation     Side2      NaN    NaN       NaN                NaN        NaN       JPG


Source: stackoverflow