I have a filenames which I converted into a list. The list has the following elements:
list = ['15253_Variation.JPG', '15253_Variation_Tainted.JPG', '15253_Variation_O2_Saxophone.PNG', '15253_Variation_O2_Saxophone.jpg', '15253_Variation_O2_Saxophone_reference.png', '15253_Variation_Side1.JPG', '15253_Variation_Side2.JPG']
My goal is to extract elements from this list and fill out a dataframe, which should look like this:
LINK TO THE GOOGLE SHEETS CONTAINING THE IMAGE ABOVE: https://docs.google.com/spreadsheets/d/1kuX3M4RFCNWtNoE7Hm1ejxWMwF-Cs4p8SsjA3JzdidA/edit?usp=sharing
WHAT I’VE DONE SO FAR is the following code:
Obj = pd.DataFrame(data = list, index = None, columns = ['file']) new_list = [] for i in Obj['file']: new_list.append(i.split('_'))
But, this one does not leave empty spaces thus not doing what I needed.
Thank you very much in advanced.
Advertisement
Answer
As per number of the comments. It’s a pain because the tokens in the filename are not fully fixed format. Quite a lot of conditional logic
- have defined two additional lists
mi
instruments andoxygen
whatever it is. - first pass is in building
dict
that is pandas standard format - then work through conditional logic after have base data frame
# don't name it list - it override python list()! l = ['15253_Variation.JPG', '15253_Variation_Tainted.JPG', '15253_Variation_O2_Saxophone.PNG', '15253_Variation_O2_Saxophone.jpg', '15253_Variation_O2_Saxophone_reference.png', '15253_Variation_Side1.JPG', '15253_Variation_Side2.JPG'] issues = ["Tainted","Perfect"] mi = ["Saxophone"] oxygen = ["O2"] # first pass using dict/list comprehensions df = pd.DataFrame({"filename":{i:f.split(".")[0] for i,f in enumerate(l)}, "Number":{i:f.split("_")[0] for i,f in enumerate(l)}, "Name":{i:f.split("_")[1].split(".")[0] for i,f in enumerate(l)}, "Location2":{}, "Issues":{}, "Oxygen":{},"Location":{}, "Musical Instrument":{}, "Ref":{}, "Extension":{i:f.split(".")[1] for i,f in enumerate(l)}}) df = df.assign(**{ # list of tokens for checking fixed lists against "Tokens":lambda dfa: dfa.apply(lambda s: s["filename"].split("_")[2:], axis=1), "Issues":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s, issues))[0][0]] if np.isin(s, issues).any() else np.nan), "Musical Instrument":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s, mi))[0][0]] if np.isin(s, mi).any() else np.nan), "Oxygen":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s, oxygen))[0][0]] if np.isin(s, oxygen).any() else np.nan), }).assign(**{ # let's do tokens again minus ones already placed "Tokens":lambda dfa: dfa.apply(lambda s: [t for t in s["filename"].split("_")[2:] if not(t==s["Issues"] or t==s["Musical Instrument"] or t==s["Oxygen"])], axis=1), "Location2":lambda dfa: dfa.apply(lambda s: s["Tokens"][0] if len(s["Tokens"])>0 and "Side" in s["Tokens"][0] else np.nan, axis=1), "Ref":lambda dfa: dfa.apply(lambda s: s["Tokens"][0] if len(s["Tokens"])>0 and "Side" not in s["Tokens"][0] else np.nan, axis=1) }).drop(columns=["Tokens","filename"]) print(df.to_string(index=False))
output
Number Name Location2 Issues Oxygen Location Musical Instrument Ref Extension 15253 Variation NaN NaN NaN NaN NaN NaN JPG 15253 Variation NaN Tainted NaN NaN NaN NaN JPG 15253 Variation NaN NaN O2 NaN Saxophone NaN PNG 15253 Variation NaN NaN O2 NaN Saxophone NaN jpg 15253 Variation NaN NaN O2 NaN Saxophone reference png 15253 Variation Side1 NaN NaN NaN NaN NaN JPG 15253 Variation Side2 NaN NaN NaN NaN NaN JPG