Drop rows in df based on if file name from df exists in folder

Question

I have a dataframe which contains 40108 rows and a folder with pictures (only using a sample of the total 40108 pictures) containing 997 files. The file names of the images correspond to the rows in the column 'imdbId' in the df, with the addition that they have the .jpg suffix. I would like to dr…

Accepted Answer

Iterating over the rows or columns of a pandas dataframe is generally considered an antipattern, and there are several alternatives to try before writing a loop.

In this case you can apply a function to your "imdbid" column that returns True or False for each value, depending on whether the image exists.

You could then drop the rows that don't have an image, but the usual pandas approach is to obtain a new dataframe (or a view of a dataframe) that contains only your data of interest.

As an example:

    # Setup for the example (shell commands):
    # mkdir -p moviegenre/SampleMoviePosters/
    # touch moviegenre/SampleMoviePosters/114709.jpg
    import os
    import pandas as pd

    def image_exists(imdbid):
        # True if a poster named <imdbid>.jpg exists in the sample folder
        filepath = f"moviegenre/SampleMoviePosters/{imdbid}.jpg"
        return os.path.isfile(filepath)

    data = [[114709, 'Animation|Adventure|Comedy'], [113497, 'Action|Adventure|Family']]
    df = pd.DataFrame(data, columns=['imdbid', 'Genre'])

    # Keep only the rows whose image exists (boolean mask)
    df_with_images = df[df["imdbid"].apply(image_exists)]
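Since the folder holds far fewer files (997) than the dataframe has rows (40108), a possible variation is to list the folder once and filter with `isin`, rather than checking the filesystem once per row. The following is only a sketch under some assumptions: the helper name `filter_rows_with_images` and its parameters are hypothetical, the column is assumed to be named "imdbid" as in the answer above, and a temporary directory stands in for the real poster folder so the snippet can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def filter_rows_with_images(df, folder, id_column="imdbid", suffix=".jpg"):
    """Return the rows of df whose id has a matching image file in folder.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # List the folder once and keep the file names without their suffix.
    available = {
        p.stem
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    }
    # Compare as strings so integer ids match the file stems.
    mask = df[id_column].astype(str).isin(available)
    return df[mask]


# Demo in a throwaway directory (assumed stand-in for the real poster folder).
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "114709.jpg").touch()
    data = [
        [114709, "Animation|Adventure|Comedy"],
        [113497, "Action|Adventure|Family"],
    ]
    df = pd.DataFrame(data, columns=["imdbid", "Genre"])
    df_with_images = filter_rows_with_images(df, folder)

print(df_with_images)
```

With the real data, the call would presumably look like `filter_rows_with_images(df, "moviegenre/SampleMoviePosters")`. The result is a new dataframe, so the original `df` is left unchanged.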

Answer