Drop rows in df based on if file name from df exists in folder

I have a dataframe with 40108 rows and a folder containing 997 pictures (a sample of the full set of 40108). The image file names correspond to the values in the ‘imdbId’ column of the df, with a .jpg suffix added.

I would like to drop every row in my df whose imdbId value has no corresponding file name in the folder, and keep the rest. That means 997 rows should be left after running the code.

Example:

Position 1 in the df is 114709. A picture named 114709.jpg doesn't exist in the folder, so this row should be dropped.

Position 2 in the df is 113497. A picture named 113497.jpg exists in the folder, so this row should remain. And so on for all rows.

I have been trying to create a boolean index with a for/if loop and os.path.isfile, but I can't work out how to insert the imdbId from the df into the condition correctly.

Example from my notebook:

import os

exists = os.path.isfile('moviegenre/SampleMoviePosters/114709.jpg')
if exists:
    pass  # Do nothing, let the row remain.
else:
    pass  # Drop row

Some help would be greatly appreciated. Thanks in advance.

Answer

Iterating over the rows or columns of a pandas dataframe is commonly considered an antipattern, and there are several alternatives worth trying before writing a loop.

In this case you can apply a function to your “imdbid” column that returns True or False for each value, depending on whether the image exists.

You could then drop the rows that don’t have an image, but the usual approach in pandas is to obtain a new dataframe (or a view of the dataframe) that contains only the data of interest.
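To make the difference between the two approaches concrete, here is a minimal sketch. The set `available` is a hypothetical stand-in for "which images exist on disk", so the snippet runs without any files; the names `mask`, `kept` and `dropped` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the file check: ids that "have an image".
available = {113497}

df = pd.DataFrame(
    {
        "imdbid": [114709, 113497],
        "Genre": ["Animation|Adventure|Comedy", "Action|Adventure|Family"],
    }
)

# One True/False value per row.
mask = df["imdbid"].apply(lambda imdbid: imdbid in available)

# Usual approach: select the matching rows into a new dataframe.
kept = df[mask]

# Alternative: drop the non-matching rows by their index labels.
dropped = df.drop(df[~mask].index)
```

Both `kept` and `dropped` should contain the same rows; the original `df` is left unchanged in either case.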

As an example:

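# Shell setup used to test this example (creates one empty image file):
# mkdir -p moviegenre/SampleMoviePosters/
# touch moviegenre/SampleMoviePosters/114709.jpg

import os
import pandas as pd

def image_exists(imdbid):
    # Build the expected file path for this id and check whether it is a file.
    filepath = f"moviegenre/SampleMoviePosters/{imdbid}.jpg"
    return os.path.isfile(filepath)

data = [[114709, 'Animation|Adventure|Comedy'], [113497, 'Action|Adventure|Family']]

df = pd.DataFrame(data, columns=['imdbid', 'Genre'])

# Keep only the rows whose image exists; df itself is left unchanged.
df_with_images = df[df["imdbid"].apply(image_exists)]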
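With tens of thousands of rows, a possible variation is to list the folder once and test membership with `Series.isin`, instead of checking the file system once per row. This is only a sketch, not part of the original answer: the helper `ids_with_images` is a hypothetical name, the demo uses a temporary folder in place of the real poster directory, and it assumes each file name is exactly the id followed by a lowercase `.jpg` (no leading zeros or other formatting).

```python
import tempfile
from pathlib import Path

import pandas as pd


def ids_with_images(folder):
    """Return the file names in `folder` with the .jpg suffix removed."""
    return {path.stem for path in Path(folder).glob("*.jpg")}


data = [[114709, 'Animation|Adventure|Comedy'], [113497, 'Action|Adventure|Family']]
df = pd.DataFrame(data, columns=['imdbid', 'Genre'])

# Demo only: a temporary folder holding a single empty image file.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "113497.jpg").touch()
    available = ids_with_images(folder)

# File stems are strings, so compare against the ids as strings.
df_with_images = df[df["imdbid"].astype(str).isin(available)]
```

In real use, the folder passed to `ids_with_images` would be the poster directory from the question rather than a temporary one.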
User contributions licensed under: CC BY-SA