
How to take specific columns in pandas dataframe only if they exist (different CSVs)

I downloaded a bunch of football data from the internet in order to analyze it (around 30 CSV files). Each season’s game data is saved as a CSV file with different data columns. Some data columns are common to all files e.g. Home team, Away team, Full time result, ref name, etc…

Earlier years CSV data columns picture – These columns are common to all CSVs

However, in more recent years the data became richer and gained some new data columns, e.g. corners for the home team, corners for the away team, yellow cards for each team, shots on goal for each side, etc…

Recent years CSV data columns picture – Contains the common columns as well as additional ones

I made a generic function that takes each season’s CSV gameweek data and turns it into a full table (how it looked at the end of the season) with different stats. When I build the “final-day” table of each season from the common data columns alone, everything works out fine. However, when I try to throw in the uncommon columns (corners, for example) I get an error. This is no surprise to me, and I know how to check whether a CSV includes a certain column, but I’d like to know if there is a clever way to tell the dataset to take a certain column if it exists (say ‘Corners’) and just skip that column if it does not exist.

Here is the part of the function that raises the error; the last line is the problematic one. When I leave only the common columns in (i.e. delete every column after FTR), the function works fine. The code takes one season at a time and builds its table.

# create a pandas dataframe of a specific season before the season started
# returns a pandas dataframe with the year of the season and the teams involved with initialized stats
# path is the full path of the file returned by the glob function, and raw_data is a pandas dataframe read directly from the CSV
def create_initial_table(path, raw_data):
    # extract the season's year
    season_number = path[path.index("/") + 1:path.index(".")]
    # reduce the information to the relevant columns
    raw_data = raw_data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']]

I’d like to continue with this setup, i.e. when a column name does not exist, just skip to the next one, so that the columns that do exist remain and the ones that don’t won’t raise an error.

In later functions I also update the values of these columns (corners, shots on goal, etc…), so the same skip functionality is needed there too.
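For the update step, one common pattern is to intersect a wish-list of column names with the columns actually present before assigning. A minimal sketch with made-up data (the column names and values here are hypothetical, not from the real CSVs):

```python
import pandas as pd

# wish-list of stat columns; only some exist in a given season's CSV
stat_columns = ['HC', 'AC', 'HY', 'AY']

df = pd.DataFrame({'HomeTeam': ['A', 'B'], 'HC': [3, 5]})

# keep only the columns that exist in this season's data
present = [c for c in stat_columns if c in df.columns]

# update only those; 'AC', 'HY', 'AY' are silently skipped
df[present] = df[present] + 1
```

This avoids a `KeyError` on seasons that lack the richer columns, while updating every column that is there.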

Thanks for the advice :>


Answer

You can use DataFrame.filter(items=...); see this example:

import numpy as np
import pandas as pd

all_columns = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']
df = pd.DataFrame(np.random.rand(5, 5), columns=['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'BAD COLUMN'])

print(df)
   HomeTeam  AwayTeam      FTHG      FTAG  BAD COLUMN
0  0.265389  0.523248  0.093941  0.575946    0.929296
1  0.318569  0.667410  0.131798  0.716327    0.289406
2  0.183191  0.586513  0.020108  0.828940    0.004695
3  0.677817  0.270008  0.735194  0.962189    0.248753
4  0.576157  0.592042  0.572252  0.223082    0.952749

Even though I feed it column names that don’t exist in the dataframe, it will only pull out the columns that do exist:

new_df = df.filter(items=all_columns)

print(new_df)
   HomeTeam  AwayTeam      FTHG      FTAG
0  0.265389  0.523248  0.093941  0.575946
1  0.318569  0.667410  0.131798  0.716327
2  0.183191  0.586513  0.020108  0.828940
3  0.677817  0.270008  0.735194  0.962189
4  0.576157  0.592042  0.572252  0.223082
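Applied to the function from the question, the problematic line can simply become a `filter` call. A sketch with toy data standing in for an early-season CSV (the 'Referee' column and the values are made up for illustration):

```python
import pandas as pd

all_columns = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
               'HS', 'AS', 'HST', 'AST', 'HC', 'AC']

# toy stand-in for an early-season CSV: common columns only, plus an extra one
raw_data = pd.DataFrame({
    'HomeTeam': ['Arsenal'], 'AwayTeam': ['Spurs'],
    'FTHG': [2], 'FTAG': [1], 'FTR': ['H'], 'Referee': ['M Dean'],
})

# missing columns ('HS', 'HC', ...) are skipped, extras ('Referee') are dropped
raw_data = raw_data.filter(items=all_columns)
```

Note that `filter(items=...)` also returns the columns in the order of the `items` list, so downstream code can rely on a consistent column order across seasons.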