
How to take specific columns in pandas dataframe only if they exist (different CSVs)

I downloaded a bunch of football data from the internet in order to analyze it (around 30 CSV files). Each season’s game data is saved as a CSV file with different data columns. Some data columns are common to all files e.g. Home team, Away team, Full time result, ref name, etc…

Earlier years CSV data columns picture – These columns are common to all CSVs

However, in more recent years the data became richer and gained some new data columns, e.g. corners for the home team, corners for the away team, yellow cards for each team, shots on goal for each side, etc…

Recent years CSV data columns picture – Contains the common columns as well as additional ones

I made a generic function that takes each season’s CSV gameweek data and turns it into a full table (how it looked at the end of the season) with different stats. When I build the “final-day” table of each season from the common data columns alone, everything works out fine. However, when I try to throw in the uncommon columns (corners, for example) I get an error. This is no surprise to me, and I know how to check whether a CSV includes a certain column, but I’d like to know if there is a clever way to tell the dataset to take a certain column if it exists (say ‘Corners’) and just skip that column if it does not exist.

Here is the part of the function that raises the error; the last line is the problematic one. When I leave only the common columns in (i.e. delete every column after FTR), the function works fine. The code takes one season at a time and builds its table.

# create a pandas dataframe of a specific season before the season started
# returns a pandas dataframe with the year of the season and the teams involved with initialized stats
# path is the full path of the file returned by the glob function, and raw_data is a pandas dataframe read directly from the CSV
def create_initial_table(path, raw_data):
    # extract the season's year
    season_number = path[path.index("/") + 1:path.index(".")]
    # reduce the information to the relevant columns
    raw_data = raw_data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']]

I’d like to continue with this setup, i.e. when a column name does not exist, just skip to the next one, so that the columns that do exist remain and the ones that don’t won’t raise an error.

In later functions I also update the values of these columns (corners, shots on goal, etc…), so the same skip functionality is needed there too.
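For the update step, one common pattern is to intersect a wish-list of column names with the columns actually present before assigning. A minimal sketch with made-up data (the column names and values here are hypothetical, not from the real CSVs):

```python
import pandas as pd

# wish-list of stat columns; only some exist in a given season's CSV
stat_columns = ['HC', 'AC', 'HY', 'AY']

df = pd.DataFrame({'HomeTeam': ['A', 'B'], 'HC': [3, 5]})

# keep only the columns that exist in this season's data
present = [c for c in stat_columns if c in df.columns]

# update only those; 'AC', 'HY', 'AY' are silently skipped
df[present] = df[present] + 1
```

This avoids a `KeyError` on seasons that lack the richer columns, while updating every column that is there.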

Thanks for the advice :>


Answer

You can use DataFrame.filter(items=...); see this example:

import numpy as np
import pandas as pd

all_columns = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']
df = pd.DataFrame(np.random.rand(5, 5), columns=['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'BAD COLUMN'])

print(df)
   HomeTeam  AwayTeam      FTHG      FTAG  BAD COLUMN
0  0.265389  0.523248  0.093941  0.575946    0.929296
1  0.318569  0.667410  0.131798  0.716327    0.289406
2  0.183191  0.586513  0.020108  0.828940    0.004695
3  0.677817  0.270008  0.735194  0.962189    0.248753
4  0.576157  0.592042  0.572252  0.223082    0.952749

Even though I feed it column names that don’t exist in the dataframe, it will only pull out the columns that do exist:

new_df = df.filter(items=all_columns)

print(new_df)
   HomeTeam  AwayTeam      FTHG      FTAG
0  0.265389  0.523248  0.093941  0.575946
1  0.318569  0.667410  0.131798  0.716327
2  0.183191  0.586513  0.020108  0.828940
3  0.677817  0.270008  0.735194  0.962189
4  0.576157  0.592042  0.572252  0.223082
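Applied to the function from the question, the problematic line can simply become a `filter` call. A sketch with toy data standing in for an early-season CSV (the 'Referee' column and the values are made up for illustration):

```python
import pandas as pd

all_columns = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR',
               'HS', 'AS', 'HST', 'AST', 'HC', 'AC']

# toy stand-in for an early-season CSV: common columns only, plus an extra one
raw_data = pd.DataFrame({
    'HomeTeam': ['Arsenal'], 'AwayTeam': ['Spurs'],
    'FTHG': [2], 'FTAG': [1], 'FTR': ['H'], 'Referee': ['M Dean'],
})

# missing columns ('HS', 'HC', ...) are skipped, extras ('Referee') are dropped
raw_data = raw_data.filter(items=all_columns)
```

Note that `filter(items=...)` also returns the columns in the order of the `items` list, so downstream code can rely on a consistent column order across seasons.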