I could not get the code that calculates PSI values to work, and I am not very familiar with the feature_engine library or with ML-related operations in general.
The code I am currently trying to run is:
    long_list = merge_into_df(oot_path, test_path, train_path, key_mapping_path)
    long_list.drop(columns=['Unnamed: 0_x', 'CLIENT_ID', 'SET'], inplace=True)
    long_list['REF_DATE'] = pd.to_datetime(long_list.REF_DATE)
    print(long_list.head())

    transformer = DropHighPSIFeatures(
        cut_off=pd.to_datetime("2019/09/30"),  # the cut_off date
        split_col='REF_DATE',                  # the date variable
        strategy='equal_frequency',
        bins=8,
        threshold=0.1,
        missing_values='ignore'
    )
    transformer.fit_transform(long_list)
    return transformer.psi_values_
The error message returned is:
    Traceback (most recent call last):
      File "C:UsersDellPipelinemodelling.py", line 124, in <module>
        test()
      File "C:UsersDellPipelinemodelling.py", line 98, in test
      File "C:ProgramDataMiniconda3libsite-packagesfeature_engineselectiondrop_psi_features.py", line 364, in fit
        test_discrete = bucketer.transform(test_df[[feature]].dropna())
      File "C:ProgramDataMiniconda3libsite-packagesfeature_enginediscretisationbase_discretiser.py", line 74, in transform
        X = super().transform(X)
      File "C:ProgramDataMiniconda3libsite-packagesfeature_enginebase_transformers.py", line 146, in transform
        X = check_X(X)
      File "C:ProgramDataMiniconda3libsite-packagesfeature_enginedataframe_checks.py", line 82, in check_X
        raise ValueError(
    ValueError: 0 feature(s) (shape=(0, 1)) while a minimum of 1 is required.
The output of the print statement in the previous code snippet is:
       ID  TARGET  GROUP_ID  BRANCH_ID  ...  SON_4_12AY_7_12AY_EKOD_1  SON_4_12AY_7_12AY_EKOD_U  Unnamed: 0_y    REF_DATE
    0   0       0         0       1020  ...                         0                         0             0  2016-12-31
    1   2       0         0       2280  ...                         0                         0             2  2016-12-31
    2   3       0         0       1150  ...                         0                         0             3  2016-12-31
    3   4       1         0       1000  ...                         0                         0             4  2016-12-31
    4   5       0         0       1090  ...                         0                         0             5  2016-12-31

    [5 rows x 1976 columns]
So I assumed there is nothing problematic in the DataFrame itself (apart from, maybe, the Unnamed: 0_y column).
However, just in case, this is how I create the DataFrame from three long-list CSV files and a key-mapping CSV file:
    train_df = pd.read_csv(train_path, low_memory=False)
    test_df = pd.read_csv(test_path, low_memory=False)
    oot_df = pd.read_csv(oot_path, low_memory=False)
    key_mapping_df = pd.read_csv(key_mapping_path)

    long_list_df = pd.concat([train_df, test_df, oot_df], axis=0)
    long_list_final_df = long_list_df.merge(key_mapping_df, on="ID", how="inner", sort=True)
    return long_list_final_df
Answer
It turns out the problem was caused either by the data in the DataFrame (long_list) being too sparse (too many NaN values) or by the DataFrame being too large. I haven't run the experiment to figure out which of the two it was, but the error went away once I dropped the columns with a lot of NaN values.
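For anyone who wants to try the same clean-up, here is a minimal sketch of dropping sparse columns before fitting. It is not the exact code I ran. The traceback fails on `test_df[[feature]].dropna()` with `shape=(0, 1)`, which would be consistent with a feature that has no non-missing rows on one side of the cut-off, so the sketch flags those columns as well. The function name `drop_sparse_columns`, the `max_nan_fraction` threshold, and the `<=` / `>` split around the cut-off are my own assumptions for illustration and are not part of feature_engine.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, split_col, cut_off, max_nan_fraction=0.5):
    """Drop feature columns that are too sparse to be binned reliably.

    Hypothetical helper: max_nan_fraction is an illustrative threshold,
    and the <= / > split around cut_off is an assumption about how the
    data is divided, not a documented feature_engine behaviour.
    """
    before = df[df[split_col] <= cut_off]
    after = df[df[split_col] > cut_off]
    features = [c for c in df.columns if c != split_col]

    # Columns with no non-missing value on one side of the split:
    # after dropna() such a column is left with zero rows.
    empty_on_one_side = [
        c for c in features
        if before[c].notna().sum() == 0 or after[c].notna().sum() == 0
    ]

    # Columns whose overall share of NaN exceeds the chosen threshold.
    nan_fraction = df[features].isna().mean()
    too_sparse = nan_fraction[nan_fraction > max_nan_fraction].index.tolist()

    to_drop = sorted(set(empty_on_one_side) | set(too_sparse))
    return df.drop(columns=to_drop), to_drop


# Small made-up frame to show the behaviour (not my real data).
demo = pd.DataFrame({
    "REF_DATE": pd.to_datetime(
        ["2019-06-30", "2019-06-30", "2019-12-31", "2019-12-31"]
    ),
    "dense": [1.0, 2.0, 3.0, 4.0],
    "empty_after": [1.0, 2.0, np.nan, np.nan],
    "mostly_nan": [np.nan, np.nan, np.nan, 5.0],
})

cleaned, dropped = drop_sparse_columns(
    demo, "REF_DATE", pd.Timestamp("2019-09-30")
)
print(dropped)
```

With the real data, the call would look like `long_list, dropped = drop_sparse_columns(long_list, 'REF_DATE', pd.to_datetime("2019/09/30"))` before `transformer.fit_transform(long_list)`. Printing `dropped` should also help tell apart the two possible causes mentioned above, since it lists exactly which columns were removed for sparsity.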