I would like to split existing data for a train-test split in Python. Functions like sklearn.model_selection.train_test_split()
typically draw the test data uniformly at random. But since I want to check whether my model can deal with skewed data (more training data on “the left side of the function”), I need to weight the split towards the left side of my data.
I thought about using random.choices()
and specifying the weights there, but that seems awkward since the list containing the weights would have to be very long.
I'm basically looking for a function that I can pass my list, telling it to choose n random entries from the list but weighting the first ~30% of entries significantly higher than the rest.
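To make the desired behaviour concrete, a rough sketch of such a function could look like this (purely illustrative; weighted_sample, heavy_fraction and heavy_weight are made-up names, and numpy is assumed):

import numpy as np

def weighted_sample(data, n, heavy_fraction=0.3, heavy_weight=5.0):
    # Pick n distinct entries, favouring the first `heavy_fraction` of the list.
    data = np.asarray(data)
    cutoff = int(len(data) * heavy_fraction)
    weights = np.ones(len(data))
    weights[:cutoff] = heavy_weight  # first ~30% get a much larger weight
    probs = weights / weights.sum()
    idx = np.random.choice(len(data), size=n, replace=False, p=probs)
    return data[idx]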
Does anyone have an idea?
Edit: Here is some basic code demonstrating the problem:
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Apply the function I'm looking for
l_train, l_test = function(l, test_size=0.5)

# Apply sklearn's train_test_split
l_train_sk, l_test_sk = sklearn.model_selection.train_test_split(l, test_size=0.5)

# Output:
l_train = [1, 3, 7, 2, 4]
l_test = [6, 9, 10, 5, 8]
l_train_sk = [2, 4, 8, 9, 3]
l_test_sk = [1, 6, 10, 5, 7]
As can be seen, the split has been applied so that 50% of the data can be used as training data. The sklearn function selects data randomly from both the left (1, 2, 3, 4, 5) and the right (6, 7, 8, 9, 10) side of the data, whereas the function I'm looking for significantly overweights data from the left side in the training dataset.
Does this make clear what the objective is?
Answer
I solved the problem this way:
First I created a list of cumulative weights reflecting the probabilities I want:
# Input array X_temp: numpy array of shape (n, 1)
import numpy as np

len_half = int(len(X_temp) / 2)
focus_left = 0.7  # choose values from the left half with a probability of 70%

# Build cumulative weights: the left half covers 0..70, the right half 70..100
cum_gesamt = []
cum_left = [x * ((focus_left * 100) / len_half) for x in np.linspace(1, len_half, len_half)]
cum_right = [(focus_left * 100) + (x * (((1 - focus_left) * 100) / len_half)) for x in np.linspace(1, len_half, len_half)]
cum_gesamt.extend(cum_left)
cum_gesamt.extend(cum_right)
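As a quick sanity check (my own illustration, not from the original answer), assume a hypothetical 10-row X_temp so that len_half is 5; the weights then come out as:

cum_left = [14.0, 28.0, 42.0, 56.0, 70.0]
cum_right = [76.0, 82.0, 88.0, 94.0, 100.0]
cum_gesamt = [14.0, 28.0, 42.0, 56.0, 70.0, 76.0, 82.0, 88.0, 94.0, 100.0]
# 70% of the cumulative mass covers the first half of the indices,
# which is exactly the shape random.choices() expects for cum_weights.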
Then I could randomly choose index entries:
import collections
import random
import numpy as np

# random.choices() samples with replacement, so redraw until no index repeats.
double = [None]
count = 0
while len(double) != 0:
    count += 1
    train_index = random.choices(
        np.linspace(0, len(X_temp) - 1, len(X_temp)),
        cum_weights=cum_gesamt,
        k=int((1 - test_size) * len(X_temp)),  # test_size is the test fraction, defined elsewhere
    )
    double = [item for item, cnt in collections.Counter(train_index).items() if cnt > 1]
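As a side note of my own (not part of the original answer), the retry loop can be avoided by drawing unique indices directly with numpy. The sketch below assumes cum_gesamt, X_temp and test_size from the snippets above and converts the cumulative weights back to per-element weights:

import numpy as np

# Per-element weights from the cumulative ones, normalized to probabilities
weights = np.diff(cum_gesamt, prepend=0.0)
probs = weights / weights.sum()
k = int((1 - test_size) * len(X_temp))
train_index = np.random.choice(len(X_temp), size=k, replace=False, p=probs)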
The list train_index
now contains the randomly selected indices, which I can use to pull the corresponding elements out of the data:
Xtrain_temp = np.array([(X_temp[int(x)]) for x in train_index])
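To complete the split, the test set can be built from the indices that were not drawn (my own addition, assuming X_temp and train_index from above; the name Xtest_temp just mirrors Xtrain_temp):

# Everything not chosen for training goes into the test set
chosen = {int(x) for x in train_index}
test_index = [i for i in range(len(X_temp)) if i not in chosen]
Xtest_temp = np.array([X_temp[i] for i in test_index])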
So this is a workaround that isn't great, but it works for me. Maybe someone else can benefit from these ideas.