I would like to split existing data for a train-test split in Python. Functions like sklearn.model_selection.train_test_split()
typically draw the test data uniformly at random. But since I want to check whether my model can deal with skewed data (more training data on “the left side of the function”), I need to weight the split towards the left side of my data.
I thought about using random.choices()
and specifying the weights there, but that seems awkward since the list containing the weights would have to be very long.
I'm basically looking for a function that I can pass my list, telling it to choose n random entries from the list but weighting the first ~30% of entries significantly higher than the rest.
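To make the desired behaviour concrete, a rough sketch of such a function could look like this (purely illustrative; weighted_sample, heavy_fraction and heavy_weight are made-up names, and numpy is assumed):

import numpy as np

def weighted_sample(data, n, heavy_fraction=0.3, heavy_weight=5.0):
    # Pick n distinct entries, favouring the first `heavy_fraction` of the list.
    data = np.asarray(data)
    cutoff = int(len(data) * heavy_fraction)
    weights = np.ones(len(data))
    weights[:cutoff] = heavy_weight  # first ~30% get a much larger weight
    probs = weights / weights.sum()
    idx = np.random.choice(len(data), size=n, replace=False, p=probs)
    return data[idx]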
Does anyone have an idea?
Edit: Here is some basic code demonstrating the problem:
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Apply the function I'm looking for
l_train, l_test = function(l, test_size=0.5)

# Apply sklearn's train_test_split
l_train_sk, l_test_sk = sklearn.model_selection.train_test_split(l, test_size=0.5)

# Output:
l_train = [1, 3, 7, 2, 4]
l_test = [6, 9, 10, 5, 8]
l_train_sk = [2, 4, 8, 9, 3]
l_test_sk = [1, 6, 10, 5, 7]
As can be seen, the split has been applied so that 50% of the data can be used as training data. The sklearn function selects data randomly from both the left (1, 2, 3, 4, 5) and the right (6, 7, 8, 9, 10) side of the data, whereas the function I'm looking for significantly overweights data from the left side in the training dataset.
Does this make clear what the objective is?
Answer
I solved the problem this way:
First I created a list of cumulative weights reflecting the probabilities I want:
# Input array X_temp: numpy array of shape (n, 1)
import numpy as np

len_half = int(len(X_temp) / 2)
focus_left = 0.7  # choose values from the left half with a probability of 70%

# Build cumulative weights: the left half covers 0..70, the right half 70..100
cum_gesamt = []
cum_left = [x * ((focus_left * 100) / len_half) for x in np.linspace(1, len_half, len_half)]
cum_right = [(focus_left * 100) + (x * (((1 - focus_left) * 100) / len_half)) for x in np.linspace(1, len_half, len_half)]
cum_gesamt.extend(cum_left)
cum_gesamt.extend(cum_right)
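As a quick sanity check (my own illustration, not from the original answer), assume a hypothetical 10-row X_temp so that len_half is 5; the weights then come out as:

cum_left = [14.0, 28.0, 42.0, 56.0, 70.0]
cum_right = [76.0, 82.0, 88.0, 94.0, 100.0]
cum_gesamt = [14.0, 28.0, 42.0, 56.0, 70.0, 76.0, 82.0, 88.0, 94.0, 100.0]
# 70% of the cumulative mass covers the first half of the indices,
# which is exactly the shape random.choices() expects for cum_weights.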
Then I could randomly choose index entries:
import collections
import random
import numpy as np

# random.choices() samples with replacement, so redraw until no index repeats.
double = [None]
count = 0
while len(double) != 0:
    count += 1
    train_index = random.choices(
        np.linspace(0, len(X_temp) - 1, len(X_temp)),
        cum_weights=cum_gesamt,
        k=int((1 - test_size) * len(X_temp)),  # test_size is the test fraction, defined elsewhere
    )
    double = [item for item, cnt in collections.Counter(train_index).items() if cnt > 1]
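As a side note of my own (not part of the original answer), the retry loop can be avoided by drawing unique indices directly with numpy. The sketch below assumes cum_gesamt, X_temp and test_size from the snippets above and converts the cumulative weights back to per-element weights:

import numpy as np

# Per-element weights from the cumulative ones, normalized to probabilities
weights = np.diff(cum_gesamt, prepend=0.0)
probs = weights / weights.sum()
k = int((1 - test_size) * len(X_temp))
train_index = np.random.choice(len(X_temp), size=k, replace=False, p=probs)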
The list train_index
now contains the randomly selected indices, which I can use to pull the corresponding elements out of the data:
Xtrain_temp = np.array([(X_temp[int(x)]) for x in train_index])
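To complete the split, the test set can be built from the indices that were not drawn (my own addition, assuming X_temp and train_index from above; the name Xtest_temp just mirrors Xtrain_temp):

# Everything not chosen for training goes into the test set
chosen = {int(x) for x in train_index}
test_index = [i for i in range(len(X_temp)) if i not in chosen]
Xtest_temp = np.array([X_temp[i] for i in test_index])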
So this is a workaround that isn't great, but it works for me. Maybe someone else can benefit from these ideas.