Return random numbers from lists of varying size with weights

I would like to split existing data into training and test sets in Python. Functions like sklearn.model_selection.train_test_split() sample uniformly at random, so the test data ends up evenly distributed across the input. But since I want to check whether my model can deal with skewed data (more training data on “the left side of the function”), I need to weight the split towards the left side of my data.

I thought about using random.choices() and specifying the weights there, but that seems awkward since the list of weights would have to be as long as the data itself.

I'm basically looking for a function that I can pass my list to and tell it to choose n random numbers from this list, but weight the first ~30% of entries significantly higher than the rest.

Does anyone have an idea?

Edit: Here is some basic code demonstrating the problem:

from sklearn.model_selection import train_test_split

l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Apply the function I'm looking for (hypothetical)
l_train, l_test = function(l, test_size=0.5)
# Apply sklearn's train_test_split
l_train_sk, l_test_sk = train_test_split(l, test_size=0.5)


# Output:
l_train = [1, 3, 7, 2, 4]
l_test = [6, 9, 10, 5, 8]

l_train_sk = [2, 4, 8, 9, 3]
l_test_sk = [1, 6, 10, 5, 7]

As can be seen, the split is applied so that 50% of the data is used as training data. The sklearn function draws training data evenly from the left (1, 2, 3, 4, 5) and the right (6, 7, 8, 9, 10) side of the data, whereas the function I'm looking for significantly overweights data from the left side in the training dataset.

Does this make clear what the objective is?

Answer

I solved the problem this way:

First I created a list of cumulative weights that gives the left half of the data the share I want:

import numpy as np

# Input array X_temp: np.array of shape (n, 1)
len_left = len(X_temp) // 2
len_right = len(X_temp) - len_left  # also handles an odd number of rows
focus_left = 0.7  # choose values from the left half with a probability of 70%

# Cumulative weights in percent: rise to 70 over the left half, then to 100 over the right half
cum_left = [x * ((focus_left * 100) / len_left) for x in np.linspace(1, len_left, len_left)]
cum_right = [((focus_left * 100) + (x * (((1 - focus_left) * 100) / len_right))) for x in np.linspace(1, len_right, len_right)]
cum_gesamt = cum_left + cum_right  # "gesamt" = total
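
To sanity-check the weights, here is a minimal sketch that builds the same kind of cumulative list with NumPy. The helper name `cumulative_weights` and its parameters are my own, not part of the code above, and it assumes `focus_left` is a fraction between 0 and 1 and that there are at least two items.

```python
import numpy as np

def cumulative_weights(n, focus_left=0.7):
    """Cumulative weights in percent for n items (n >= 2).

    The left half of the items shares focus_left of the total mass,
    the right half shares the remainder.
    """
    len_left = n // 2
    len_right = n - len_left  # the right half is one longer when n is odd
    left = np.arange(1, len_left + 1) * (focus_left * 100 / len_left)
    right = focus_left * 100 + np.arange(1, len_right + 1) * ((1 - focus_left) * 100 / len_right)
    return np.concatenate([left, right])

weights = cumulative_weights(10)
print(np.round(weights, 2))
```

For ten items the list should rise in steps of about 14 up to 70 over the left half and then in steps of about 6 up to 100 over the right half, so the last entry is the total weight.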

Then I could randomly choose index entries, redrawing until no index appears more than once:

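import collections
import random

# test_size: fraction of rows held out for testing, e.g. 0.5
double = [None]  # start non-empty so the loop runs at least once
while len(double) != 0:
    train_index = random.choices(np.linspace(0, len(X_temp) - 1, len(X_temp)), cum_weights=cum_gesamt, k=int((1 - test_size) * len(X_temp)))
    # random.choices samples with replacement, so redraw while any index is duplicated
    double = [item for item, n_hits in collections.Counter(train_index).items() if n_hits > 1]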
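
The retry loop can take many attempts when the training share is large, because every draw has to be duplicate-free by chance. As a possible alternative (a different technique from the loop above, and not necessarily identical in distribution), NumPy's `Generator.choice` can sample without replacement directly. This is only a sketch: the function name `weighted_train_indices` and its parameters are hypothetical, and it assumes at least two rows and `focus_left` strictly between 0 and 1.

```python
import numpy as np

def weighted_train_indices(n, test_size=0.5, focus_left=0.7, seed=None):
    """Draw unique training indices from range(n), favouring the left half."""
    rng = np.random.default_rng(seed)
    len_left = n // 2
    len_right = n - len_left
    # Per-item probabilities: the left half shares focus_left, the right half the rest.
    p = np.concatenate([
        np.full(len_left, focus_left / len_left),
        np.full(len_right, (1 - focus_left) / len_right),
    ])
    p /= p.sum()  # guard against rounding so the probabilities sum to 1
    k = int((1 - test_size) * n)
    # replace=False guarantees that no index is drawn twice.
    return rng.choice(n, size=k, replace=False, p=p)

train_index = weighted_train_indices(10, test_size=0.5, seed=0)
print(sorted(train_index.tolist()))
```

The result is an integer array, so it can be used to index `X_temp` directly without the `int(x)` conversion.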

The list train_index now contains randomly selected, unique indices (stored as floats, hence the int() below) that I can use to pick the training elements from the array:

Xtrain_temp = np.array([(X_temp[int(x)]) for x in train_index])
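
The line above only produces the training part. One way to get the matching test part is to take every index that was not chosen. The following is a small sketch of that idea; `split_by_indices` is a name I made up for illustration, and the example indices are picked by hand to reproduce the output from the question.

```python
import numpy as np

def split_by_indices(X, train_index):
    """Return (train, test), where test holds all rows not listed in train_index."""
    X = np.asarray(X)
    # Indices from random.choices over np.linspace are floats, so cast to int.
    train_index = np.asarray(train_index, dtype=int)
    # setdiff1d returns the sorted indices that are not in train_index.
    test_index = np.setdiff1d(np.arange(len(X)), train_index)
    return X[train_index], X[test_index]

X_temp = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
X_train, X_test = split_by_indices(X_temp, [0.0, 2.0, 6.0, 1.0, 3.0])
print(X_train.tolist())  # [1, 3, 7, 2, 4]
print(X_test.tolist())   # [5, 6, 8, 9, 10]
```

Note that the test part comes back in its original order, since the complement is computed from sorted indices.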

So this is a workaround that isn't great, but it works for me. Maybe someone else can benefit from these ideas.

User contributions licensed under: CC BY-SA