
Return random numbers from lists of varying size with weights

I would like to split existing data into training and test sets in Python. Functions like sklearn.model_selection.train_test_split() typically pick the test data uniformly at random across the whole dataset. But since I want to check whether my model can deal with skewed data (more training data on "the left side of the function"), I need to weight the split more towards the left side of my data.

I thought about using random.choices() and specifying the weights here, but that seems odd since the list containing the weights would have to be very long.
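For reference, the weight list does not have to be written out by hand; it can be built from two short pieces. A minimal sketch, assuming a 100-element list, a 30% "left side" and a weight factor of 5 (all three numbers are illustrative, not from the question). Note that random.choices() draws with replacement, so the same entry can come up more than once:

```python
import random

data = list(range(1, 101))          # illustrative data, 1..100
head = round(len(data) * 0.3)       # size of the "left side" (assumed 30%)

# One weight per element: the first 30% count 5x as much (assumed factor).
weights = [5.0] * head + [1.0] * (len(data) - head)

# Draws WITH replacement, so duplicates are possible.
picks = random.choices(data, weights=weights, k=20)
print(picks)
```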

I'm basically looking for a function that I can pass my list to and tell it to choose n random entries from this list, but weight the first ~30% of entries significantly higher than the rest.
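One way such a function could look, as a sketch only: it uses the Efraimidis-Spirakis key trick (draw a key u ** (1 / w) per item and keep the n largest keys), which samples without replacement. The name weighted_sample and the parameters head_fraction and head_weight are hypothetical, chosen just for this illustration:

```python
import heapq
import random

def weighted_sample(items, n, head_fraction=0.3, head_weight=5.0, rng=random):
    """Pick n distinct items, favouring the first head_fraction of the list."""
    head = round(len(items) * head_fraction)
    keyed = []
    for i, item in enumerate(items):
        w = head_weight if i < head else 1.0
        # Larger weights push the key towards 1, so the item is kept more often.
        keyed.append((rng.random() ** (1.0 / w), i, item))
    top = heapq.nlargest(n, keyed)
    # Return the chosen items in their original order.
    return [item for _, _, item in sorted(top, key=lambda t: t[1])]

train = weighted_sample(list(range(1, 11)), 5)
print(train)
```

Because every item gets exactly one key, no item can be selected twice, which is what a train/test split needs.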

Does anyone have an idea?

Edit: Here is some basic code demonstrating the problem:


As can be seen, the split has been applied so that 50% of the data can be used as training data. The sklearn function selects data at random from both the left (1,2,3,4,5) and the right (6,7,8,9,10) side of the data, whereas the function I'm looking for would significantly overweight data from the left side in the training dataset.

Does this make clear what the objective is?


Answer

I solved the problem this way:

First I created a list of cumulative probabilities with the weights I want:
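A minimal sketch of this step, assuming ten entries with the first 30% weighted five times as high as the rest (the helper name cumulative_probabilities and both parameters are assumptions for illustration):

```python
from itertools import accumulate

def cumulative_probabilities(n_items, head_fraction=0.3, head_weight=5.0):
    """Return running totals of the normalised weights; the last value is ~1.0."""
    head = round(n_items * head_fraction)
    weights = [head_weight] * head + [1.0] * (n_items - head)
    total = sum(weights)
    return list(accumulate(w / total for w in weights))

cum_probs = cumulative_probabilities(10)
print(cum_probs)
```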


Then I could randomly choose index entries:
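A sketch of how the drawing could be done, assuming a bisect lookup into the cumulative list and rejection of duplicates so that every index appears at most once (the deduplication and the name draw_indices are assumptions; the weights repeat the illustrative 5:1 ratio):

```python
import random
from bisect import bisect_right
from itertools import accumulate

def draw_indices(cum_weights, n, rng=random):
    """Draw n distinct indices; index i is hit in proportion to its weight."""
    if n > len(cum_weights):
        raise ValueError("cannot draw more distinct indices than entries")
    total = cum_weights[-1]
    chosen = set()
    while len(chosen) < n:
        r = rng.random() * total
        # First position whose cumulative weight exceeds r.
        idx = bisect_right(cum_weights, r)
        chosen.add(min(idx, len(cum_weights) - 1))  # guard against rounding
    return sorted(chosen)

cum_weights = list(accumulate([5.0] * 3 + [1.0] * 7))
train_index = draw_indices(cum_weights, 5)
print(train_index)
```

Rejecting duplicates keeps the code short but gets slow when n is close to the list length and the weights are very uneven.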


The list train_index now contains randomly selected index entries that I could now use to get my elements from the list:
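For illustration, with a hypothetical train_index (the values below are made up, not output from the original code), the lookup and the complementary test set could be written as:

```python
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_index = [0, 1, 2, 4, 7]   # hypothetical result of the previous step

# Training data: the elements at the chosen positions.
train = [data[i] for i in train_index]

# Test data: everything that was not chosen.
chosen = set(train_index)
test = [x for i, x in enumerate(data) if i not in chosen]

print(train)  # [1, 2, 3, 5, 8]
print(test)   # [4, 6, 7, 9, 10]
```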


So this is a workaround that isn't great, but it works for me. Maybe someone else can benefit from these ideas.

User contributions licensed under: CC BY-SA