
Itertools combinations: how to make it faster?

I am coding a program that takes 54 (num1) numbers and puts them in a list. It then takes 16 (num2) of those numbers at a time and builds a list of lists, each inner list holding 16 numbers, one for every possible combination of num1 choose num2. It then turns those lists into 4×4 arrays.

The code works, but running all 54 numbers to get every array I want would take a long time. I know this because I have tested the code with 20 to 40 numbers and timed it.

20 numbers =  0.000055 minutes
30 numbers =  0.045088 minutes
40 numbers =  17.46944 minutes

Using all 20 test data points I collected, I built a mathematical model to predict how long the 54-number run would take, and I get 1740 minutes ≈ 29 hours. That is already an improvement over a v1 of this code that predicted 38 hours, and a v0 that actually crashed my machine.
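
For context, the combination counts themselves show why the runtime explodes with num1. A quick check using Python's math.comb (available since Python 3.8):

from math import comb

# Number of 16-element combinations for each test size
for n in (20, 30, 40, 54):
    print(n, comb(n, 16))
# 20 -> 4845
# 30 -> 145422675
# 40 -> 62852101650
# 54 -> about 2.1094923 x 10**13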

I am reaching out to try to make this run even faster. The program is not even RAM-intensive: I have 8 GB of RAM and a Core i7 processor, and it does not slow down my machine at all. It actually runs very smoothly compared to previous versions, which crashed my computer a few times.

Do you think there is a way? I am currently sampling to reduce processing time, but I would prefer not to sample at all if possible. To further reduce processing time I am not printing the arrays either; I only print a counter showing how many combinations were generated.

This is the code:

import numpy as np
from itertools import combinations, islice
from random import sample

num1 = 30  # ideally this is 54; using 30 for now so you can test it
num2 = 16
steps = 1454226  # 1% of C(30, 16) = 145,422,675; stepping by this keeps only ~100 evenly spaced combinations

nums = list(range(1, num1 + 1))  # the numbers 1 to num1
# print("nums:", nums)  # sanity check: numbers from 1 to num1

vun = []    # random orderings of each sampled combination
tabl = []   # 4x4 arrays built from each combination
counter = 0

# Take every `steps`-th combination. Note: islice still advances the
# underlying iterator one combination at a time.
combin = islice(combinations(nums, num2), 0, None, steps)
for combo in combin:  # combinations are already unique, so no set() is needed
    vun.append(sample(combo, num2))  # a shuffled copy of the 16 numbers
    counter += 1
    # Reshape the 16 numbers into a 4x4 array (replaces the p1..p16 unpacking)
    tabl.append(np.array(combo).reshape(4, 4))
    # print("TABL:", tabl)
# print("vun:", vun)
print("combinations:", counter)

The output I get with this code is:

combinations: 101

Ideally, this number would be 2.109492366 × 10¹³, or at least 1% of that, as long as the full 54-choose-16 run does not take 29 hours.


Answer

The main inefficiency comes from generating all the combinations (itertools.combinations(nums, num2)) only to throw most of them away: islice with a step still advances the underlying iterator one combination at a time, so the full sequence gets generated regardless.

Another approach would be to generate combinations at random, ensuring there are no duplicates.

import itertools
import random

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)
    
items = list(range(1, 55))

samples = set()

while len(samples) < 100_000:
    sample = random_combination(items, 16)
    samples.add(sample)

for sample in samples:
    board = list(sample)
    random.shuffle(board)
    board = [board[0:4], board[4:8], board[8:12], board[12:16]]

print("done")

This uses the random_combination function from answers to this question, which in turn comes from the itertools documentation.

The code generates 100,000 unique 4×4 samples in about ten seconds, at least on my machine.
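
If you want to reproduce that measurement, a minimal timing sketch (condensing the helper above into sorted(random.sample(...))) could look like this:

import random
import time

items = list(range(1, 55))

start = time.perf_counter()
samples = set()
while len(samples) < 100_000:
    # A sorted tuple, so equal combinations compare (and hash) equal
    samples.add(tuple(sorted(random.sample(items, 16))))
elapsed = time.perf_counter() - start

print(f"{len(samples)} unique combinations in {elapsed:.2f}s")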

A few notes:

  • Each sample is a tuple and the entries are sorted; this means we can store them in a set and avoid duplicates.
  • Because of the first point, we shuffle each sample before creating a 4×4 board from it; the later code doesn’t do anything with these boards, but I wanted to include them to get a sense of the timing.
  • It’s possible that you would get lots of duplicate draws (and wasted retries) if you were to sample a large proportion of the space, but that’s not feasible anyway because of the amount of data it would involve (see below); a rough estimate of the duplicate rate follows this list.
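
As a rough sanity check on that last point, the birthday approximation k²/(2N) gives the expected number of repeated draws; for 100,000 draws from a space of about 2.1 × 10¹³ it is negligible:

from math import comb

n_space = comb(54, 16)  # total number of 16-combinations, about 2.1 x 10**13
k = 100_000             # number of random draws
print(k * k / (2 * n_space))  # expected repeats: about 0.00024, so retries are rare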

I think there’s been some confusion about what you are trying to achieve here.

54C16 ≈ 2.1 × 10¹³; to store 16 8-bit integers for every one of these combinations would take 2.7 × 10¹⁵ bits, which is 337.5 terabytes. That’s beyond what could be stored on a local disk.

So covering even 1% of the space would take over 3 TB, which might just be possible to store on disk. You hinted in the question that you’d like to cover that proportion of the space; clearly that’s not going to happen in 8 GB of RAM.
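
The arithmetic behind these estimates is straightforward to reproduce:

from math import comb

n = comb(54, 16)                 # about 2.1094923 x 10**13 combinations
total_bytes = n * 16             # 16 one-byte integers per combination
print(total_bytes / 1e12)        # about 337.5 TB for the full space
print(total_bytes / 100 / 1e12)  # about 3.375 TB for 1% of it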
