During training I keep getting this error:

    F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed.

This happens although the machine is quite powerful:
memory size: 256 GiB
2 x AMD EPYC 7302 16-Core Processor (64 logical cores altogether)
8 x NVIDIA A2
ulimit -s gives 32768, ulimit -u gives 1030608
I want to train the following network on a set of 512*512 grayscale images generated on the fly, along with two additional parameters for each image. Image generation happens in a C++ function called via Pybind11. The C++ function itself is not resource-hungry.
This is my very first AI training code, so it is just copied from a similar application with the parameters adjusted. I need the relatively high resolution because the network has to learn to infer a real number from a small repeated part of the image.
The situation is the same when I leave only the CNN part of the model, without the concatenation. Moreover, I've counted the processes created during the run. The crash happens at around 31000 python3 processes owned by my user, which I find extreme. Meanwhile nvidia-smi reports around 13G memory consumption on only one of the GPUs.
    # this one in module landscapeGenerator
    def generate(aBatchSize:int=32, aRepeatParameter:int=2):
        dim = (512, 512)
        paraShape = (aRepeatParameter * 2)
        def generator():
            xParameter = numpy.empty(paraShape, dtype=float)
            xImage = numpy.empty(aDim, dtype=float)
            y = numpy.empty((1), dtype=float)
            # set parameters, use them to obtain the image via Pybind11
            xImage = randomLandscape(dist, height, tempAmb, tempBase)
            xParameter[0] = xImage[0, 0] / 0.04  # Field of view is at most 0.04 radians
            xImage[0, 0] = xImage[0, 1]
            xParameter[aRepeatParameter] = something
            for i in range(1, aRepeatParameter):
                xParameter[i] = xParameter[0]
                xParameter[aRepeatParamter + i] = xParameter[aRepeatParameter]
            y[0] = something
            yield {"parameters": xParameters, "image": xImage}, y
        dataset = tensorflow.data.Dataset.from_generator(generate,
            output_signature=(
                (tensorflow.TensorSpec(shape=paraShape, dtype=tensorflow.float32, name="parameters"),
                 tensorflow.TensorSpec(shape=dim, dtype=tensorflow.float32, name="image")),
                tensorflow.TensorSpec(shape=(1), dtype=tensorflow.float32, name="y")
            ))
        dataset = dataset.batch(aBatchSize)
        return dataset

    def createMlp(aRepeatParameter:int=2):
        model = Sequential()
        vectorSize = aRepeatParameter * 2
        model.add(Dense(vectorSize, input_dim=(vectorSize), activation="relu"))
        model.add(Dense(aRepeatParameter, activation="relu"))
        return model

    def createCnn():
        filters=(512, 128, 32)
        inputShape = (512, 512, 1)
        chanDim = -1
        inputs = Input(shape=inputShape)
        for (i, f) in enumerate(filters):
            if i == 0:
                x = inputs
            x = Conv2D(f, (3, 3), padding="same")(x)
            x = Activation("relu")(x)
            x = BatchNormalization(axis=chanDim)(x)
            x = MaxPooling2D(pool_size=(2, 2))(x)
        x = Flatten()(x)
        x = Dense(16)(x)
        x = Activation("relu")(x)
        x = BatchNormalization(axis=chanDim)(x)
        x = Dropout(0.5)(x)
        x = Dense(4)(x)
        x = Activation("relu")(x)
        model = Model(inputs, x)
        return model

    repeatParameter:int = 2
    mlp = createMlp(repeatParameter)
    cnn = createCnn()
    combinedInput = concatenate([mlp.output, cnn.output])
    x = Dense(4, activation="relu")(combinedInput)
    x = Dense(1, activation="linear")(x)
    model = Model(inputs=[mlp.input, cnn.input], outputs=x)
    opt = Adam(learning_rate=1e-3, decay=1e-3 / 200)
    model.compile(loss="mean_absolute_percentage_error", optimizer=opt)
    batchSize = 32
    model.fit(landscapeGenerator.generate(batchSize, repeatParameter),
              validation_data=landscapeGenerator.generate(batchSize, repeatParameter),
              epochs=10, steps_per_epoch=10, validation_split=0.3)
    model.save('trainAiTemp.model')
What could I do to let it run?
Answer
Sorry, everyone. There was a typo in the code that resulted in endless recursion. The process ran out of resources before the unbounded recursion could overflow the stack, which is why it was hard to spot.
    def generate(aBatchSize:int=32, aRepeatParameter:int=2):
        dim = (512, 512)
        paraShape = (aRepeatParameter * 2)
        def generator():
            xParameter = numpy.empty(paraShape, dtype=float)
            # ...
        dataset = tensorflow.data.Dataset.from_generator(generate,)  # ...
        # Here the generate referred to the outer function resulting in endless
        # recursion. It should have been generator.
        dataset = dataset.batch(aBatchSize)
        return dataset
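
For anyone who runs into the same thing, here is a minimal sketch of the mechanism that runs without TensorFlow. `LazyDataset` is a hypothetical stand-in I made up for illustration. It is not TensorFlow's implementation and only mimics one behavior: the callable is invoked when the dataset is iterated. `generate_buggy`, `generate_fixed` and `require_generator_function` are likewise illustrative names. The guard relies on `inspect.isgeneratorfunction` from the standard library and is stricter than what `from_generator` accepts (any callable returning an iterable), so treat it as a debugging aid only.

```python
import inspect


class LazyDataset:
    """Hypothetical stand-in for a dataset built from a callable.

    The callable is invoked every time the dataset is iterated.
    This is an assumption made for illustration, not TensorFlow code.
    """

    def __init__(self, source):
        self._source = source

    def __iter__(self):
        # Calling the source happens here, at iteration time.
        return iter(self._source())


def generate_buggy():
    def generator():
        yield 1.0

    # Bug: the outer factory is passed instead of the inner generator,
    # so iterating the dataset calls generate_buggy() again and again.
    return LazyDataset(generate_buggy)


def generate_fixed():
    def generator():
        yield 1.0
        yield 2.0

    # Fix: pass the inner generator function.
    return LazyDataset(generator)


def require_generator_function(fn):
    """Return fn unchanged, or raise if it is not a generator function."""
    if not inspect.isgeneratorfunction(fn):
        raise TypeError(f"{fn.__name__} is not a generator function")
    return fn


if __name__ == "__main__":
    print(list(generate_fixed()))
    try:
        list(generate_buggy())
    except RecursionError:
        print("buggy variant recursed without bound")
```

In the stand-in the recursion ends quickly with a `RecursionError`. In my real code every level also created a new dataset with its own resources, so thread creation failed first. Wrapping the argument, as in `from_generator(require_generator_function(generator), ...)`, would have raised a `TypeError` immediately when `generate` was passed by mistake.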