
Keras and TensorFlow OS resource requirements

I keep getting the following error during training:

F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed.

The machine is quite powerful:

Memory: 256 GiB
CPU: 2 x AMD EPYC 7302 16-Core Processor (64 logical cores in total)
GPU: 8 x NVIDIA A2

ulimit -s gives 32768, ulimit -u gives 1030608
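
For reference, the same limits can be read from inside the Python process. This is only a minimal sketch for a Unix-like system: the helper names current_limits and native_thread_count are made up for illustration, and reading /proc/self/status is an assumption that holds on Linux only. The 11 in the error message matches EAGAIN on Linux, which pthread_create() returns when a thread or resource limit is hit, so watching the native thread count during training can show whether it keeps growing.

```python
import resource
import threading


def describe_limit(value):
    """Turn the RLIM_INFINITY sentinel into a readable string."""
    return "unlimited" if value == resource.RLIM_INFINITY else value


def current_limits():
    """Soft limits roughly corresponding to `ulimit -u` and `ulimit -s`."""
    soft_nproc, _hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    soft_stack, _hard_stack = resource.getrlimit(resource.RLIMIT_STACK)
    return {
        "max_user_processes": describe_limit(soft_nproc),
        # Reported in bytes here, whereas `ulimit -s` prints KiB.
        "stack_bytes": describe_limit(soft_stack),
        # Counts Python-level threads only, not threads created by native code.
        "python_threads": threading.active_count(),
    }


def native_thread_count():
    """Number of OS threads in this process (Linux only), or None elsewhere."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


if __name__ == "__main__":
    print(current_limits())
    print("native threads:", native_thread_count())
```

Calling native_thread_count() periodically, for example from a training callback, would show a steadily climbing number if something keeps spawning threads.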

I want to train the following network on 512×512 grayscale images that are generated on the fly, along with two additional parameters for each image. Image generation happens in a C++ function called via Pybind11. The C++ function itself is not resource-hungry.

This is my very first AI training code, so it is just copied from a similar application with the parameters adjusted. I need the relatively high resolution because the network needs to learn to infer a real number from a small repeated part of the image.

The situation is the same when I keep only the CNN part of the model, without the concatenation. Moreover, I've counted the processes created during the run. The crash happens when around 31000 python3 processes are running under my user, which I find extreme. Meanwhile, nvidia-smi reports around 13G memory consumption on only one of the GPUs.

# this one in module landscapeGenerator
def generate(aBatchSize:int=32, aRepeatParameter:int=2):
  dim = (512, 512)
  paraShape = (aRepeatParameter * 2)
  def generator():
    xParameter = numpy.empty(paraShape, dtype=float)
    xImage     = numpy.empty(dim, dtype=float)
    y          = numpy.empty((1), dtype=float)
# set parameters, use them to obtain the image via Pybind11
    xImage = randomLandscape(dist, height, tempAmb, tempBase)
    xParameter[0] = xImage[0, 0] / 0.04  # Field of view is at most 0.04 radians
    xImage[0, 0]  = xImage[0, 1]
    xParameter[aRepeatParameter] = something
    for i in range(1, aRepeatParameter):
      xParameter[i] = xParameter[0]
      xParameter[aRepeatParameter + i] = xParameter[aRepeatParameter]
    y[0]          = something
    yield {"parameters": xParameter, "image": xImage}, y

  dataset = tensorflow.data.Dataset.from_generator(generate,
    output_signature=(
      (tensorflow.TensorSpec(shape=paraShape, dtype=tensorflow.float32, name="parameters"),
      tensorflow.TensorSpec(shape=dim, dtype=tensorflow.float32, name="image")),
      tensorflow.TensorSpec(shape=(1), dtype=tensorflow.float32, name="y")
            ))
  dataset = dataset.batch(aBatchSize)
  return dataset

def createMlp(aRepeatParameter:int=2):
  model = Sequential()
  vectorSize = aRepeatParameter * 2
  model.add(Dense(vectorSize, input_dim=(vectorSize), activation="relu"))
  model.add(Dense(aRepeatParameter, activation="relu"))
  return model

def createCnn():
  filters=(512, 128, 32)
  inputShape = (512, 512, 1)
  chanDim = -1
  inputs = Input(shape=inputShape)
  for (i, f) in enumerate(filters):
    if i == 0:
      x = inputs
    x = Conv2D(f, (3, 3), padding="same")(x)
    x = Activation("relu")(x)
    x = BatchNormalization(axis=chanDim)(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
  x = Flatten()(x)
  x = Dense(16)(x)
  x = Activation("relu")(x)
  x = BatchNormalization(axis=chanDim)(x)
  x = Dropout(0.5)(x)
  x = Dense(4)(x)
  x = Activation("relu")(x)
  model = Model(inputs, x)
  return model

repeatParameter:int = 2
mlp = createMlp(repeatParameter)
cnn = createCnn()
combinedInput = concatenate([mlp.output, cnn.output])
x = Dense(4, activation="relu")(combinedInput)
x = Dense(1, activation="linear")(x)
model = Model(inputs=[mlp.input, cnn.input], outputs=x)

opt = Adam(learning_rate=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", optimizer=opt)

batchSize = 32
model.fit(landscapeGenerator.generate(batchSize, repeatParameter), validation_data=landscapeGenerator.generate(batchSize, repeatParameter),
  epochs=10, steps_per_epoch=10, validation_split=0.3)

model.save('trainAiTemp.model')

What can I do to get it to run?

Answer

Sorry, everyone. There was a typo in the code that caused endless recursion. The process resources were exhausted before the unbounded recursion could overflow the stack, which is why it was hard to spot.

def generate(aBatchSize:int=32, aRepeatParameter:int=2):
  dim = (512, 512)
  paraShape = (aRepeatParameter * 2)
  def generator():
    xParameter = numpy.empty(paraShape, dtype=float)
# ...

  dataset = tensorflow.data.Dataset.from_generator(generate,) # ...
# Here, generate refers to the outer function, which results in endless
# recursion. It should have been generator.

  dataset = dataset.batch(aBatchSize)
  return dataset
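
The mechanism can be reproduced without TensorFlow. The following is only a sketch: LazyDataset is a hypothetical stand-in written for this illustration, not a TensorFlow class, and it assumes that the callable handed to the dataset is invoked only once iteration starts. Under that assumption, passing the outer function makes every iteration build another dataset, which is the recursion described above.

```python
class LazyDataset:
    """Hypothetical stand-in for a lazily evaluated dataset (not TensorFlow)."""

    def __init__(self, factory):
        # The factory is stored here and called only when iteration begins.
        self.factory = factory

    def __iter__(self):
        return iter(self.factory())


def generate_buggy():
    def generator():
        yield 1

    # Bug: the outer function is passed, so iterating the dataset calls
    # generate_buggy() again, which builds yet another dataset, and so on.
    return LazyDataset(generate_buggy)


def generate_fixed():
    def generator():
        yield 1
        yield 2

    # Fix: pass the inner generator function instead.
    return LazyDataset(generator)


def recursion_detected(dataset):
    """Return True if iterating the dataset recurses without end."""
    try:
        list(dataset)
    except RecursionError:
        return True
    return False


if __name__ == "__main__":
    print(recursion_detected(generate_buggy()))
    print(list(generate_fixed()))
```

In plain Python the interpreter stops this with a RecursionError almost immediately. In the real pipeline each nesting level presumably also allocated its own threads, so the thread limit was reached first and the error pointed at pthread_create() rather than at the recursion.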
User contributions licensed under: CC BY-SA