
Keras and TensorFlow OS resource requirements

I keep getting "F tensorflow/core/platform/default/env.cc:73] Check failed: ret == 0 (11 vs. 0)Thread tf_data_private_threadpool creation via pthread_create() failed." errors during training, although the machine is quite powerful:


altogether 64 logical cores

ulimit -s gives 32768, ulimit -u gives 1030608
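The same limits can also be read from inside the Python process that runs the training, which rules out a difference between the interactive shell and the environment the job actually runs in. The following is a minimal sketch using only the standard library; the helper names are made up for illustration. Note that `ulimit -s` reports the stack size in units of 1024 bytes, while `resource.getrlimit` returns bytes. On Linux, the 11 in "(11 vs. 0)" is the errno value EAGAIN, which pthread_create() returns when a thread cannot be created for lack of resources or because a limit was hit.

```python
import errno
import resource


def describe_limit(name, rlimit):
    """Return a one-line description of a soft/hard resource limit pair."""
    soft, hard = resource.getrlimit(rlimit)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{name}: soft={fmt(soft)} hard={fmt(hard)}"


def report_limits():
    """Collect the two limits mentioned in the question (hypothetical helper)."""
    return [
        # Counterpart of `ulimit -u`: max processes/threads for this user.
        describe_limit("RLIMIT_NPROC", resource.RLIMIT_NPROC),
        # Counterpart of `ulimit -s`, but in bytes rather than 1024-byte units.
        describe_limit("RLIMIT_STACK (bytes)", resource.RLIMIT_STACK),
    ]


if __name__ == "__main__":
    for line in report_limits():
        print(line)
    # Symbolic name of the error code; on Linux errno.EAGAIN equals 11.
    print("EAGAIN =", errno.EAGAIN, errno.errorcode[errno.EAGAIN])
```

If the soft limit printed here is much lower than what the shell reports, the job is probably being started with different limits than the interactive session.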

I want to train the following network with a bunch of online generated 512*512 grayscale images along with two additional parameters for each image. Image generation happens in a C++ function called via Pybind11. The C++ function itself is not resource-hungry.

This is my very first AI training code, so it is just copied from a similar application with the parameters adjusted. I need the relatively high resolution because the network needs to learn to infer a real number from a small repeated part of the image.

The situation is the same when I leave only the CNN part of the model, without the concatenation. Moreover, I've counted the processes created during the run. The crash happens at around 31000 python3 processes owned by my user, which I find extreme. Meanwhile, nvidia-smi reports around 13G memory consumption on only one of the GPUs.
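One way to watch the count grow from inside the training script is to read the kernel's own thread counter for the process. This is a Linux-only sketch that assumes /proc is mounted; the function names are hypothetical. It counts native threads as well, so threads created by TensorFlow's C++ runtime are included, which `threading.active_count()` would miss.

```python
from pathlib import Path


def parse_thread_count(status_text):
    """Extract the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def current_thread_count(pid="self"):
    """Return the number of kernel threads of a process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    # Call this periodically (e.g. once per batch) to see whether it keeps rising.
    print("threads in this process:", current_thread_count())
```

A count that climbs steadily with every batch or epoch, instead of settling after startup, suggests that something keeps creating thread pools rather than reusing them.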


What could I do to let it run?


Answer

Sorry, everyone. There was a typo in the code that resulted in endless recursion. The process resource exhaustion happened before the unlimited recursion could overflow the stack, which is why it was hard to spot.
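The actual typo is not shown here, so the following is only a hypothetical illustration of how such a bug could be made to fail early. It wraps a function in a small re-entrancy guard that raises as soon as the function is nested deeper than expected, long before thread or process limits are reached. The names `limit_depth` and `make_batch` are invented for this sketch, and the guard is not thread-safe as written.

```python
import functools


def limit_depth(max_depth):
    """Decorator: raise if the wrapped function is re-entered too deeply."""
    def decorator(func):
        depth = 0  # current nesting level of calls to func

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal depth
            if depth >= max_depth:
                raise RecursionError(
                    f"{func.__name__} nested {depth} levels deep; "
                    "possibly unintended recursion"
                )
            depth += 1
            try:
                return func(*args, **kwargs)
            finally:
                depth -= 1  # always unwind, even on exceptions

        return wrapper
    return decorator


@limit_depth(3)
def make_batch(n):
    # Hypothetical typo: the function calls itself instead of a helper.
    return make_batch(n)


if __name__ == "__main__":
    try:
        make_batch(1)
    except RecursionError as exc:
        print("caught:", exc)
```

Python's built-in recursion limit would eventually raise as well, but only after about a thousand nested calls by default, which, as described above, can be too late if every level allocates threads.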

User contributions licensed under: CC BY-SA