Thanks, everyone, for trying to help me understand the issue below. I have updated the question and produced both a CPU-only run and a GPU-only run. In either case, a direct numpy calculation is hundreds of times faster than model.predict(). Hopefully this clarifies that this does not appear to be a CPU vs GPU issue (if it is, I would love an explanation).
Let’s create a trained model with keras.
import tensorflow as tf

(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
Now let’s re-create the model.predict function using numpy.
import numpy as np

W = model.get_weights()

def predict(X):
    X = X.reshape((X.shape[0],-1))            #Flatten
    X = X @ W[0] + W[1]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[2] + W[3]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[4] + W[5]                       #Dense
    X = np.exp(X)/np.exp(X).sum(1)[...,None]  #Softmax
    return X
We can easily verify these are the same function (modulo floating-point error in the implementation).
print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))
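(One caveat worth noting: the softmax above is not numerically stable, since np.exp can overflow for large logits; the GPU session below actually prints a RuntimeWarning for exactly this. A minimal sketch of the standard fix, subtracting the row-wise max before exponentiating, would be:)

def stable_softmax(X):
    # Subtracting the row max doesn't change the result mathematically,
    # but it keeps np.exp from overflowing on large logits.
    Z = X - X.max(1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(1, keepdims=True)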
We can also test how fast these functions run. Using ipython:
%timeit model.predict(X[:10]).argmax(1) # 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs
I find that predict runs about 10,000 times faster than model.predict at small batch sizes, dropping to around 100 times faster at larger batch sizes. Regardless, why is predict so much faster? In fact, predict isn’t even optimized: we could use numba, or even rewrite predict in C and compile it (a rough numba sketch follows).
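For instance, here is a minimal numba sketch of the same forward pass (my illustration, not benchmarked; it assumes numba is installed, and note that numba relies on SciPy's BLAS bindings for the matrix multiplications — predict_jit is just a name made up here):

import numpy as np
from numba import njit

# Unpack and cast the weights once, outside the JIT-compiled function,
# so every array entering the kernel has a fixed dtype.
W0, b0, W1, b1, W2, b2 = [w.astype(np.float32) for w in model.get_weights()]

@njit(cache=True)
def predict_jit(X, W0, b0, W1, b1, W2, b2):
    X = X.reshape((X.shape[0], -1)).astype(np.float32)  # Flatten
    X = np.maximum(X @ W0 + b0, np.float32(0))          # Dense + ReLU
    X = np.maximum(X @ W1 + b1, np.float32(0))          # Dense + ReLU
    X = X @ W2 + b2                                     # Dense
    e = np.exp(X)
    return e / e.sum(axis=1).reshape((-1, 1))           # Softmax

# Usage: predict_jit(X[:10], W0, b0, W1, b1, W2, b2).argmax(1)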
Thinking in terms of deployment, why would manually extracting the weights from the model and re-writing the prediction function be thousands of times faster than what keras does internally? It also means that writing a script that loads a .h5 file (or similar) may be much slower than re-writing the prediction function by hand. Is this true in general?
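For what it’s worth, a common deployment-side mitigation (a sketch, not the only option) is to skip model.predict’s per-call setup by calling the model object directly, optionally wrapped in tf.function; infer below is just a name for illustration:

import tensorflow as tf

# Calling the model directly avoids predict()'s per-call pipeline setup;
# tf.function traces the forward pass into a graph on first call.
infer = tf.function(lambda x: model(x, training=False))

x = tf.constant(X[:10], dtype=tf.float32)
labels = infer(x).numpy().argmax(1)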
IPython Output (CPU):
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>

import numpy as np
W = model.get_weights()

def predict(X):
    X = X.reshape((X.shape[0],-1))            #Flatten
    X = X @ W[0] + W[1]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[2] + W[3]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[4] + W[5]                       #Dense
    X = np.exp(X)/np.exp(X).sum(1)[...,None]  #Softmax
    return X

%timeit model.predict(X[:10]).argmax(1) # 10 loops takes 37.7 ms
52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit predict(X[:10]).argmax(1) # 1000 loops takes 356 µs
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
IPython Output (GPU):
Python 3.7.7 (default, Mar 26 2020, 15:48:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf
   ...:
   ...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
   ...:
   ...: model = tf.keras.models.Sequential([
   ...:     tf.keras.layers.Flatten(),
   ...:     tf.keras.layers.Dense(1000,'relu'),
   ...:     tf.keras.layers.Dense(100,'relu'),
   ...:     tf.keras.layers.Dense(10,'softmax'),
   ...: ])
   ...: model.compile('adam','sparse_categorical_crossentropy')
   ...: model.fit(X,Y,epochs=20,batch_size=1024)
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>

In [2]: import numpy as np
   ...:
   ...: W = model.get_weights()
   ...:
   ...: def predict(X):
   ...:     X = X.reshape((X.shape[0],-1)) #Flatten
   ...:     X = X @ W[0] + W[1] #Dense
   ...:     X[X<0] = 0 #Relu
   ...:     X = X @ W[2] + W[3] #Dense
   ...:     X[X<0] = 0 #Relu
   ...:     X = X @ W[4] + W[5] #Dense
   ...:     X = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
   ...:     return X
   ...:

In [3]: print(model.predict(X[:100]).argmax(1))
   ...: print(predict(X[:100]).argmax(1))
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

In [4]: %timeit model.predict(X[:10]).argmax(1)
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit predict(X[:10]).argmax(1)
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer
The main issue here is TensorFlow's eager execution mode. Below is a quick look at your code and the corresponding results on both CPU and GPU. It is true that numpy doesn't run on a GPU, so unlike tf-gpu, it doesn't incur any host-to-device data-transfer overhead.
But it is also quite noticeable how much faster your hand-written predict function in np is compared to model.predict in tf.keras, even though the test input is only 10 samples. We won't attempt a deep analysis here; there is one piece of art on the subject that you may love to read.
My setup is as follows: I'm using the Colab environment and testing both CPU and GPU modes with:
TensorFlow 1.15.2, Keras 2.3.1, Numpy 1.19.5
TensorFlow 2.4.1,  Keras 2.4.0, Numpy 1.19.5
TF 1.15.2 – CPU
%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
TensorFlow 1.x selected.
1.15.2
A:  <function is_built_with_cuda at 0x7f122d58dcb0>
B:  
([], ['/device:CPU:0'])
Now, running your code.
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.07 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.48 ms per loop
We can see that, with the old keras, the execution times are comparable. Now, let's also test in GPU mode.
TF 1.15.2 – GPU
%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
1.15.2
A:  <function is_built_with_cuda at 0x7f0b5ad46830>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.02 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.44 ms per loop
Here too, with the old keras and no eager mode, the execution times are comparable. Now let's look at the new tf.keras: first with eager mode enabled, then with eager mode disabled.
TF 2.4.1 – CPU
Eagerly
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7fed85de3560>
B:  
([], ['/device:CPU:0'])
Now, running the code with eager mode.
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
10 loops, best of 5: 28 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.73 ms per loop
Disable Eagerly
Now, if we disable eager mode and run the same code, we get:
import tensorflow as tf
import keras

# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or,
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)

print(tf.executing_eagerly())
False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.37 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.57 ms per loop
Again, we can see that with eager mode disabled, the execution times in the new tf.keras are comparable. Now, let's test GPU mode as well.
TF 2.4.1 – GPU
Eagerly
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7f16ad88f680>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
10 loops, best of 5: 26.3 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.48 ms per loop
Disable Eagerly
Lastly, if we disable eager mode once more and run the same code, we get:
# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or,
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)

print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.12 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.45 ms per loop
And as before, with eager mode disabled, the execution times in the new tf.keras are comparable. That is why eager mode is the root cause of tf.keras being slower than straight numpy.
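(A practical footnote: if globally disabling eager execution feels too heavy-handed, tf.keras also exposes predict_on_batch, which runs the model on a single batch without predict's data-pipeline setup. A minimal usage sketch, assuming the model from the question is in scope:)

probs = model.predict_on_batch(X[:10])  # one batch, no per-call pipeline setup
print(probs.argmax(1))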