Thanks, everyone, for trying to help me understand the issue below. I have updated the question and produced both a CPU-only run and a GPU-only run. In either case, a direct numpy calculation is hundreds of times faster than model.predict(). Hopefully this clarifies that this does not appear to be a CPU vs GPU issue (if it is, I would love an explanation).
Let’s create a trained model with keras.
import tensorflow as tf

(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
Now let’s re-create the model.predict function using numpy.
import numpy as np

W = model.get_weights()

def predict(X):
    X = X.reshape((X.shape[0],-1))            #Flatten
    X = X @ W[0] + W[1]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[2] + W[3]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[4] + W[5]                       #Dense
    X = np.exp(X)/np.exp(X).sum(1)[...,None]  #Softmax
    return X
We can easily verify these are the same function (modulo floating-point error in the implementation).
print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))
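(One caveat worth noting: the softmax above is not numerically stable, since np.exp can overflow for large logits; the GPU session below actually prints a RuntimeWarning for exactly this. A minimal sketch of the standard fix, subtracting the row-wise max before exponentiating, would be:)

def stable_softmax(X):
    # Subtracting the row max doesn't change the result mathematically,
    # but it keeps np.exp from overflowing on large logits.
    Z = X - X.max(1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(1, keepdims=True)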
We can also test how fast these functions run. Using ipython:
%timeit model.predict(X[:10]).argmax(1) # 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs
I find that predict runs about 10,000 times faster than model.predict at small batch sizes, dropping to around 100 times faster at larger batch sizes. Regardless, why is predict so much faster? In fact, predict isn’t even optimized: we could use numba, or even rewrite predict in C and compile it (a rough numba sketch follows).
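For instance, here is a minimal numba sketch of the same forward pass (my illustration, not benchmarked; it assumes numba is installed, and note that numba relies on SciPy's BLAS bindings for the matrix multiplications — predict_jit is just a name made up here):

import numpy as np
from numba import njit

# Unpack and cast the weights once, outside the JIT-compiled function,
# so every array entering the kernel has a fixed dtype.
W0, b0, W1, b1, W2, b2 = [w.astype(np.float32) for w in model.get_weights()]

@njit(cache=True)
def predict_jit(X, W0, b0, W1, b1, W2, b2):
    X = X.reshape((X.shape[0], -1)).astype(np.float32)  # Flatten
    X = np.maximum(X @ W0 + b0, np.float32(0))          # Dense + ReLU
    X = np.maximum(X @ W1 + b1, np.float32(0))          # Dense + ReLU
    X = X @ W2 + b2                                     # Dense
    e = np.exp(X)
    return e / e.sum(axis=1).reshape((-1, 1))           # Softmax

# Usage: predict_jit(X[:10], W0, b0, W1, b1, W2, b2).argmax(1)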
Thinking in terms of deployment, why would manually extracting the weights from the model and re-writing the prediction function be thousands of times faster than what keras does internally? It also means that writing a script that loads a .h5 file (or similar) may be much slower than re-writing the prediction function by hand. Is this true in general?
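For what it’s worth, a common deployment-side mitigation (a sketch, not the only option) is to skip model.predict’s per-call setup by calling the model object directly, optionally wrapped in tf.function; infer below is just a name for illustration:

import tensorflow as tf

# Calling the model directly avoids predict()'s per-call pipeline setup;
# tf.function traces the forward pass into a graph on first call.
infer = tf.function(lambda x: model(x, training=False))

x = tf.constant(X[:10], dtype=tf.float32)
labels = infer(x).numpy().argmax(1)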
IPython Output (CPU):
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>

import numpy as np
W = model.get_weights()

def predict(X):
    X = X.reshape((X.shape[0],-1))            #Flatten
    X = X @ W[0] + W[1]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[2] + W[3]                       #Dense
    X[X<0] = 0                                #Relu
    X = X @ W[4] + W[5]                       #Dense
    X = np.exp(X)/np.exp(X).sum(1)[...,None]  #Softmax
    return X

%timeit model.predict(X[:10]).argmax(1) # 10 loops takes 37.7 ms
52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit predict(X[:10]).argmax(1) # 1000 loops takes 356 µs
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
IPython Output (GPU):
Python 3.7.7 (default, Mar 26 2020, 15:48:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf
   ...:
   ...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
   ...:
   ...: model = tf.keras.models.Sequential([
   ...:     tf.keras.layers.Flatten(),
   ...:     tf.keras.layers.Dense(1000,'relu'),
   ...:     tf.keras.layers.Dense(100,'relu'),
   ...:     tf.keras.layers.Dense(10,'softmax'),
   ...: ])
   ...: model.compile('adam','sparse_categorical_crossentropy')
   ...: model.fit(X,Y,epochs=20,batch_size=1024)
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>

In [2]: import numpy as np
   ...:
   ...: W = model.get_weights()
   ...:
   ...: def predict(X):
   ...:     X = X.reshape((X.shape[0],-1)) #Flatten
   ...:     X = X @ W[0] + W[1] #Dense
   ...:     X[X<0] = 0 #Relu
   ...:     X = X @ W[2] + W[3] #Dense
   ...:     X[X<0] = 0 #Relu
   ...:     X = X @ W[4] + W[5] #Dense
   ...:     X = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
   ...:     return X
   ...:

In [3]: print(model.predict(X[:100]).argmax(1))
   ...: print(predict(X[:100]).argmax(1))
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

In [4]: %timeit model.predict(X[:10]).argmax(1)
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit predict(X[:10]).argmax(1)
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer
The main issue here is TensorFlow's eager execution mode. Below is a quick look at your code and the corresponding results on both CPU and GPU. It is true that numpy doesn't run on a GPU, so unlike tf-gpu, it doesn't incur any host-to-device data-transfer overhead.
But it is also quite noticeable how much faster your hand-written predict function in np is compared to model.predict in tf.keras, even though the test input is only 10 samples. We won't attempt a deep analysis here; there is one piece of art on the subject that you may love to read.
My setup is as follows: I'm using the Colab environment and testing both CPU and GPU modes with:
TensorFlow 1.15.2, Keras 2.3.1, Numpy 1.19.5
TensorFlow 2.4.1,  Keras 2.4.0, Numpy 1.19.5
TF 1.15.2 – CPU
%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
TensorFlow 1.x selected.
1.15.2
A:  <function is_built_with_cuda at 0x7f122d58dcb0>
B:  
([], ['/device:CPU:0'])
Now, running your code.
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.07 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.48 ms per loop
We can see that, with the old keras, the execution times are comparable. Now, let's also test in GPU mode.
TF 1.15.2 – GPU
%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
1.15.2
A:  <function is_built_with_cuda at 0x7f0b5ad46830>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.02 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.44 ms per loop
Here too, with the old keras and no eager mode, the execution times are comparable. Now let's look at the new tf.keras: first with eager mode enabled, then with eager mode disabled.
TF 2.4.1 – CPU
Eagerly
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7fed85de3560>
B:  
([], ['/device:CPU:0'])
Now, running the code with eager mode.
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
10 loops, best of 5: 28 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.73 ms per loop
Disable Eagerly
Now, if we disable eager mode and run the same code, we get:
import tensorflow as tf
import keras

# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or,
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)

print(tf.executing_eagerly())
False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.37 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.57 ms per loop
Again, we can see that with eager mode disabled, the execution times in the new tf.keras are comparable. Now, let's test GPU mode as well.
TF 2.4.1 – GPU
Eagerly
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7f16ad88f680>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
10 loops, best of 5: 26.3 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.48 ms per loop
Disable Eagerly
Lastly, if we disable eager mode once more and run the same code, we get:
# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or,
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)

print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])  # same model as in the question
model.compile(...)
model.fit(...)

%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
1000 loops, best of 5: 1.12 ms per loop

%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.45 ms per loop
And as before, with eager mode disabled, the execution times in the new tf.keras are comparable. That is why eager mode is the root cause of tf.keras being slower than straight numpy.
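(A practical footnote: if globally disabling eager execution feels too heavy-handed, tf.keras also exposes predict_on_batch, which runs the model on a single batch without predict's data-pipeline setup. A minimal usage sketch, assuming the model from the question is in scope:)

probs = model.predict_on_batch(X[:10])  # one batch, no per-call pipeline setup
print(probs.argmax(1))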