Why is TensorFlow 2 much slower than TensorFlow 1?

Question

It's been cited by many users as the reason for switching to PyTorch, but I've yet to find a justification/explanation for sacrificing the most important practical quality, speed, for eager execution. Below is code benchmarking performance, TF1 vs. TF2, with TF1 running anywhere from 47% to…

Accepted Answer

UPDATE 8/30/2020: TF 2.3 has finally done it: all cases run as fast as, or notably faster than, any previous version.

Further, my previous update was unfair to TF; my GPU was to blame, as it has been overheating lately. If you see a rising stem plot of iteration times, it's a reliable symptom. Lastly, see a dev's note on Eager vs Graph.

This might be my last update on this answer. The true stats on your model's speed can only be found by you, on your device.

UPDATE 5/19/2020: TF 2.2, using same tests: only a minor improvement in Eager speed. Plots for Large-Large Numpy train_on_batch case below, x-axis is successive fit iterations; my GPU isn't near its full capacity, so I doubt it's throttling, but iterations do get slower over time.

Per above, Graph and Eager are 1.56x and 1.97x slower than their TF1 counterparts, respectively. Unsure I'll debug this further, as I'm considering switching to PyTorch per TensorFlow's poor support for custom / low-level functionality. I did, however, open an Issue to get devs' feedback.

UPDATE 2/18/2020: I've benched 2.1 and 2.1-nightly; the results are mixed. All but one config (model & data size) are as fast as or much faster than the best of TF2 & TF1. The one that's slower, and dramatically so, is Large-Large, especially in Graph execution (1.6x to 2.5x slower).

Furthermore, there are extreme reproducibility differences between Graph and Eager for a large model I tested, ones not explainable via randomness/compute-parallelism. I can't currently present reproducible code for these claims per time constraints, so instead I strongly recommend testing this for your own models.

Haven't opened a Git issue on these yet, but I did comment on the original; no response yet. I'll update the answer(s) once progress is made.

VERDICT: it isn't, IF you know what you're doing.
But if you don't, it could cost you, lots: by a few GPU upgrades on average, and by multiple GPUs worst-case.

THIS ANSWER: aims to provide a high-level description of the issue, as well as guidelines for how to decide on the training configuration specific to your needs. For a detailed, low-level description, which includes all benchmarking results + code used, see my other answer.

I'll be updating my answer(s) w/ more info if I learn any; you can bookmark / "star" this question for reference.

ISSUE SUMMARY: as confirmed by a TensorFlow developer, Q. Scott Zhu, TF2 focused development on Eager execution & tight integration w/ Keras, which involved sweeping changes in TF source, including at graph-level. Benefits: greatly expanded processing, distribution, debug, and deployment capabilities. The cost of some of these, however, is speed.

The matter, however, is rather more complex. It isn't just TF1 vs. TF2; factors yielding significant differences in train speed include:

- TF2 vs. TF1
- Eager vs. Graph mode
- keras vs. tf.keras
- numpy vs. tf.data.Dataset vs. …
- train_on_batch() vs. fit()
- GPU vs. CPU
- model(x) vs. model.predict(x) vs. …

Unfortunately, almost none of the above are independent of the others, and each can at least double execution time relative to another. Fortunately, you can determine what'll work best systematically, and with a few shortcuts, as I'll be showing.

WHAT SHOULD I DO? Currently, the only way is to experiment for your specific model, data, and hardware.
No single configuration will always work best, but there are do's and don'ts to simplify your search:

>> DO:

- train_on_batch() + numpy + tf.keras + TF1 + Eager/Graph
- train_on_batch() + numpy + tf.keras + TF2 + Graph
- fit() + numpy + tf.keras + TF1/TF2 + Graph + large model & data

>> DON'T:

- fit() + numpy + keras for small & medium models and data
- fit() + numpy + tf.keras + TF1/TF2 + Eager
- train_on_batch() + numpy + keras + TF1 + Eager
- [Major] tf.python.keras; it can run 10-100x slower, and w/ plenty of bugs; more info
  - This includes layers, models, optimizers, & related "out-of-box" usage imports; ops, utils, & related 'private' imports are fine, but to be sure, check for alts, & whether they're used in tf.keras

Refer to code at bottom of my other answer for an example benchmarking setup. The list above is based mainly on the "BENCHMARKS" tables in the other answer.

LIMITATIONS of the above DO's & DON'Ts:

- This question's titled "Why is TF2 much slower than TF1?", and while its body concerns training explicitly, the matter isn't limited to it; inference, too, is subject to major speed differences, even within the same TF version, import, data format, etc.; see this answer.
- RNNs are likely to notably change the data grid in the other answer, as they've been improved in TF2
- Models primarily used Conv1D and Dense; no RNNs, sparse data/targets, 4/5D inputs, & other configs
- Input data limited to numpy and tf.data.Dataset, while many other formats exist; see other answer
- GPU was used; results will differ on a CPU. In fact, when I asked the question, my CUDA wasn't properly configured, and some of the results were CPU-based.

Why did TF2 sacrifice the most practical quality, speed, for eager execution? It hasn't, clearly; graph is still available.
But if the question is "why eager at all":

- Superior debugging: you've likely come across multitudes of questions asking "how do I get intermediate layer outputs" or "how do I inspect weights"; with eager, it's (almost) as simple as .__dict__. Graph, in contrast, requires familiarity with special backend functions, greatly complicating the entire process of debugging & introspection.
- Faster prototyping: per ideas similar to above; faster understanding = more time left for actual DL.

HOW TO ENABLE/DISABLE EAGER?

    tf.enable_eager_execution()  # TF1; must be done before any model/tensor creation
    tf.compat.v1.disable_eager_execution()  # TF2; above holds

Misleading in TF2; see here.

ADDITIONAL INFO:

- Careful with _on_batch() methods in TF2; according to the TF dev, they still use a slower implementation, but not intentionally, i.e. it's to be fixed. See other answer for details.

REQUESTS TO TENSORFLOW DEVS:

- Please fix train_on_batch(), and the performance aspect of calling fit() iteratively; custom train loops are important to many, especially to me.
- Add documentation / docstring mention of these performance differences for users' knowledge.
- Improve general execution speed to keep peeps from hopping to PyTorch.

ACKNOWLEDGEMENTS: Thanks to

- Q. Scott Zhu, TensorFlow developer, for his detailed clarification on the matter.
- P. Andrey for sharing useful testing, and discussion.

UPDATES:

- 11/14/19: found a model (in my real application) that runs slower on TF2 for all* configurations w/ Numpy input data. Differences ranged 13-19%, averaging 17%. Differences between keras and tf.keras, however, were more dramatic: 18-40%, avg. 32% (both TF1 & 2). (* except Eager, for which TF2 OOM'd)
- 11/17/19: devs updated on_batch() methods in a recent commit, stating to have improved speed; to be released in TF 2.1, or available now as tf-nightly. As I'm unable to get the latter running, will delay benching until 2.1.
- 2/20/20: prediction performance is also worth benching; in TF2, for example, CPU prediction times can involve periodic spikes
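
The advice to experiment on your own model, data, and hardware can be sketched as a small timing harness. This is a minimal illustration, not the benchmarking code from the other answer: the function names (time_iterations, summarize, relative_slowdown) and the "drift" measure are made up for this example, and it assumes the step callable blocks until its work is finished. It uses only the Python standard library, so the training step itself is whatever you pass in.

```python
import statistics
import time


def time_iterations(step, n_iters=50, n_warmup=5):
    """Call `step()` repeatedly and return per-iteration wall times in seconds."""
    # Warm-up calls are not timed: the first iterations often include one-off
    # costs (graph building, memory allocation) that would skew the results.
    for _ in range(n_warmup):
        step()
    times = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        step()
        times.append(time.perf_counter() - t0)
    return times


def summarize(times):
    """Return the median time and a drift ratio (hypothetical measure).

    drift = mean of the last quarter of iterations / mean of the first quarter.
    A value well above 1 means iterations are getting slower over time.
    """
    q = max(1, len(times) // 4)
    drift = statistics.mean(times[-q:]) / statistics.mean(times[:q])
    return {"median": statistics.median(times), "drift": drift}


def relative_slowdown(times_a, times_b):
    """How many times slower configuration A is than B, by median iteration time."""
    return statistics.median(times_a) / statistics.median(times_b)
```

To compare two configurations, time the same step under each, for instance a callable wrapping train_on_batch() on your own model and batch in one run and fit() in another, then pass both lists to relative_slowdown(). A drift noticeably above 1 corresponds to the rising iteration times described in the updates above, which in my case pointed to an overheating GPU rather than to TF.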

Answer