Problem Summary In the following example, my NMT model has high loss because it correctly predicts target_input instead of target_output. As is evident, the prediction matches up almost 100% with target_input instead of target_output, as it should (off-by-one). Loss and gradients are being calculated using ta…