I'm having serious issues running a Python Apache Beam pipeline on the GCP Dataflow runner when it is launched from CircleCI. I would really appreciate any hint on how to tackle this; I've tried everything I can think of, but nothing seems to work.
Basically, I'm running a Python Apache Beam pipeline on Dataflow that uses google-api-python-client==1.12.3. If I launch the job from my machine (python3 main.py --runner dataflow --setup_file /path/to/my/file/setup.py), it works fine. If I launch the same job from within CircleCI, the Dataflow job is created, but it fails with ImportError: No module named 'apiclient'.
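For context, my setup.py follows the usual minimal pattern for --setup_file; the sketch below is illustrative (the package name and the single pinned dependency are placeholders, not my exact file):

# setup.py -- minimal sketch of the kind of file passed via --setup_file
# (package name and pins are placeholders, not the actual project file)
import setuptools

setuptools.setup(
    name="my-dataflow-pipeline",          # placeholder package name
    version="0.0.1",
    packages=setuptools.find_packages(),  # pick up the pipeline modules
    install_requires=[
        "google-api-python-client==1.12.3",
    ],
)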
Looking at this documentation, I think I should probably pass a requirements.txt file explicitly. If I run that same pipeline from CircleCI but add the --requirements_file argument pointing to a requirements file containing a single line (google-api-python-client==1.12.3), the Dataflow job fails because the workers fail. In the worker logs there is first an info-level message ERROR: Could not find a version that satisfies the requirement wheel (from versions: none), which results in a later error message "Error syncing pod somePodIdHere ("dataflow-myjob-harness-rl84_default(somePodIdHere)"), skipping: failed to "StartContainer" for "python" with CrashLoopBackOff: "back-off 40s restarting failed container=python pod=dataflow-myjob-harness-rl84_default(somePodIdHere)". I found this thread, but its solution didn't seem to work in my case.
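For clarity, the CircleCI invocation in that attempt is roughly the same command as above with the extra flag added (paths are placeholders):

python3 main.py \
  --runner dataflow \
  --setup_file /path/to/my/file/setup.py \
  --requirements_file /path/to/requirements.txt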
Any help would be really, really appreciated. Thanks a lot in advance!
Answer
This question looks very similar to yours. The solution seemed to be to explicitly include the dependencies of your requirements in your requirements.txt file.
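In practice that means listing google-api-python-client together with its transitive dependencies rather than the single line from the question. A sketch of what the requirements.txt could look like for the 1.12.x line; the exact set and pins below are illustrative, and regenerating them with pip freeze in a clean virtualenv that contains only your direct dependencies is the safer way to produce the file:

# requirements.txt -- illustrative pins, regenerate with `pip freeze` for real values
google-api-python-client==1.12.3
google-api-core==1.22.2
google-auth==1.21.1
google-auth-httplib2==0.0.4
httplib2==0.18.1
six==1.15.0
uritemplate==3.0.1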