
ModuleNotFoundError in Dataflow job

I am trying to execute an Apache Beam pipeline as a Dataflow job on Google Cloud Platform.

My project structure is as follows:

root_dir/
  __init__.py
  setup.py
  main.py
  utils/
    __init__.py
    log_util.py
    config_util.py

Here’s my setup.py:

import setuptools

setuptools.setup(
   name='dataflow_example',
   version='1.0',
   install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",
        "apache-beam[gcp]>=2.20.0",
        "wheel>=0.36.2"
   ],
   packages=setuptools.find_packages()
)

Here’s my pipeline code:

import math
import apache_beam as beam

from datetime import datetime
from apache_beam.options.pipeline_options import PipelineOptions

from utils.log_util import LogUtil
from utils.config_util import ConfigUtil


class DataflowExample:
    config = {}

    def __init__(self):
        self.config = ConfigUtil.get_config(module_config=["config"])
        self.project = self.config['project']
        self.region = self.config['location']
        self.bucket = self.config['core_bucket']
        self.batch_size = 10

    def execute_pipeline(self):
        try:
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow started")

            query = "SELECT id, name, company FROM `<bigquery_table>` LIMIT 10"

            beam_options = {
                "project": self.project,
                "region": self.region,
                "job_name": "dataflow_example",
                "runner": "DataflowRunner",
                "temp_location": f"gs://{self.bucket}/temp_location/"
            }

            options = PipelineOptions(**beam_options, save_main_session=True)

            with beam.Pipeline(options=options) as pipeline:
                data = (
                        pipeline
                        | 'Read from BQ ' >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
                        | 'Count records' >> beam.combiners.Count.Globally()
                        | 'Print ' >> beam.ParDo(PrintCount(), self.batch_size)
                )

            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow completed")
        except Exception as e:
            LogUtil.log_n_notify(log_type="error", msg=f"Exception in execute_pipeline - {str(e)}")


class PrintCount(beam.DoFn):

    def __init__(self):
        self.logger = LogUtil()

    def process(self, row_count, batch_size):
        try:
            current_date = datetime.today().date()
            total = int(math.ceil(row_count / batch_size))

            self.logger.log_n_notify(log_type="info", msg=f"Records pulled from table on {current_date} is {row_count}")

            self.logger.log_n_notify(log_type="info", msg=f"Records per batch: {batch_size}. Total batches: {total}")
        except Exception as e:
            self.logger.log_n_notify(log_type="error", msg=f"Exception in PrintCount.process  - {str(e)}")


if __name__ == "__main__":
    df_example = DataflowExample()
    df_example.execute_pipeline()

The functionality of the pipeline is:

  1. Query against BigQuery Table.
  2. Count the total records fetched from querying.
  3. Print using the custom Log module present in utils folder.

I am running the job from Cloud Shell using the command python3 main.py

Though the Dataflow job starts, the worker nodes throw an error after a few minutes saying “ModuleNotFoundError: No module named ‘utils'”

The “utils” folder is present, and the same code works fine when executed with the “DirectRunner”.

log_util and config_util are custom utility modules for logging and config fetching, respectively.
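
For reference, hypothetical minimal stand-ins that are compatible with the calls used in the pipeline (LogUtil.log_n_notify and ConfigUtil.get_config) might look like the sketch below; the real modules are not shown here, so this is only an assumption about their interface.

# utils/log_util.py - hypothetical stand-in matching the calls in main.py
import logging


class LogUtil:
    _logger = logging.getLogger("dataflow_example")

    @classmethod
    def log_n_notify(cls, log_type="info", msg=""):
        # Log locally; the real module may also send notifications.
        getattr(cls._logger, log_type, cls._logger.info)(msg)


# utils/config_util.py - hypothetical stand-in for config fetching
class ConfigUtil:

    @staticmethod
    def get_config(module_config=None):
        # The real module presumably loads these values from a config source.
        return {
            "project": "<project_id>",
            "location": "<region>",
            "core_bucket": "<core_bucket>",
        }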

Also, I tried running with the setup_file option as python3 main.py --setup_file </path/of/setup.py>, which makes the job just freeze; it does not proceed even after 15 minutes.

How do I resolve the ModuleNotFoundError with “DataflowRunner”?


Answer

Posting as community wiki. As confirmed by @GopinathS the error and fix are as follows:

The error encountered by the workers is Beam SDK base version 2.32.0 does not match Dataflow Python worker version 2.28.0. Please check Dataflow worker startup logs and make sure that correct version of Beam SDK is installed.

To fix this, “apache-beam[gcp]>=2.20.0” is removed from install_requires in setup.py, since ‘>=’ installs the latest available version (2.32.0 as of this writing) while the workers’ version is only 2.28.0.

Updated setup.py:

import setuptools

setuptools.setup(
   name='dataflow_example',
   version='1.0',
   install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3", # removed apache-beam[gcp]>=2.20.0
        "wheel>=0.36.2"
   ],
   packages=setuptools.find_packages()
)

Updated beam_options in the pipeline code:

    beam_options = {
        "project": self.project,
        "region": self.region,
        "job_name": "dataflow_example",
        "runner": "DataflowRunner",
        "temp_location": f"gs://{self.bucket}/temp_location/",
        "setup_file": "./setup.py"
    }

Also make sure that you pass all the pipeline options at once and not partially.

If you pass --setup_file </path/of/setup.py> on the command line, then make sure to read the setup file path and append it to the already defined beam_options variable using an argument parser in your code.
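
A minimal sketch of that approach, using argparse (the placeholder values stand in for what self.config provides inside execute_pipeline; they are assumptions for illustration):

import argparse

from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: in the actual pipeline these values come from self.config
# inside DataflowExample.execute_pipeline().
project = "<project_id>"
region = "<region>"
bucket = "<core_bucket>"

# parse_known_args() picks up --setup_file and leaves any other flags alone.
parser = argparse.ArgumentParser()
parser.add_argument("--setup_file", default="./setup.py",
                    help="Path to the setup.py that packages the utils module")
known_args, _ = parser.parse_known_args()

beam_options = {
    "project": project,
    "region": region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{bucket}/temp_location/",
    "setup_file": known_args.setup_file,  # appended from the command line
}

options = PipelineOptions(**beam_options, save_main_session=True)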

To avoid parsing the argument and appending it into beam_options, I instead added it directly in beam_options as "setup_file": "./setup.py".
