
Start CloudSQL Proxy on Python Dataflow / Apache Beam

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from App Engine using a cron job.

I have a version which works locally using the DirectRunner. For that I use the CloudSQL (Postgres) proxy client so that I can connect to the database on 127.0.0.1.

When using the DataflowRunner with custom commands that start the proxy from within a setup.py script, the job won't execute. It gets stuck, repeating this log message:

Setting node annotation to enable volume controller attach/detach

Part of my setup.py looks like this:

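(A minimal sketch of that part, assuming the standard Beam custom-commands pattern; the proxy download URL, instance connection name and package metadata are placeholders rather than my exact values.)

    # setup.py (sketch) -- runs custom commands on each Dataflow worker at startup.
    import subprocess

    from distutils.command.build import build as _build
    import setuptools

    # Download the CloudSQL proxy binary and make it executable.
    CUSTOM_COMMANDS = [
        ['wget', 'https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64',
         '-O', 'cloud_sql_proxy'],
        ['chmod', '+x', 'cloud_sql_proxy'],
    ]


    class build(_build):
        sub_commands = _build.sub_commands + [('CustomCommands', None)]


    class CustomCommands(setuptools.Command):

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                # Blocking calls: fetch and prepare the proxy binary.
                subprocess.check_call(command)
            # The "last line": start the proxy in the background so it keeps
            # running while the job executes.
            subprocess.Popen(
                ['./cloud_sql_proxy',
                 '-instances=my-project:europe-west1:my-instance=tcp:5432'])


    setuptools.setup(
        name='dataflow-cloudsql-job',
        version='0.0.1',
        install_requires=['psycopg2'],
        packages=setuptools.find_packages(),
        cmdclass={'build': build, 'CustomCommands': CustomCommands},
    )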

I added the last line as a separate subprocess.Popen() within run() after reading this issue on GitHub from sthomp and this discussion on Stack Overflow. I also tried to play around with some parameters of subprocess.Popen.

Another solution mentioned by brodin was to allow access from every IP address and to connect via username and password. As I understand it, he does not claim that this is best practice.

Thank you in advance for your help.

!!! Workaround solution at bottom of this post !!!


Update – Logfiles

These are the error-level logs that occur during a job:


Here you can find all logs after the start of my custom setup.py (log level: any; all logs):

https://jpst.it/1gk2Z

Update – Logfiles 2

Job logs (I manually canceled the job after it was stuck for a while):


Stack Traces:


Update: A workaround solution can be found in my answer below.


Answer

Workaround Solution:

I finally found a workaround. I took the idea of connecting via the public IP of the CloudSQL instance. For that you need to allow connections to your CloudSQL instance from every IP:

  1. Go to the overview page of your CloudSQL instance in GCP
  2. Click on the Authorization tab
  3. Click on Add network and add 0.0.0.0/0 (!! this will allow every IP address to connect to your instance !!)

To add security to the process, I used SSL keys and only allowed SSL connections to the instance:

  1. Click on the SSL tab
  2. Click on Create a new certificate to create an SSL certificate for your server
  3. Click on Create a client certificate to create an SSL certificate for your client
  4. Click on Allow only SSL connections to reject all non-SSL connection attempts

After that, I stored the certificates in a Google Cloud Storage bucket, and I load them before connecting within the Dataflow job, i.e.:

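(A minimal sketch of those helpers, assuming the google-cloud-storage client and psycopg2; the bucket name, file names and connection parameters are placeholders.)

    import os

    from google.cloud import storage
    import psycopg2


    def download_ssl_certs(bucket_name='my-cert-bucket', dest_dir='/tmp'):
        """Download server CA, client certificate and client key from GCS."""
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        paths = {}
        for name in ('server-ca.pem', 'client-cert.pem', 'client-key.pem'):
            local_path = os.path.join(dest_dir, name)
            bucket.blob(name).download_to_filename(local_path)
            # Restrict permissions; libpq refuses a client key readable by others.
            os.chmod(local_path, 0o600)
            paths[name] = local_path
        return paths


    def get_connection(certs):
        """Connect to the public IP of the CloudSQL instance, SSL only."""
        return psycopg2.connect(
            host='<public-ip-of-cloudsql-instance>',
            port=5432,
            dbname='my_database',
            user='my_user',
            password='my_password',
            sslmode='verify-ca',
            sslrootcert=certs['server-ca.pem'],
            sslcert=certs['client-cert.pem'],
            sslkey=certs['client-key.pem'])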

I then use these functions in a custom ParDo to perform queries.
Minimal example:

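(A minimal sketch of such a DoFn, reusing the helper functions sketched above; the table and query are placeholders.)

    import apache_beam as beam


    class ReadFromCloudSQL(beam.DoFn):
        """Queries CloudSQL over the SSL connection set up above."""

        def start_bundle(self):
            # Connect once per bundle rather than once per element.
            certs = download_ssl_certs()
            self.connection = get_connection(certs)

        def process(self, element):
            with self.connection.cursor() as cursor:
                cursor.execute('SELECT id, value FROM my_table WHERE id = %s',
                               (element,))
                for row in cursor.fetchall():
                    yield row

        def finish_bundle(self):
            self.connection.close()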

Part of the pipeline could then look like this:

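(A sketch of the wiring, assuming the DoFn above; pipeline options, step names and the BigQuery table are placeholders.)

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions()  # runner, project, setup_file, ... as usual

    with beam.Pipeline(options=options) as p:
        (p
         | 'CreateIds' >> beam.Create([1, 2, 3])
         | 'ReadFromCloudSQL' >> beam.ParDo(ReadFromCloudSQL())
         | 'RowsToDicts' >> beam.Map(lambda row: {'id': row[0], 'value': row[1]})
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema='id:INTEGER,value:STRING'))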

I hope this solution helps others with similar problems.
