Start CloudSQL Proxy on Python Dataflow / Apache Beam

Question

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from AppEngine using a cron job. I have a version which works locally usin…

Accepted Answer

Workaround Solution:

I finally found a workaround. I took the idea to connect via the public IP of the CloudSQL instance. For that you need to allow connections to your CloudSQL instance from every IP:

1. Go to the overview page of your CloudSQL instance in GCP
2. Click on the Authorization tab
3. Click on Add network and add 0.0.0.0/0 (!! this will allow every IP address to connect to your instance !!)

To add security to the process, I used SSL keys and only allowed SSL connections to the instance:

1. Click on the SSL tab
2. Click on Create a new certificate to create an SSL certificate for your server
3. Click on Create a client certificate to create an SSL certificate for your client
4. Click on Allow only SSL connections to reject all non-SSL connection attempts

After that I stored the certificates in a Google Cloud Storage bucket and load them before connecting within the Dataflow job, i.e.:

    import os
    import select
    import stat

    import psycopg2
    import psycopg2.extensions
    from google.cloud import storage

    # Function to wait for an open connection when processing in parallel
    def wait(conn):
        while True:
            state = conn.poll()
            if state == psycopg2.extensions.POLL_OK:
                break
            elif state == psycopg2.extensions.POLL_WRITE:
                select.select([], [conn.fileno()], [])
            elif state == psycopg2.extensions.POLL_READ:
                select.select([conn.fileno()], [], [])
            else:
                raise psycopg2.OperationalError("poll() returned %s" % state)

    # Function which returns a connection which can be used for queries
    def connect_to_db(host, hostaddr, dbname, user, password, sslmode='verify-full'):
        # Get keys from GCS
        client = storage.Client()
        bucket = client.get_bucket('<YOUR_BUCKET_NAME>')
        bucket.get_blob('PATH_TO/server-ca.pem').download_to_filename('server-ca.pem')
        bucket.get_blob('PATH_TO/client-key.pem').download_to_filename('client-key.pem')
        # The client key must only be accessible by its owner
        os.chmod("client-key.pem", stat.S_IRWXU)
        bucket.get_blob('PATH_TO/client-cert.pem').download_to_filename('client-cert.pem')
        sslrootcert = 'server-ca.pem'
        sslkey = 'client-key.pem'
        sslcert = 'client-cert.pem'
        con = psycopg2.connect(
            host=host,
            hostaddr=hostaddr,
            dbname=dbname,
            user=user,
            password=password,
            sslmode=sslmode,
            sslrootcert=sslrootcert,
            sslcert=sslcert,
            sslkey=sslkey)
        return con

I then use these functions in a custom ParDo to perform queries.

Minimal example:

    import apache_beam as beam
    from psycopg2.extras import RealDictCursor

    class ReadSQLTableNames(beam.DoFn):
        '''
        ParDo class to get all table names of a given CloudSQL database.
        It will return each table name.
        '''
        def __init__(self, host, hostaddr, dbname, username, password):
            super(ReadSQLTableNames, self).__init__()
            self.host = host
            self.hostaddr = hostaddr
            self.dbname = dbname
            self.username = username
            self.password = password

        def process(self, element):
            # Connect to database
            con = connect_to_db(host=self.host,
                                hostaddr=self.hostaddr,
                                dbname=self.dbname,
                                user=self.username,
                                password=self.password)
            # Wait for free connection
            wait(con)
            # Create cursor to query data
            cur = con.cursor(cursor_factory=RealDictCursor)
            # Get all table names
            cur.execute(
                """
                SELECT
                tablename as table
                FROM pg_tables
                WHERE schemaname = 'public'
                """
            )
            table_names = cur.fetchall()
            cur.close()
            con.close()
            for table_name in table_names:
                yield table_name["table"]

A part of the pipeline then could look like this:

    # Current workaround to query all tables:
    # create a dummy initiator PCollection with one element
    init = p | 'Begin pipeline with initiator' >> beam.Create(['All tables initializer'])
    tables = init | 'Get table names' >> beam.ParDo(ReadSQLTableNames(
        host=known_args.host,
        hostaddr=known_args.hostaddr,
        dbname=known_args.db_name,
        username=known_args.user,
        password=known_args.password))

I hope this solution helps others with similar problems.
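One detail of the certificate handling above that may be worth isolating: libpq refuses a client key file that is readable by group or others, which is why the answer calls os.chmod on client-key.pem. The following is only a sketch of that step using the standard library. The helper names write_private_key and ssl_connect_kwargs are hypothetical (they are not part of psycopg2 or the original answer), and the placeholder bytes stand in for the key that would actually be downloaded from the GCS bucket. It writes the key with owner-only read/write permissions (0600, slightly tighter than the S_IRWXU used above) into a temporary directory instead of the working directory, and collects the SSL keyword arguments in one place.

```python
import os
import stat
import tempfile


def write_private_key(path, pem_bytes):
    """Write a client key so that only the owner can read or write it (0600).

    Hypothetical helper for illustration; pem_bytes would be the key
    content fetched from the GCS bucket in the real job.
    """
    # Create the file with restrictive permissions from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(pem_bytes)
    # chmod again in case the file already existed with wider permissions.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


def ssl_connect_kwargs(cert_dir, sslmode="verify-full"):
    """Build the SSL-related keyword arguments for psycopg2.connect().

    The file names mirror the ones used in the answer above.
    """
    return {
        "sslmode": sslmode,
        "sslrootcert": os.path.join(cert_dir, "server-ca.pem"),
        "sslcert": os.path.join(cert_dir, "client-cert.pem"),
        "sslkey": os.path.join(cert_dir, "client-key.pem"),
    }


# Usage sketch: a per-process temporary directory avoids clashes when
# several workers share a machine. The bytes below are a placeholder.
cert_dir = tempfile.mkdtemp(prefix="cloudsql-certs-")
key_path = write_private_key(
    os.path.join(cert_dir, "client-key.pem"), b"placeholder key bytes"
)
ssl_kwargs = ssl_connect_kwargs(cert_dir)
```

The resulting dictionary could then be passed along with the other parameters, e.g. psycopg2.connect(host=host, hostaddr=hostaddr, dbname=dbname, user=user, password=password, **ssl_kwargs), assuming the server CA and client certificate have been written to the same directory.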


Update: The workaround solution can be found in the accepted answer above.
