I’m trying to run a simple Beam pipeline that extracts data from a BQ table using SQL and pushes it to a GCS bucket. My requirement is to pass the SQL from a file (a simple .sql file) rather than as a string, so the SQL stays modular. So far I’ve tried the following option, but it did not work:
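One way to do this, as a minimal sketch: read the .sql file at pipeline-construction time and hand the resulting string to ReadFromBigQuery, then write lines to GCS with WriteToText. The file path, bucket, and CSV formatting below are placeholder assumptions, not the asker's actual setup.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SQL_FILE = "queries/extract.sql"              # hypothetical local path to the .sql file
OUTPUT_PREFIX = "gs://my-bucket/exports/out"  # hypothetical GCS output prefix

# Read the query text once, at pipeline-construction time.
with open(SQL_FILE) as f:
    query = f.read()

options = PipelineOptions()  # add --project, --region, --temp_location, --runner=DataflowRunner as needed

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
        | "ToCsvLine" >> beam.Map(lambda row: ",".join(str(v) for v in row.values()))
        | "WriteToGCS" >> beam.io.WriteToText(OUTPUT_PREFIX, file_name_suffix=".csv")
    )
```

The key point is that the query only needs to be a string by the time the pipeline is constructed; keeping it in a .sql file does not require any special Beam support.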
Tag: google-cloud-dataflow
How to read data from JDBC and write to BigQuery using the Apache Beam Python SDK
I am trying to write a pipeline which will read data from JDBC (Oracle, MSSQL), do something, and write to BigQuery. I am struggling in the ReadFromJdbc step, where it is not able to convert to the correct schema type. My code: My data has three columns, two of which are VARCHAR and one is a timestamp. The error which I am facing while
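A rough sketch of the usual shape of such a pipeline, assuming Beam's cross-language ReadFromJdbc transform: the connection details, column names, and table names below are placeholders, and the timestamp handling assumes the JDBC value arrives as a Python datetime (it may need adjusting depending on driver and Beam version).

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder connection details for an Oracle source; adjust driver/URL for MSSQL.
JDBC_URL = "jdbc:oracle:thin:@//db-host:1521/service"
DRIVER = "oracle.jdbc.OracleDriver"

def to_bq_row(row):
    # ReadFromJdbc yields schema'd Rows; convert the timestamp column to an
    # ISO-8601 string so WriteToBigQuery can load it into a TIMESTAMP field.
    # (Assumes created_at arrives as a datetime; adapt if it is a Beam Timestamp.)
    return {
        "col_a": row.col_a,
        "col_b": row.col_b,
        "created_at": row.created_at.isoformat(),
    }

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadJdbc" >> ReadFromJdbc(
            table_name="my_table",
            driver_class_name=DRIVER,
            jdbc_url=JDBC_URL,
            username="user",
            password="secret",
            query="SELECT col_a, col_b, created_at FROM my_table",
        )
        | "ToDict" >> beam.Map(to_bq_row)
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="col_a:STRING,col_b:STRING,created_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

The explicit Map between the JDBC read and the BigQuery write is where schema-type mismatches (VARCHAR vs STRING, JDBC timestamps vs BigQuery TIMESTAMP) are typically resolved.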
ModuleNotFoundError in Dataflow job
I am trying to execute an Apache Beam pipeline as a Dataflow job in Google Cloud Platform. My project structure is as follows: Here’s my setup.py: Here’s my pipeline code: The functionality of the pipeline is: query against a BigQuery table, count the total records fetched from the query, and print using the custom Log module present in the utils folder. I am running the job
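The usual cause of ModuleNotFoundError on Dataflow workers is that local packages (such as the utils folder here) are never shipped to the workers. A common fix is a setup.py that packages them, passed to the job with `--setup_file`. A minimal sketch, with hypothetical names:

```python
# setup.py at the project root (hypothetical layout where utils/ contains __init__.py)
import setuptools

setuptools.setup(
    name="my-dataflow-pipeline",
    version="0.0.1",
    packages=setuptools.find_packages(),  # picks up utils/ and other local packages
    install_requires=[],                  # list third-party dependencies here
)
```

The job would then be launched with `--setup_file ./setup.py` (and, where module-level imports are used inside DoFns, `--save_main_session`) so that each worker installs the same package.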
How to read multiple JSON files from a GCS bucket in Google Dataflow (Apache Beam Python)
I have a bucket in GCS that contains a list of JSON files. I managed to extract the list of file names. Now I want to pass this list of filenames to Apache Beam and read them. I wrote this code, but it doesn’t seem like a good pattern. Have you faced the same issue before? Answer In the end
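One pattern that fits this case: turn the list of filenames into a PCollection and let ReadAllFromText expand each file, then parse the JSON. A minimal sketch, assuming newline-delimited JSON and placeholder bucket paths:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical list of GCS paths; this could also come from listing the bucket.
file_list = [
    "gs://my-bucket/data/a.json",
    "gs://my-bucket/data/b.json",
]

with beam.Pipeline(options=PipelineOptions()) as p:
    records = (
        p
        | "FileNames" >> beam.Create(file_list)
        | "ReadEachFile" >> beam.io.ReadAllFromText()  # emits one element per line of each file
        | "ParseJson" >> beam.Map(json.loads)
    )
```

If the file names follow a pattern, a simple glob such as `beam.io.ReadFromText("gs://my-bucket/data/*.json")` avoids listing the bucket yourself.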
Dataflow BigQuery-to-BigQuery pipeline executes on smaller data, but not on the large production dataset
A little bit of a newbie to Dataflow here, but I have successfully created a pipeline that works well. The pipeline reads in a query from BigQuery, applies a ParDo (NLP function) and then writes the data to a new BigQuery table. The dataset I am trying to process is roughly 500 GB with 46M records. When I try this with a
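For scale problems like this, the read and write methods matter as much as the ParDo. A sketch of the pipeline shape, with placeholder project, dataset, and schema, using batch file loads on the write side rather than streaming inserts:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class NlpFn(beam.DoFn):
    """Placeholder for the NLP ParDo; yields one enriched dict per input row."""
    def process(self, row):
        row["sentiment"] = 0.0  # stand-in for the real NLP call
        yield row

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadBQ" >> beam.io.ReadFromBigQuery(
            query="SELECT id, text FROM `my-project.my_dataset.source`",
            use_standard_sql=True,
        )
        | "Nlp" >> beam.ParDo(NlpFn())
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.enriched",
            schema="id:INTEGER,text:STRING,sentiment:FLOAT",
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # batch loads tend to cope better with ~46M rows
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```

With this shape, the 500 GB read is exported to temporary files by the BigQuery source and the write happens as load jobs, so worker memory pressure comes mainly from the ParDo itself.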
Libraries cannot be found on a Dataflow/Apache Beam job launched from CircleCI
I am having serious issues running a Python Apache Beam pipeline using the GCP Dataflow runner, launched from CircleCI. I would really appreciate it if someone could give any hint on how to tackle this; I’ve tried it all but nothing seems to work. Basically, I’m running this Python Apache Beam pipeline which runs in Dataflow and uses google-api-python-client-1.12.3. If I
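When a job is launched from CI, the workers only see what the pipeline options tell them to install, not what happens to be installed in the CI environment. A minimal sketch of declaring the dependency explicitly; project, region, and bucket names are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/tmp",    # placeholder
    requirements_file="requirements.txt",  # e.g. contains google-api-python-client==1.12.3
)
# Pickle the main session so module-level imports are available inside DoFns.
options.view_as(SetupOptions).save_main_session = True
```

The same effect can be had with `--requirements_file` (or `--setup_file` / `--extra_package`) on the command line that CircleCI runs.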
Start CloudSQL Proxy on Python Dataflow / Apache Beam
I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from Cloud SQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from App Engine using a cron job. I have a version which works locally using the DirectRunner.
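One common pattern for the custom ParDo part is to open the database connection in the DoFn's setup() method, so each worker connects once and the connection is reused across bundles. A sketch with placeholder credentials, which assumes the Dataflow workers can reach the Cloud SQL instance (private IP, or a proxy started on the worker):

```python
import apache_beam as beam
import psycopg2

class ReadFromCloudSql(beam.DoFn):
    def __init__(self, query):
        self.query = query
        self.conn = None

    def setup(self):
        # Placeholder connection details; assumes network access to Cloud SQL
        # from the worker (private IP or a locally running proxy).
        self.conn = psycopg2.connect(
            host="10.0.0.3", dbname="mydb", user="etl", password="secret"
        )

    def process(self, element):
        with self.conn.cursor() as cur:
            cur.execute(self.query)
            for row in cur:
                yield {"id": row[0], "name": row[1]}

    def teardown(self):
        if self.conn is not None:
            self.conn.close()

# Usage: seed the ParDo with a single element so the query runs once.
# p | beam.Create([None]) | beam.ParDo(ReadFromCloudSql("SELECT id, name FROM users"))
```

The same DoFn works under the DirectRunner and in a template; what changes between the two is only whether the worker can actually reach the database, which is where the proxy question comes in.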
Dataflow BigQuery to BigQuery
I am trying to create a Dataflow script that goes from BigQuery back to BigQuery. Our main table is massive and breaks the extraction capabilities. I’d like to create a simple table (as a result of a query) containing all the relevant information. The SQL query ‘SELECT * FROM table.orders WHERE paid = false LIMIT 10’ is a simple one
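A minimal sketch of that query-to-table pipeline; the project, dataset, and destination table names are placeholders, and a schema would be needed if the destination table does not already exist:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

QUERY = "SELECT * FROM `my-project.my_dataset.orders` WHERE paid = false LIMIT 10"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadOrders" >> beam.io.ReadFromBigQuery(query=QUERY, use_standard_sql=True)
        | "WriteSubset" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.orders_unpaid_sample",
            # schema="col:TYPE,..." is required here if the destination table must be created
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

For a query this simple, a scheduled BigQuery query writing to a destination table would also work; Dataflow mainly earns its keep once per-row processing is added between the read and the write.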