
Dataflow BigQuery-to-BigQuery pipeline runs on smaller data, but not on the large production dataset

A bit of a newbie to Dataflow here, but I have successfully created a pipeline that works well.

The pipeline reads a query from BigQuery, applies a ParDo (an NLP function), and then writes the data to a new BigQuery table.

The dataset I am trying to process is roughly 500GB with 46M records.

When I try this with a subset of the same data (about 300k records), it works just fine and is speedy, as shown below:

[screenshot: job metrics for the 300k-record run]

When I try to run this with the full dataset, it starts super fast, but then tapers off and ultimately fails. By the time the job failed, it had added about 900k elements (roughly 6-7 GB), and then the element count actually started decreasing.

[screenshot: job metrics for the full-dataset run]

I am using 250 workers and an n1-highmem-6 machine type.
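
For reference, those settings are passed as standard Dataflow pipeline options; a minimal sketch of how I set them (the project, region, and bucket values here are placeholders, not my real ones) looks like this:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    num_workers=250,
    max_num_workers=250,                  # cap autoscaling at 250 workers
    machine_type="n1-highmem-6",
)
```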

In the worker logs I get a few of these (about 10):

[worker log warning not shown]

This was one of the final warnings:

[final warning not shown]

In the execution details there are multiple of these:

[execution details entry not shown]

I assume these come from the larger texts in the dataset that take a while to process, so after a bit those items finish and the next ones start.

There are also a few of these:

[log entry not shown]

All of this to say, it's a bit perplexing to me and not completely intuitive, even though the service is awesome when it works.

I am executing the job from a Jupyter notebook (not using the interactive runner, just executing the script).

Main pipeline is below:

[pipeline code not shown]
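
For context, here is a minimal sketch of the shape described above (read a BigQuery query, apply an NLP ParDo, write to a new table); the query, table names, schema, and the body of the DoFn are simplified placeholders rather than my actual code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class NlpDoFn(beam.DoFn):
    """Placeholder for the NLP processing step."""
    def process(self, row):
        # `row` is a dict coming from the BigQuery source; the real function
        # runs an NLP model over the text field.
        yield {"id": row["id"], "processed_text": row["text"].lower()}


def run():
    options = PipelineOptions()  # in practice, the Dataflow options shown earlier
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                query="SELECT id, text FROM `my-project.my_dataset.source`",  # placeholder
                use_standard_sql=True)
            | "ApplyNLP" >> beam.ParDo(NlpDoFn())
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.output",  # placeholder
                schema="id:STRING, processed_text:STRING",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
        )


if __name__ == "__main__":
    run()
```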

What am I doing wrong? Is this a memory issue? Should I not be reading from and writing to BigQuery like this, and instead output to a file and create a table from that? I would love some help, and sorry for the long post; I wanted to provide as much context as possible.


Answer

Maybe late to the party, but I ran some tests on a BigQuery table containing 7.7M rows, with strings to process of approximately 350 words each.

I ran the same kind of pipeline as you:

  1. Read the data from BigQuery
  2. Clean the strings using the Python string library
  3. Get the lemmatized form of the strings using the spaCy fr_core_news_lg model (see the sketch after this list)
  4. Write the data back to BigQuery (to a different table)
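
A rough sketch of steps 2 and 3, assuming the text sits in a field called `text`; the model is loaded in `setup()` so each worker process loads it only once:

```python
import string

import apache_beam as beam


class LemmatizeDoFn(beam.DoFn):
    """Clean a string and lemmatize it with spaCy (sketch; field names assumed)."""

    def setup(self):
        # Load the model once per worker process instead of once per element.
        import spacy
        self._nlp = spacy.load("fr_core_news_lg")

    def process(self, row):
        text = row["text"]  # assumed field name
        # Step 2: basic cleaning with the standard string module.
        text = text.translate(str.maketrans("", "", string.punctuation)).lower()
        # Step 3: keep the lemmatized form of each token.
        row["lemmas"] = " ".join(token.lemma_ for token in self._nlp(text))
        yield row
```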

At the beginning I hit the same problem as you: the number of elements/sec going down over time.

I realized it was a RAM problem. I changed the machine_type from custom-1-3072 to custom-1-15360-ext and went from the same profile as yours to this one:

[screenshot: throughput profile with the larger custom-1-15360-ext workers]

I think Dataflow can handle a large number of rows with an NLP model, but you have to be sure you give the workers enough RAM.

Also, it is important to set number_of_worker_harness_threads=1 so that Dataflow will not spawn multiple threads (and thus split the RAM across them).
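
Put together, the relevant options look roughly like this (the project, region, and bucket values are placeholders, and this assumes a Beam SDK recent enough to expose number_of_worker_harness_threads for Python):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                    # placeholder
    region="europe-west1",                   # placeholder
    temp_location="gs://my-bucket/tmp",      # placeholder
    machine_type="custom-1-15360-ext",       # 1 vCPU, 15 GB of extended memory
    number_of_worker_harness_threads=1,      # one thread per worker so the RAM is not split
)
```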

You can also take a look at this Stack Overflow thread; the initial problem is the same.

One last thing: my workers' CPU utilization went from this:

[screenshot: CPU utilization without enough RAM]

to this:

[screenshot: CPU utilization after increasing the RAM]

The first pattern is also a sign of a lack of RAM.

EDIT: I ran my pipeline at the same data scale as yours to be sure my test wasn't biased, and the results are the same: the amount of RAM seems to be the key to making the job run smoothly:

[screenshot: test run on 38M rows]
