I have 3 files per date per name in this format: ‘nameXX_date’, here’s an example: ‘nameXX_01-01-20’ ‘nameXY_01-01-20’ ‘nameXZ_01-01-20’
where ‘name’ can be anything, and the date is whatever day the file was uploaded (almost every day).
I need to write a cloud function that triggers whenever a new file lands in the bucket, that combines the 3 XX,XY,XZ files into one file with filename = “name_date”.
Here’s what I’ve got so far:
bucket_id = 'bucketname' client = gcs.Client() bucket = client.get_bucket(bucket_id) name = date = outfile = f'bucketname/{name}_{date}.CSV' blobs = [] for shard in ('XX', 'XY', 'XZ'): sfile = f'{name}{shard}_{date}' blob = bucket.blob(sfile) if not blob.exists(): # this causes a retry in 60s raise ValueError(f'branch {sfile} not present') blobs.append(blob) bucket.blob(outfile).compose(blobs) logging.info(f'Successfullt created {outfile}') for blob in blobs: blob.delete() logging.info('Deleted {} blobs'.format(len(blobs)))
The issue I’m facing is that I’m not sure how to get the name and date of the new file that landed in the bucket, so that I can find the other 2 matching files and combine them
Btw, I’ve got this code from this article and I’m trying to implement it here: https://medium.com/google-cloud/how-to-write-to-a-single-shard-on-google-cloud-storage-efficiently-using-cloud-dataflow-and-cloud-3aeef1732325
Advertisement
Answer
As I understand, the cloud function is triggered by a google.storage.object.finalize
event on an object in the specific GCS bucket.
In that case your cloud function “signature” looks like (taken from the “medium” article you mentioned):
def compose_shards(data, context):
The data
is a dictionary with plenty of details about the object (file) has been finalized
. See some details here: Google Cloud Storage Triggers
For example, the data["name"]
– is the name of the object under discussion.
If you know the pattern/template/rule according to which those objects/shards are named, you can extract the relevant elements from an object/shard name, and use it to compose the target object/file name.