I am trying to create a pretty basic Glue job. I have two different AWS RDS MariaDB instances, each with a similar table (the field names differ). I would like to transform the data from table A so that it fits the table B schema (this seems pretty trivial and is working). Then I would like to update all existing entries (on
Tag: aws-glue
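A minimal sketch of the mapping-and-write part of such a job, assuming hypothetical column names, a placeholder JDBC URL for the source, and a pre-existing Glue connection for the target database:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read table A from the source MariaDB over JDBC (placeholder connection details)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",  # MariaDB speaks the MySQL protocol
    connection_options={
        "url": "jdbc:mysql://source-host:3306/source_db",
        "dbtable": "table_a",
        "user": "user",
        "password": "password",
    },
)

# Rename table A fields to match the table B schema (hypothetical columns)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("a_id", "int", "b_id", "int"),
        ("a_name", "string", "b_name", "string"),
    ],
)

# Note: a plain JDBC write appends rows; updating existing entries usually
# means writing to a staging table and running INSERT ... ON DUPLICATE KEY
# UPDATE against table B afterwards.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="mariadb-b-connection",  # assumed Glue connection name
    connection_options={"dbtable": "table_b", "database": "target_db"},
)
```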
Query S3 from Python
I am using Python to send a query to Athena and get table DDL. I am using the start_query_execution and get_query_execution functions in the awswrangler package. The code above creates a dict object that stores query results in an S3 link. The link can be accessed by res['ResultConfiguration']['OutputLocation']. It's a text link: s3://…..txt Can someone help me figure out how to access
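A small sketch of one way to read that output file, assuming a hypothetical database and DDL query; the bucket/key split mirrors the s3://… text link returned in ResultConfiguration:

```python
import boto3
import awswrangler as wr

# Kick off the query, then poll until Athena has finished writing the results
query_id = wr.athena.start_query_execution(
    sql="SHOW CREATE TABLE my_table",   # assumed DDL query
    database="my_database",             # assumed database name
)
res = wr.athena.wait_query(query_execution_id=query_id)

# The results file lives at the S3 text link from the question
output_location = res["ResultConfiguration"]["OutputLocation"]

# Split "s3://bucket/key" into bucket and key, then download the body
bucket, key = output_location.replace("s3://", "").split("/", 1)
body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
print(body.decode("utf-8"))
```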
How to Bulk insert data into MSSQL database in an AWS Glue Python shell job?
I have large sets of data in S3. In my Python Glue job, I extract data from those files into a pandas data frame, apply the necessary transformations on the data frame, and then load it into a Microsoft SQL database using the pymssql library. The final data frame contains an average of 100-200K rows and 180
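A rough sketch of a chunked executemany load with pymssql, assuming placeholder connection details and a target table whose columns match the data frame:

```python
import pymssql
import pandas as pd

def bulk_insert(df: pd.DataFrame, table: str, chunk_size: int = 10_000) -> None:
    # Placeholder connection details
    conn = pymssql.connect(
        server="my-sql-server.example.com",
        user="my_user",
        password="my_password",
        database="my_database",
    )
    columns = ", ".join(df.columns)
    placeholders = ", ".join(["%s"] * len(df.columns))
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"

    rows = list(df.itertuples(index=False, name=None))
    cursor = conn.cursor()
    try:
        # Insert in chunks to keep memory use and round trips manageable
        for start in range(0, len(rows), chunk_size):
            cursor.executemany(sql, rows[start:start + chunk_size])
            conn.commit()
    finally:
        conn.close()
```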
AWS Glue python shell – Using multiple libraries
I was using the AWS Glue Python shell. The program uses multiple Python libraries which are not natively available in AWS. Glue can take .egg or .whl files for external library references. All we need to do is put these .egg or .whl files in some S3 location and point to them using their full paths. I tried with one external library
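A sketch of wiring several wheels into a Python shell job via the --extra-py-files argument, here shown with boto3 create_job; the bucket, role and job names are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-python-shell-job",                          # assumed job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # assumed role ARN
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Comma-separated list of the .whl/.egg files uploaded to S3
        "--extra-py-files": "s3://my-bucket/libs/lib_one.whl,s3://my-bucket/libs/lib_two.whl",
    },
)
```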
How to make a connection from AWS Glue Catalog tables to a custom Python shell script?
I have some tables in the AWS Glue Data Catalog which have been created by crawling the data from S3 buckets. I am writing my own Python shell script to perform some data transformations on the data in those tables. But how can I make the connection to those tables in the Data Catalog via a Python script? Answer If you want to access Glue catalog
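A minimal sketch, assuming a hypothetical catalog database and table: boto3 fetches the table metadata, and the S3 location it points at can then be read into pandas (here via awswrangler) for the transformations:

```python
import boto3
import awswrangler as wr

glue = boto3.client("glue")

# Look up the catalog entry created by the crawler (placeholder names)
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
s3_location = table["StorageDescriptor"]["Location"]

# Read the underlying data straight from S3 into a pandas DataFrame
df = wr.s3.read_parquet(path=s3_location)  # or read_csv, depending on the format
```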
Col names not detected – AnalysisException: Cannot resolve 'Name' given input columns 'col10'
I'm trying to run a transformation function in a PySpark script: My dataset looks like this: My desired output is something like this: However, the last code line gives me an error similar to this: When I check: I see 'col1', 'col2', etc. in the first row instead of the actual labels (["Name", "Type"]). Should I separately remove and
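A small sketch of the usual fix, assuming the data comes from a CSV read without header=True; in that case Spark assigns generic column names and keeps the label row ("Name", "Type", …) as ordinary data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv(
    "s3://my-bucket/path/to/data.csv",  # assumed input path
    header=True,        # promote the first row to column names
    inferSchema=True,   # infer column types instead of treating everything as string
)

# The real labels are now resolvable column names
df.select("Name", "Type").show()
```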
Get tables from AWS Glue using boto3
I need to harvest table and column names from the AWS Glue crawler metadata catalogue. I used boto3 but keep getting only 100 tables even though there are more. Setting up NextToken doesn't help. Please help if possible. The desired result is a list as follows: lst = [table_one.col_one, table_one.col_two, table_two.col_one….table_n.col_n] UPDATED code, still need to have tablename+columnname: Answer Adding sub-loop did
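A sketch using the boto3 paginator for get_tables, which follows NextToken automatically and so gets past the 100-table page size; the database name is a placeholder:

```python
import boto3

glue = boto3.client("glue")
paginator = glue.get_paginator("get_tables")

lst = []
for page in paginator.paginate(DatabaseName="my_database"):
    for table in page["TableList"]:
        # Collect "tablename.columnname" for every column of every table
        for column in table["StorageDescriptor"]["Columns"]:
            lst.append(f"{table['Name']}.{column['Name']}")

print(lst)  # e.g. ['table_one.col_one', 'table_one.col_two', ...]
```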
AWS Glue python install – Could not find a version
I am trying to use the AWSGlue module in Python, but cannot install the module in the terminal. Is there a way around this, or is there a way I can download this from a third party? Does anyone have this AWSGlue module working? Any help would be appreciated. Answer I believe the awsglue package is only available in the images
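One hedged workaround sketch for local development: guard the import so the same script can be opened and tested locally, where awsglue cannot be pip-installed, and still run unchanged inside the Glue job environment (or the aws-glue-libs images):

```python
# Guarded import: awsglue exists inside the Glue environment but not on PyPI
try:
    from awsglue.context import GlueContext
    HAVE_AWSGLUE = True
except ImportError:
    GlueContext = None
    HAVE_AWSGLUE = False
```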
AWS region in AWS Glue
How can I get the region in which the current Glue job is executing? When the Glue job starts executing, I see the output Detected region eu-central-1. In AWS Lambda, I can use the following lines to fetch the current region: However, it seems like the AWS_REGION environment variable is not present in Glue and therefore a KeyError is raised:
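A minimal sketch of one workaround: instead of reading the AWS_REGION environment variable (which raises KeyError in Glue), ask boto3 for the region of the default session:

```python
import boto3

# boto3 resolves the region the job is running in from its own configuration
region = boto3.session.Session().region_name
print(f"Detected region {region}")
```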