
How to bulk insert data into an MSSQL database in an AWS Glue Python shell job?

I have large sets of data in S3. In my Python Glue job, I extract the data from those files into a pandas data frame, apply the necessary transformations, and then load it into a Microsoft SQL Server database using the pymssql library. The final data frame contains on average 100-200K rows and 180 columns. Currently I am using pymssql to connect to the database. The problem is that executemany on the cursor takes too long to load the data: roughly 20 minutes for 100K rows. I checked the logs, and it is always the loading step that is slow. How can I load the data faster? My code is below:

import datetime

import boto3
import numpy as np
import pandas as pd

# S3_BUCKET_NAME, each_file and db_cursor (a pymssql cursor) come from the surrounding job
s3 = boto3.client("s3")
all_data = []

# Extract: read the CSV from S3 in chunks and combine into one data frame
file = s3.get_object(Bucket=S3_BUCKET_NAME, Key=each_file)
for chunk in pd.read_csv(file['Body'], sep=",", header=None, low_memory=False, chunksize=100000):
    all_data.append(chunk)

data_frame = pd.concat(all_data, axis=0)
all_data.clear()

# Transform: trim strings, convert empty strings to NaN, add a constant column
cols = data_frame.select_dtypes(object).columns
data_frame[cols] = data_frame[cols].apply(lambda x: x.str.strip())
data_frame.replace(to_replace='', value=np.nan, inplace=True)
data_frame.fillna(value=np.nan, inplace=True)
data_frame.insert(0, 'New-column', 1111)

# Load: NaN -> None so it maps to SQL NULL, then bulk insert with executemany
sql_data_array = data_frame.replace({np.nan: None}).to_numpy()
sql_data_tuple = tuple(map(tuple, sql_data_array))

try:
    sql = "insert into [db].[schema].[table](column_names)values(%d,%s,%s,%s,%s,%s...)"
    db_cursor.executemany(sql, sql_data_tuple)
    print("loading completed on {}".format(datetime.datetime.now()))
except Exception as e:
    print(e)


Answer

I ended up doing the following, which gave me much better results (1 million rows in 11 minutes). Use a Glue 2.0 Python job instead of a Python shell job (a job-configuration sketch is at the end of this answer):

  1. Extracted the data from S3.

  2. Transformed it using pandas.

  3. Uploaded the transformed file as a CSV to S3.

  4. Created a dynamic frame from a catalog table that was built by a crawler over the transformed CSV file (a crawler sketch follows this list). Alternatively, you can create the dynamic frame directly from S3 using from_options (sketched after the code at the end).

  5. Synced the dynamic frame to the catalog table that was built by a crawler over the destination MSSQL table (covered in the same crawler sketch).
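
Steps 4 and 5 both depend on Glue crawlers and the catalog tables they create. A minimal boto3 sketch of setting up those crawlers is below; the crawler names, IAM role, catalog databases, S3 path and JDBC connection name are all assumptions, not values from my actual job.

    import boto3

    glue = boto3.client("glue", region_name=AWS_REGION)

    # Crawler over the transformed CSV in S3 (source catalog table, step 4)
    glue.create_crawler(
        Name="transformed-csv-crawler",          # assumed name
        Role="MyGlueServiceRole",                # assumed IAM role
        DatabaseName="source_db",                # assumed catalog database
        Targets={"S3Targets": [{"Path": "s3://my-bucket/transformed/"}]},  # assumed path
    )

    # Crawler over the destination MSSQL table via a Glue JDBC connection (step 5)
    glue.create_crawler(
        Name="mssql-destination-crawler",
        Role="MyGlueServiceRole",
        DatabaseName="destination_db",
        Targets={"JdbcTargets": [{"ConnectionName": "my-mssql-connection",
                                  "Path": "db/schema/table"}]},
    )

    glue.start_crawler(Name="transformed-csv-crawler")
    glue.start_crawler(Name="mssql-destination-crawler")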

    from io import StringIO

    import boto3
    import numpy as np
    import pandas as pd

    s3 = boto3.client("s3", region_name=AWS_REGION)
    s3_resource = boto3.resource("s3", region_name=AWS_REGION)
    csv_buffer = StringIO()
    all_data = []

    # Extract: read the raw CSV from S3 in chunks
    file = s3.get_object(Bucket=S3_BUCKET_NAME, Key=each_file)
    for chunk in pd.read_csv(file['Body'], sep=",", header=None, low_memory=False, chunksize=100000):
        all_data.append(chunk)

    data_frame = pd.concat(all_data, axis=0)
    all_data.clear()

    # Transform: trim strings, convert empty strings to NaN, add a constant column
    cols = data_frame.select_dtypes(object).columns
    data_frame[cols] = data_frame[cols].apply(lambda x: x.str.strip())
    data_frame.replace(to_replace='', value=np.nan, inplace=True)
    data_frame.fillna(value=np.nan, inplace=True)
    data_frame.insert(0, 'New-column', 1234)

    # Upload the transformed data frame back to S3 as a CSV
    data_frame.to_csv(csv_buffer)
    result = s3_resource.Object(S3_BUCKET_NAME, 'path in s3').put(Body=csv_buffer.getvalue())

    # Source: catalog table created by the crawler over the transformed CSV in S3
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "source db name", table_name = "source table name", transformation_ctx = "datasource0")

    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [mappings], transformation_ctx = "applymapping1")

    selectfields2 = SelectFields.apply(frame = applymapping1, paths = [column names of destination catalog table], transformation_ctx = "selectfields2")

    resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "destination db name", table_name = "destination table name", transformation_ctx = "resolvechoice3")

    resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")

    # Sink: catalog table created by the crawler over the destination MSSQL table
    datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "destination db name", table_name = "destination table name", transformation_ctx = "datasink5")

    job.commit()
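
As mentioned in step 4, instead of crawling the transformed CSV you can build the dynamic frame straight from S3 with from_options. A minimal sketch, assuming the CSV was written with a header row to an assumed path:

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/transformed/"]},  # assumed path
        format="csv",
        format_options={"withHeader": True, "separator": ","},
        transformation_ctx="datasource0",
    )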
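
On the job type itself: the speedup comes from running this as a Glue 2.0 Spark job rather than a Python shell job. A rough boto3 sketch of that configuration, with assumed names and worker sizing:

    import boto3

    glue = boto3.client("glue", region_name=AWS_REGION)
    glue.create_job(
        Name="mssql-bulk-load",                  # assumed job name
        Role="MyGlueServiceRole",                # assumed IAM role
        GlueVersion="2.0",                       # Glue 2.0, not a Python shell job
        Command={"Name": "glueetl",              # Spark ETL job type
                 "ScriptLocation": "s3://my-bucket/scripts/load_job.py",  # assumed script path
                 "PythonVersion": "3"},
        WorkerType="G.1X",                       # assumed sizing
        NumberOfWorkers=10,
    )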
