I'm using Spark to process my data, like this:
dataframe_mysql = spark.read.format('jdbc').options(
    url='jdbc:mysql://xxxxxxx',
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='(select * from test_table where id > 100) t',
    user='xxxxxx',
    password='xxxxxx'
).load()
result = spark.sparkContext.parallelize(dataframe_mysql, 1)
But I got this error from Spark:
Traceback (most recent call last):
  File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in <module>
    process()
  File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process
    result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 143, in _write_with_length
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 427, in dumps
TypeError: cannot pickle '_thread.RLock' object
Am I using it wrong? How should I use SparkContext.parallelize with a DataFrame?
Answer
I got it: dataframe_mysql is already a DataFrame. If you want an RDD, just use dataframe_mysql.rdd instead of parallelize.
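
For completeness, a minimal sketch of the corrected flow, reusing the placeholder connection details from the question; the func below is a trivial stand-in for whatever per-row transformation the original script applies:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rds_to_parquet').getOrCreate()

# Same JDBC options as in the question (connection values are placeholders).
dataframe_mysql = spark.read.format('jdbc').options(
    url='jdbc:mysql://xxxxxxx',
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='(select * from test_table where id > 100) t',
    user='xxxxxx',
    password='xxxxxx'
).load()

# Hypothetical stand-in for the original per-row function.
def func(row):
    return row.asDict()

# .rdd exposes the DataFrame's underlying distributed RDD of Row objects,
# so nothing has to be pickled on the driver.
result = dataframe_mysql.rdd.map(func)

parallelize is only meant to turn a local Python collection (e.g. a list) into an RDD; passing a DataFrame makes Spark try to pickle the whole object, including its handle to the JVM, which is where the "cannot pickle '_thread.RLock' object" error comes from.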