
Why do I get TypeError: cannot pickle '_thread.RLock' object when using PySpark

I'm using Spark to process my data, like this:

        dataframe_mysql = spark.read.format('jdbc').options(
            url='jdbc:mysql://xxxxxxx',
            driver='com.mysql.cj.jdbc.Driver',
            dbtable='(select * from test_table where id > 100) t',
            user='xxxxxx',
            password='xxxxxx'
        ).load()

        result = spark.sparkContext.parallelize(dataframe_mysql, 1)

But I got this error from Spark:

        Traceback (most recent call last):
          File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 46, in <module>
            process()
          File "/private/var/www/http/hawk-scripts/hawk_etl/scripts/spark_rds_to_parquet.py", line 36, in process
            result = spark.sparkContext.parallelize(dataframe_mysql, 1).map(func)
          File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 574, in parallelize
          File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/context.py", line 611, in _serialize_to_jvm
          File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
          File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
          File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 143, in _write_with_length
          File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 427, in dumps
        TypeError: cannot pickle '_thread.RLock' object

Am I using it wrong? How should I use SparkContext.parallelize to work with a DataFrame?

Answer

I got it: dataframe_mysql is already a DataFrame, not a local collection. SparkContext.parallelize is meant for distributing a local Python collection across the cluster, and pickling a DataFrame drags along its reference to the JVM-backed SparkSession, which holds unpicklable thread locks, hence the _thread.RLock error. If you want an RDD, just use dataframe_mysql.rdd instead of parallelize.
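
A minimal sketch of the corrected flow, reusing the JDBC options from the question (the connection values are the question's placeholders, the MySQL JDBC driver is assumed to be on the classpath, and func from the traceback is replaced here by a hypothetical row-to-dict mapping):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        dataframe_mysql = spark.read.format('jdbc').options(
            url='jdbc:mysql://xxxxxxx',
            driver='com.mysql.cj.jdbc.Driver',
            dbtable='(select * from test_table where id > 100) t',
            user='xxxxxx',
            password='xxxxxx'
        ).load()

        # A DataFrame already wraps an RDD of Row objects, so there is
        # nothing to parallelize; take the existing RDD instead.
        rdd = dataframe_mysql.rdd

        # If a single partition is needed (the original passed numSlices=1
        # to parallelize), coalesce the existing RDD down to one partition.
        result = rdd.coalesce(1).map(lambda row: row.asDict())  # stand-in for map(func)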
