I currently have an AWS EMR cluster with a notebook linked to that same cluster. I would like to load a spaCy model (en_core_web_sm), but first I need to download the model, which is usually done with python -m spacy download en_core_web_sm, and I really can't find how to do it in a PySpark session. Here is my config: I'm
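A minimal sketch of one approach, assuming an EMR Notebooks session where sc.install_pypi_package is available (notebook-scoped libraries) and treating spacy.cli.download as the programmatic equivalent of the CLI command; note the download happens on the driver, not on the executors:

```python
# Assumes an EMR Notebooks PySpark session; sc.install_pypi_package is
# the notebook-scoped install mechanism EMR provides.
sc.install_pypi_package("spacy")

import spacy
from spacy.cli import download

# Programmatic equivalent of `python -m spacy download en_core_web_sm`.
# This installs the model on the driver only; executors would need the
# model installed or shipped separately.
download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")
```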
Tag: pyspark
Removing non-ASCII and special characters in a PySpark dataframe column
I am reading data from CSV files which have about 50 columns; a few of the columns (4 to 5) contain text data with non-ASCII characters and special characters. I am trying to remove all the non-ASCII and special characters and keep only English characters, and I tried to do it as below. There are no spaces in my column name. I
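A minimal sketch of the usual regexp_replace approach, with a hypothetical column name text_col standing in for the real one:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("café — déjà vu!",), ("plain ascii text",)],
    ["text_col"],  # hypothetical column name
)

# Keep only English letters, digits and spaces; everything else
# (non-ASCII plus special characters) is stripped out.
cleaned = df.withColumn(
    "text_clean",
    F.regexp_replace(F.col("text_col"), r"[^a-zA-Z0-9 ]", ""),
)
cleaned.show(truncate=False)
```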
Adjusting incorrect data of a CSV file in a PySpark dataframe
I am trying to read a CSV file into a dataframe in PySpark, but the CSV file has mixed data: part of its data belongs to the adjacent column. Is there any way to modify the dataframe in Python to get the expected output dataframe? Sample CSV Expected Output Answer You can do this by making use
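The sample data and answer are elided here, but a common pattern for this kind of repair is a when/otherwise shift between adjacent columns; the columns and the "looks misplaced" rule below are purely hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: a city value spilled into the "age" column on one row.
df = spark.createDataFrame(
    [("alice", "30", "london"), ("bob", "paris", None)],
    ["name", "age", "city"],
)

# When "age" is not numeric, treat it as a misplaced city value:
# move it to "city" and null out "age".
fixed = (
    df.withColumn(
        "city",
        F.when(~F.col("age").rlike(r"^\d+$"), F.col("age")).otherwise(F.col("city")),
    )
    .withColumn(
        "age",
        F.when(F.col("age").rlike(r"^\d+$"), F.col("age")),
    )
)
fixed.show()
```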
Most efficient way of transforming a date column to a timestamp column + an hour
I want to know if there is a better way of transforming a date column into a datetime column + 1 hour than the method I am currently using. Here is my dataframe: My code: Which gives the output: Does anyone know a more efficient way of doing this? Casting to a timestamp twice seems a bit clumsy. Many thanks.
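One way to avoid the double cast is a single cast plus an INTERVAL expression; a small sketch with a hypothetical date_col:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2021-03-01",), ("2021-03-02",)], ["date_col"])

# Cast to timestamp once, then shift by one hour with an interval,
# instead of casting twice.
result = df.withColumn(
    "ts_plus_hour",
    F.col("date_col").cast("timestamp") + F.expr("INTERVAL 1 HOUR"),
)
result.show(truncate=False)
```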
PySpark groupBy DataFrame without aggregation or count
Is it possible to iterate through the PySpark groupBy dataframe without aggregation or count? For example code in Pandas: Answer At best you can use .first and .last to get the respective values from the groupBy, but not everything in the way you can in pandas. ex: Since there is a basic difference between the way the data is handled in pandas
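A short sketch of what the answer points at, with hypothetical group_col/value columns: first/last as the closest equivalents, and collect_list if you really need every value of a group in one place:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["group_col", "value"],
)

# Closest equivalents to "peeking" at a group without a real aggregation.
df.groupBy("group_col").agg(
    F.first("value").alias("first_value"),
    F.last("value").alias("last_value"),
).show()

# If per-group iteration is really needed, collect_list gathers all values
# of a group into one array (only sensible for small groups).
df.groupBy("group_col").agg(F.collect_list("value").alias("values")).show()
```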
Read Avro files in PySpark with PyCharm
I'm quite new to Spark. I've imported the pyspark library into a PyCharm venv and wrote the code below. Everything seems to be okay, but when I want to read an Avro file I get the message: pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section
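The error means the spark-avro module is not on the classpath; one way to add it from code is the spark.jars.packages config. A sketch, where the artifact version (Spark 3.1.2 / Scala 2.12) and the file path are only placeholders that must match your own Spark install:

```python
from pyspark.sql import SparkSession

# spark-avro ships separately from Spark, so it has to be pulled in as a
# package; the version below is an example and must match your Spark/Scala.
spark = (
    SparkSession.builder
    .appName("read-avro")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
    .getOrCreate()
)

df = spark.read.format("avro").load("/path/to/file.avro")
df.show()
```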
PySpark UDF returns null when the function works in a Pandas dataframe
I'm trying to create a user-defined function that takes a cumulative sum of an array and compares the value to another column. Here is a reproducible example: In Pandas, this is the output: In Spark, using temp_sdf.withColumn('len', test_function_udf('x_ary', 'y')), all of len ends up being null. Would anyone know why this is the case? Also, replacing cumsum_array = np.cumsum(np.flip(x_ary)) fails
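A common cause of all-null UDF output is returning numpy types, which do not map to the declared Spark SQL type. A sketch of a UDF (count_cumsum_below is a hypothetical name) that converts the result to a plain Python int before returning:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([([1, 2, 3], 4), ([2, 2, 2], 3)], ["x_ary", "y"])

# Returning numpy scalars/arrays from a UDF does not match the declared
# Spark SQL type, so the value silently becomes null; cast to Python types.
@F.udf(IntegerType())
def count_cumsum_below(x_ary, y):
    cumsum_array = np.cumsum(x_ary[::-1])   # reverse the list, then cumsum
    return int((cumsum_array <= y).sum())   # numpy int -> plain Python int

df.withColumn("len", count_cumsum_below("x_ary", "y")).show()
```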
Join two partitioned dataframes in PySpark
I have two dataframes with partition level 2. The dataframes are small, probably around 100 rows each. df1: df2: My final df will be a join of df1 and df2 based on columnindex. But when I join the two dataframes as below, it looks like it is shuffling and giving me incorrect results. Is there any way I can
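For dataframes this small, a broadcast hint is the usual way to avoid the shuffle; a sketch with made-up column contents, keeping the columnindex join key from the question:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["columnindex", "val1"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["columnindex", "val2"])

# Broadcasting the smaller side ships it to every executor, so the join
# happens locally without shuffling either dataframe.
joined = df1.join(F.broadcast(df2), on="columnindex", how="inner")
joined.show()
```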
I need to append only the columns that have non-null values in a PySpark dataframe
I have a PySpark dataframe (df) with the sample table below (table1):

id, col1, col2, col3
1, abc, null, def
2, null, def, abc
3, def, abc, null

I am trying to get a new column (final) by appending all the columns while ignoring null values. I have tried PySpark code and used f.array(col1, col2, col3). Values are getting appended
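A small sketch using concat_ws, which, unlike array()/concat(), simply skips null values, so only the non-null columns end up in the final string:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "abc", None, "def"), (2, None, "def", "abc"), (3, "def", "abc", None)],
    ["id", "col1", "col2", "col3"],
)

# concat_ws drops nulls, so rows keep only their non-null values.
result = df.withColumn("final", F.concat_ws(",", "col1", "col2", "col3"))
result.show()
```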
Extract multiple words using regexp_extract in PySpark
I have a list which contains some words, and I need to extract the matching words from a text line. I found this, but it only extracts one word. keys file content: this is a keyword part_description file content: 32015 this is a keyword hello world Code Outputs Expected output I want to return all matching keywords and their count and
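Since regexp_extract only returns the first match, one option is the SQL function regexp_extract_all (Spark 3.1+) via expr, with size() for the count; the keyword list below is a stand-in for the real keys file content:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

keywords = ["keyword", "hello", "world"]   # stand-in for the keys file content
pattern = "(" + "|".join(keywords) + ")"

df = spark.createDataFrame(
    [("32015 this is a keyword hello world",)], ["part_description"]
)

# regexp_extract_all returns every match as an array; size() counts them.
result = df.withColumn(
    "matches", F.expr(f"regexp_extract_all(part_description, '{pattern}', 1)")
).withColumn("match_count", F.size("matches"))
result.show(truncate=False)
```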