I’m trying to draw a histogram using PySpark in a Zeppelin notebook. Here is what I have tried so far. This code runs without any errors, but it does not give the expected plot. So I googled and found this documentation. According to it, I tried to enable the angular flag as follows, but now I’m getting an error: No module named
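A minimal sketch of one common way to plot a histogram from Zeppelin, assuming matplotlib is installed on the driver and Zeppelin's matplotlib integration is enabled; `df` and the `value` column are placeholder names:

```python
import matplotlib.pyplot as plt

# Compute the bins on the executors and collect only the small
# histogram result (bucket edges + counts) back to the driver.
bins, counts = df.select("value").rdd.flatMap(lambda row: row).histogram(20)

widths = [right - left for left, right in zip(bins, bins[1:])]
plt.bar(bins[:-1], counts, width=widths, align="edge")
plt.show()  # Zeppelin renders the active matplotlib figure inline
```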
spark-nlp ‘JavaPackage’ object is not callable
I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code: I get the following error: I read a few of the GitHub issues developers raised in the spark-nlp repo, but the fixes are not working for me. I am wondering if the use of pyenv is causing problems, but it works
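For context, this error is usually reported when the spark-nlp JAR never reaches the JVM classpath, so the Python wrapper has no Java class to call. A hedged sketch of two common ways to start the session with the JAR attached; the version in the Maven coordinate is only an example and should match the installed spark-nlp Python package:

```python
import sparknlp
from pyspark.sql import SparkSession

# Option 1: let spark-nlp build the session and pull in the matching JAR itself.
spark = sparknlp.start()

# Option 2: build the session yourself and declare the JAR explicitly.
spark = (SparkSession.builder
         .appName("spark-nlp")
         .config("spark.jars.packages",
                 "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0")  # example version
         .getOrCreate())
```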
How to calculate cumulative sum over date range excluding weekends in PySpark 2.0?
This is an extension of an earlier question I raised here: How to calculate difference between dates excluding weekends in PySpark 2.2.0. My Spark DataFrame looks like the one below and can be generated with the accompanying code: I am trying to calculate cumulative sums over periods of 2, 3, 4, 5 and 30 days. Below is a sample code for 2 days and
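A rough sketch of one possible approach (not the asker's code): drop weekend rows, index the remaining business days, and run a range window over that index. All column names (`id`, `date`, `value`) are placeholders:

```python
from pyspark.sql import functions as F, Window

# Drop weekend rows; date_format with "E" returns the day-of-week abbreviation.
weekdays = df.filter(~F.date_format("date", "E").isin("Sat", "Sun"))

# Give each business day a consecutive index so "the last N business days"
# becomes a range of N - 1 index steps back from the current row.
day_idx = F.dense_rank().over(Window.partitionBy("id").orderBy("date"))
indexed = weekdays.withColumn("day_idx", day_idx)

# 2-day cumulative sum: current business day plus the previous one;
# use -2, -3, -4 or -29 for the other window lengths.
w2 = Window.partitionBy("id").orderBy("day_idx").rangeBetween(-1, 0)
result = indexed.withColumn("sum_2d", F.sum("value").over(w2))
```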
Interpolation in PySpark throws java.lang.IllegalArgumentException
I don’t know how to interpolate in PySpark when the DataFrame contains many columns. Let me explain. I need to group by webID and interpolate the counts values at 1-minute intervals. However, when I apply the code shown below, I get an error: Answer Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1. https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x
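The linked documentation suggests setting the variable in conf/spark-env.sh. As a sketch, it can also be set programmatically, assuming spark.executorEnv.* is honoured in your deployment (YARN modes may need additional settings):

```python
import os
from pyspark.sql import SparkSession

# The variable must be visible to both the driver and the executors.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # driver side

spark = (SparkSession.builder
         .appName("interpolation")
         # spark.executorEnv.* forwards an environment variable to executor processes
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
```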
Comma-separated data in RDD (PySpark): index out of bounds problem
I have a CSV file which is comma separated. One of the columns contains data which is itself comma separated. Each row in that specific column has a different number of words, hence a different number of commas. When I access it or perform any sort of operation such as filtering (after splitting the data), it throws errors in PySpark. How shall I
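The asker's code isn't shown, but here is a sketch of a DataFrame-based way to split the nested field without indexing past the end of short rows; the file path and column names are hypothetical:

```python
from pyspark.sql import functions as F

# "raw" stands in for the column holding the nested comma-separated text.
df = (spark.read.option("header", True).option("quote", '"')
           .csv("/path/to/file.csv"))

tokens = F.split(F.col("raw"), ",")
df = (df.withColumn("tokens", tokens)
        .withColumn("n_tokens", F.size(tokens))
        # Guard against short rows before reading a fixed position,
        # so a missing element yields null instead of an out-of-bounds error.
        .withColumn("third", F.when(F.size(tokens) > 2, tokens.getItem(2))))
```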
How do I get VSCode to recognize Python3 as my default
I have python3 installed on my Mac, and when I’m in the terminal I use python3 by default. However, when I’m in VSCode it is not recognizing python3 as my default; it’s still pulling in python2.7. Here is a screenshot of my VSCode environment: I have Code Runner with python3 selected, as well as my interpreter set to 3.8. When I run my
Why do we use PySpark UDFs when Python functions are faster than them? (Note: not worrying about Spark SQL commands)
I have a dataframe: Output: Now, a simple operation (add 1 to ‘v’) can be done via SQL functions and a UDF. If we ignore the SQL (the best performing option), we can create a UDF as: and call it: Time: 16.5 sec. But here is my question: if I DO NOT use a udf and directly write: Time taken: 352 ms. In a nutshell,
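The timed snippets aren't reproduced here, so the following is only a sketch of the two approaches being compared, with placeholder sizes and names. The built-in column expression is evaluated entirely inside the JVM by Catalyst, while the UDF pays a Python serialisation round-trip:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df = spark.range(10_000_000).withColumn("v", F.rand())

# UDF: every value is shipped to a Python worker and back, hence the slowdown.
add_one = F.udf(lambda v: v + 1.0, DoubleType())
df.withColumn("v2", add_one("v")).count()

# Built-in expression: no Python round-trip, optimised by Catalyst.
df.withColumn("v2", F.col("v") + 1).count()
```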
How to get the N most recent dates in Pyspark
Is there a way to get the most recent 30 days’ worth of records for each grouping of data in PySpark? In this example, get the 2 records with the most recent dates within the groupings of (Grouping, Bucket). So a table like this would turn into this: Edit: I reviewed my question after the edit and realized that not doing
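One standard way to do this, sketched with placeholder column names (Grouping and Bucket come from the question; Date is assumed):

```python
from pyspark.sql import functions as F, Window

# Rank rows newest-first within each (Grouping, Bucket) pair.
w = Window.partitionBy("Grouping", "Bucket").orderBy(F.col("Date").desc())

n = 2  # keep the N most recent rows per group
recent = (df.withColumn("rk", F.row_number().over(w))
            .filter(F.col("rk") <= n)
            .drop("rk"))
```

For “the last 30 days” rather than “the 30 most recent rows”, one would instead compare Date against the per-group maximum date minus a 30-day interval.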
Uploading files from Azure Blob Storage to SFTP location using Databricks?
I have a scenario where I need to copy files from Azure Blob Storage to an SFTP location in Databricks. Is there a way to achieve this using PySpark or Scala? Answer Regarding the issue, please refer to the following steps (I use Scala): mount the Azure Blob Storage container to DBFS, then copy the files to the cluster’s local file system. Code.
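The answer's code is in Scala; a rough Python equivalent of the same steps might look like the sketch below (mount, copy locally, then push over SFTP with paramiko, which would need to be installed on the cluster). Every name, path, and credential is a placeholder:

```python
import paramiko

account = "<storage-account>"
container = "<container>"
mount_point = "/mnt/blob"

# 1. Mount the Blob Storage container to DBFS (dbutils is provided by Databricks notebooks).
dbutils.fs.mount(
    source=f"wasbs://{container}@{account}.blob.core.windows.net",
    mount_point=mount_point,
    extra_configs={f"fs.azure.account.key.{account}.blob.core.windows.net": "<access-key>"})

# 2. Copy the file from DBFS to the cluster's local file system.
dbutils.fs.cp(f"{mount_point}/data.csv", "file:/tmp/data.csv")

# 3. Upload the local copy to the SFTP server.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="<user>", password="<password>")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("/tmp/data.csv", "/upload/data.csv")
sftp.close()
transport.close()
```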
PySpark: how to code a complicated DataFrame algorithm problem (summing with a condition)
I have a dataframe that looks like this: date: sorted nicely. Trigger: only T or F. value: any random decimal (float) value. col1: represents a number of days and cannot be lower than -1 (-1 <= col1 < infinity). col2: represents a number of days and cannot be negative (col2 >= 0). Calculation logic: If col1 ==