Skip to content

Tag: apache-spark

PySpark: Performing One-Hot-Encoding

I need to perform classification task on a dataset which consists categorical variables. I performed the one-hot encoding on that data. But I am confused that whether I am doing it right way or not. Step 1: Lets say, for example, this is a dataset: Step 2: After performing one-hot encoding it gives this data: Step 3: Here the fourth

Replicate a function from pandas into pyspark

I am trying to execute the same function on a spark dataframe rather than pandas. Answer A direct translation would require you to do multiple collect for each column calculation. I suggest you do all calculations for columns in the dataframe as a single row and then collect that row. Here’s an example. Calculate percentage of whitespace values and number

I am getting error while defining H2OContext in python spark script

Code: I am using spark standalone cluster 3.2.1 and try to initiate H2OContext in python file. while trying to run the script using spark-submit, i am getting following error: Spark-submit command: spark-submit –master spark://local:7077 –packages ai.h2o:sparkling-water-package_2.12: spark_h20/ Answer The parameter –packages ai.h2o:sparkling-water-package_2.12: downloads a jar artifact from Maven. This artifact could be used only for Scala/Java. I see there is

How to write this pandas logic for pyspark.sql.dataframe.DataFrame without using pandas on spark API?

I’m totally new to Pyspark, as Pyspark doesn’t have loc feature how can we write this logic. I tried by specifying conditions but couldn’t get the desirable result, any help would be greatly appreciated! Answer For a data like the following You’re actually updating total column in each statement, not in an if-then-else way. Your code can be replicated (as

Get value from Spark dataframe when rows are dictionaries

I have a PySpark dataframe that looks like this: Values Column {[0.0, 54.04, 48…. Sector A {[0.0, 55.4800000… Sector A If I show the first element of the column ‘Values’ without truncating the data, it looks like this: {[0.0, 54.04, 48.19, 68.59, 61.81, 54.730000000000004, 48.51, 57.03, 59.49, 55.44, 60.56, 52.52, 51.44, 55.06, 55.27, 54.61, 55.89, 56.5, 45.4, 68.63, 63.88, 48.25,