How do I transpose columns in PySpark? I want to make columns become rows, and rows become columns. Here is the input: +----+-----+-----+-----+ |idx | vin |cur | mean| +----+-----+-----+----…
Tag: pyspark-dataframes
How to read a gzip compressed json lines file into PySpark dataframe?
I have a JSON-lines file that I wish to read into a PySpark data frame. The file is gzip-compressed; the filename looks like this: file.jl.gz. I know how to read this file into a pandas data frame: …
Comma separated data in rdd (pyspark) indices out of bound problem
I have a CSV file which is comma-separated. One of the columns contains data which is itself comma-separated. Each row in that column has a different number of words, hence a different number of commas. …
Pyspark: How to code Complicated Dataframe algorithm problem (summing with condition)
I have a dataframe that looks like this: date: sorted nicely; Trigger: only T or F values; value: any random decimal (float) value; col1: represents a number of days and cannot be lower than -1 (-1 <= col1 < infinity); col2: represents a number of days and cannot be negative (col2 >= 0). Calculation […]
Adjusting incorrect data of a CSV file data in a Pyspark dataframe
I am trying to read a CSV file into a dataframe in PySpark, but the CSV file has mixed data: part of its data belongs to its adjacent column. Is there any way to modify the dataframe in Python …
Most efficient way of transforming a date column to a timestamp column + an hour
I want to know if there is a better way of transforming a date column into a datetime column + 1 hour than the method I am currently using. Here is my dataframe: My code: Which gives the output: Does anyone know a more efficient way of doing this? Casting to a timestamp twice seems […]
Pyspark groupBy DataFrame without aggregation or count
Can I iterate through a PySpark groupBy dataframe without aggregation or count? For example, code in pandas: for i, d in df2: mycode …. ^^ if using pandas ^^ Is there a difference in how …
Extract multiple words using regexp_extract in PySpark
I have a list which contains some words, and I need to extract matching words from a text line. I found this, but it only extracts one word. keys file content: this is a keyword part_description file …
Remove duplicates from a dataframe in PySpark
I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error: "AttributeError: 'list' object has no attribute 'dropDuplicates'". Not quite sure why, as I seem to be following the syntax in the latest documentation. Answer It is not an import problem. You simply call .dropDuplicates() on a wrong object.