I’m trying to filter a PySpark DataFrame that has None as a row value: and I can filter correctly with a string value: but this fails: But there are definitely values in each category. What’s going on? Answer You can use Column.isNull / Column.isNotNull: If you want to simply drop NULL values, you can use na.drop with the subset argument: Equality-based comparisons with NULL won’t work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL rather than true.
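A minimal sketch of both approaches, assuming a hypothetical single-column DataFrame (the column name and sample data are illustrative, not from the excerpt):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data; the column name is illustrative.
    df = spark.createDataFrame([("a",), ("b",), (None,)], ["category"])

    # Filter on NULL explicitly instead of comparing with == None.
    df.filter(col("category").isNull()).show()
    df.filter(col("category").isNotNull()).show()

    # Or drop rows that have NULL in the given column(s).
    df.na.drop(subset=["category"]).show()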
Spark-submit: undefined function parse_url
The function parse_url always works fine when we work with Spark SQL through a SQL client (via the Thrift server), IPython, or the pyspark shell, but it doesn’t work through spark-submit mode: The error is: So we are using a workaround here: Please, any help with this issue? Answer Spark >= 2.0 Same as below, but use SparkSession with Hive support enabled: Spark < 2.0 parse_url is a Hive UDF, so it requires a HiveContext rather than a plain SQLContext.
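A short sketch of the Spark >= 2.0 route the answer describes: build the session with Hive support enabled, after which parse_url resolves under spark-submit as well (the URL literal is illustrative):

    from pyspark.sql import SparkSession

    # Enable Hive support so Hive UDFs such as parse_url resolve.
    spark = (SparkSession.builder
             .enableHiveSupport()
             .getOrCreate())

    spark.sql(
        "SELECT parse_url('http://example.com/a?x=1', 'HOST') AS host"
    ).show()  # -> example.com

    # Spark < 2.0: construct a HiveContext instead of a SQLContext, e.g.
    #   from pyspark.sql import HiveContext
    #   sqlContext = HiveContext(sc)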
Spark SQL Row_number() PartitionBy Sort Desc
I’ve successfully created a row_number() partitionBy in Spark using Window, but would like to sort it in descending order instead of the default ascending. Here is my working code: That gives me this result: And here I add desc() to order descending: And get this error: AttributeError: ‘WindowSpec’ object has no attribute ‘desc’ What am I doing wrong here?
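The excerpt cuts off before an answer, but the error message itself points at the fix: desc() is a method on Column, not on WindowSpec, so it belongs inside orderBy. A minimal sketch with illustrative column names, using the modern PySpark API:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1), ("a", 3), ("b", 2)], ["category", "score"]
    )

    # Apply desc() to the column inside orderBy, not to the WindowSpec.
    w = Window.partitionBy("category").orderBy(col("score").desc())
    df.withColumn("rn", row_number().over(w)).show()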
How to join on multiple columns in Pyspark?
I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL). The following works: I first register them as temp tables. I would now like to join them based on multiple columns. I get SyntaxError: invalid syntax with this: Answer You should use the & / | operators and be careful about operator precedence (== has lower precedence than bitwise & and |).
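A minimal sketch of the parenthesized form, with hypothetical frames and key names; because == binds more loosely than & in Python, each equality needs its own parentheses:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([(1, "x", 10)], ["k1", "k2", "v"])
    b = spark.createDataFrame([(1, "x", 20)], ["k1", "k2", "w"])

    # & binds tighter than ==, so each equality must be parenthesized.
    joined = a.join(b, (a.k1 == b.k1) & (a.k2 == b.k2))
    joined.show()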
How to add a constant column in a Spark DataFrame?
I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows: It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the constant I want.
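The answer is not included in the excerpt, but the usual fix (rather than the add-and-subtract trick) is to wrap the constant in pyspark.sql.functions.lit so withColumn receives a Column; a minimal sketch with illustrative names:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    # withColumn expects a Column; lit() turns a Python literal into one.
    df.withColumn("new_col", lit(10)).show()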
PySpark, importing schema through JSON file
tbschema.json looks like this: I load it using the following code: Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON? The data type integer has been converted into StringType after the JSON has been derived; how do I retain the data type? Answer Why do the schema elements get sorted…
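The answer is truncated above, so the following is an assumption about the fix: rebuilding the schema with StructType.fromJson preserves both the field order and the declared types from the file. The file name tbschema.json comes from the question; its contents are assumed here to be a JSON-serialized StructType (e.g. the output of df.schema.json()):

    import json
    from pyspark.sql.types import StructType

    # Rebuild the StructType directly from the JSON document;
    # field order and data types are taken from the file as-is.
    with open("tbschema.json") as f:
        schema = StructType.fromJson(json.load(f))

    print(schema)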
How to use JDBC source to write and read data in (Py)Spark?
The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark, and possible issues with JDBC sources and known solutions. With small changes these methods should work with other supported languages, including Scala and R. Answer Writing data Include the applicable JDBC driver when you submit the application or start the shell. You can use --packages, for example, or supply the driver jar via --jars / --driver-class-path.
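A minimal read/write sketch, assuming a PostgreSQL server; the URL, table names, credentials, and driver coordinates below are all placeholders:

    from pyspark.sql import SparkSession

    # Submit with the driver on the classpath, e.g.:
    #   spark-submit --packages org.postgresql:postgresql:<version> app.py
    spark = SparkSession.builder.getOrCreate()

    url = "jdbc:postgresql://localhost/db"           # placeholder
    props = {"user": "user", "password": "pass",     # placeholders
             "driver": "org.postgresql.Driver"}

    # Read a table over JDBC ...
    df = spark.read.jdbc(url=url, table="some_table", properties=props)

    # ... and write it back to another table.
    df.write.jdbc(url=url, table="some_table_copy", mode="append",
                  properties=props)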