I’m trying to convert a pandas DataFrame into a Spark one, but I got an error. Answer You need to make sure your pandas DataFrame columns are appropriate for the type Spark is inferring. If your pandas DataFrame shows an ambiguous column type (for example object) and you’re getting that error, try casting the offending columns with .astype(str). Now, make sure str is actually the type you want those columns to be.
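A minimal sketch of the cast-then-convert approach, assuming a hypothetical frame pdf whose code column holds mixed types (the column names and the SparkSession entry point are illustrative, not from the original question):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame: "code" is an object column with mixed types,
# which is the kind of thing that breaks Spark's type inference.
pdf = pd.DataFrame({"id": [1, 2, 3], "code": ["a", 7, None]})

# Cast the ambiguous column explicitly before handing it to Spark.
pdf["code"] = pdf["code"].astype(str)

sdf = spark.createDataFrame(pdf)
sdf.printSchema()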
Tag: apache-spark
How can I read in a binary file from HDFS into a Spark DataFrame?
I am trying to port some code from pandas to (py)Spark. Unfortunately I am already failing at the input step, where I want to read binary data and put it into a Spark DataFrame. So far I have been using numpy’s fromfile: but for Spark I couldn’t find how to do it. My workaround so far was to use
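As a sketch under those constraints, sc.binaryFiles reads each file as a (path, bytes) pair, and numpy can decode the bytes much as fromfile would; the HDFS path, dtype, and column name below are assumptions for illustration:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, contents) pairs, one per file, so each
# file's raw bytes can be decoded exactly as numpy.fromfile would.
rdd = (
    sc.binaryFiles("hdfs:///data/readings/*.bin")        # hypothetical path
      .flatMap(lambda kv: [(float(x),)                   # one row per value
                           for x in np.frombuffer(kv[1], dtype=np.float32)])
)

df = rdd.toDF(["value"])
df.show()
```

On Spark 3.0 and later there is also a built-in binaryFile data source (spark.read.format("binaryFile")), which exposes each file’s raw bytes as a content column.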
Filter PySpark dataframe column with None value
I’m trying to filter a PySpark dataframe that has None as a row value: I can filter correctly with a string value, but the same filter fails with None, even though there are definitely values in each category. What’s going on? Answer You can use Column.isNull / Column.isNotNull: If you want to simply drop NULL values you can use na.drop with the subset argument. Equality comparisons against NULL won’t work, because in SQL NULL is undefined, so comparing any value to it yields NULL rather than True or False.
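A minimal sketch of the isNull / isNotNull / na.drop options, using a toy DataFrame with an assumed dt_mvmt column:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2016-03-27", 1), (None, 2)], ["dt_mvmt", "id"]   # toy data
)

df.where(col("dt_mvmt").isNull()).show()      # rows where dt_mvmt IS NULL
df.where(col("dt_mvmt").isNotNull()).show()   # rows where it is not

# Or drop the NULL rows outright:
df.na.drop(subset=["dt_mvmt"]).show()
```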
Spark-submit: undefined function parse_url
The parse_url function always works fine when we use Spark SQL through a SQL client (via the Thrift server), IPython, or the pyspark shell, but it doesn’t work in spark-submit mode. The error is: So we are using a workaround here. Please, any help with this issue? Answer Spark >= 2.0 Same as below, but use a SparkSession with Hive support enabled: Spark < 2.0 parse_url is a Hive UDF, so it requires a HiveContext rather than a plain SQLContext.
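A minimal Spark >= 2.0 sketch of the Hive-enabled entry point; the app name and URL are placeholders:

```python
from pyspark.sql import SparkSession

# Enabling Hive support makes parse_url resolve under spark-submit
# just as it does in the interactive shells.
spark = (
    SparkSession.builder
    .appName("parse-url-demo")   # app name is arbitrary
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql(
    "SELECT parse_url('http://example.com/a?k=v', 'HOST') AS host"
).show()
```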
Spark SQL Row_number() PartitionBy Sort Desc
I’ve successfully created a row_number() partitionBy in Spark using Window, but I would like to sort it descending instead of the default ascending. Here is my working code: That gives me this result: And here I add desc() to order descending: And get this error: AttributeError: ‘WindowSpec’ object has no attribute ‘desc’. What am I doing wrong here?
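For reference, a sketch of the usual fix: desc() is a Column method, so it belongs on the ordering column inside orderBy(), not on the WindowSpec itself; column names here are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 10), ("a", 30), ("b", 20)], ["grp", "cnt"]   # toy data
)

# desc() goes on the ordering column, not on the WindowSpec.
w = Window.partitionBy("grp").orderBy(col("cnt").desc())
df.withColumn("rowNum", row_number().over(w)).show()
```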
How to join on multiple columns in PySpark?
I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL). The following works after I first register the DataFrames as temp tables. I would now like to join them based on multiple columns, but I get SyntaxError: invalid syntax with this: Answer You should use the & / | operators and be careful about operator precedence (== has lower precedence than bitwise AND and OR).
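A self-contained sketch of the parenthesized multi-column join; the DataFrames and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "A", 10)], ["id", "kind", "x"])   # toy data
df2 = spark.createDataFrame([(1, "A", 99)], ["id", "kind", "y"])

# Parenthesize each comparison: in Python, == binds more loosely
# than &, so without parentheses the & is evaluated first and fails.
joined = df1.join(df2, (df1.id == df2.id) & (df1.kind == df2.kind))
joined.show()
```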
How to sort by value efficiently in PySpark?
I want to sort my (K, V) tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need: Using takeOrdered: Using a lambda: I’ve checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet requires the same number of operations as the lambda solution.
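A short sketch contrasting the two approaches on toy data (names are illustrative): takeOrdered for a top-n slice, sortBy for a full sort of the RDD.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([("a", 3), ("b", 1), ("c", 2)])   # toy (K, V) data

# Top-n by value: negate the value to get descending order.
top2 = pairs.takeOrdered(2, key=lambda kv: -kv[1])

# Full descending sort of the whole RDD by value.
ordered = pairs.sortBy(lambda kv: kv[1], ascending=False).collect()

print(top2, ordered)
```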
How to add a constant column in a Spark DataFrame?
I want to add a column to a DataFrame with some arbitrary value (the same for each row). I get an error when I use withColumn as follows: It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want.
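A minimal sketch of the lit() approach, the standard way to wrap a constant into a Column; the toy DataFrame is invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])   # toy data

# withColumn expects a Column; lit() wraps a plain Python value
# into one, so no add-and-subtract trick is needed.
df.withColumn("new_col", lit(10)).show()
```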
PySpark, importing schema through JSON file
tbschema.json looks like this: I load it using the following code: Why do the schema elements get sorted, when I want them in the same order as they appear in the JSON? Also, the integer data type has been converted into StringType after the JSON has been parsed; how do I retain the datatype? Answer Why do the schema elements get
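A hedged sketch of loading such a file with StructType.fromJson, which keeps field order and declared types exactly as written, assuming tbschema.json was produced by something like df.schema.json():

```python
import json
from pyspark.sql.types import StructType

# Assuming tbschema.json holds a schema serialized via df.schema.json().
with open("tbschema.json") as f:
    schema = StructType.fromJson(json.load(f))

# Field order and declared types come verbatim from the file;
# nothing is re-sorted or re-inferred.
print(schema.simpleString())
```

The resulting schema can then be passed explicitly when reading data, e.g. spark.read.schema(schema).json(path), so Spark does not infer types at all.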
Remove duplicates from a dataframe in PySpark
I’m messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error: “AttributeError: ‘list’ object has no attribute ‘dropDuplicates’”. Not quite sure why, as I seem to be following the syntax in the latest documentation. Answer It is not an import problem. You simply call .dropDuplicates() on the wrong object: after collect() you have a plain Python list, and lists don’t provide a dropDuplicates method.
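A minimal sketch on toy data showing dropDuplicates called on the DataFrame itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 5, 80), ("Alice", 5, 80), ("Alice", 10, 80)],
    ["name", "age", "height"],                  # toy data
)

# Call dropDuplicates on the DataFrame itself -- not on the list
# that collect() returns, which is what triggers the AttributeError.
df.dropDuplicates().show()
df.dropDuplicates(["name", "height"]).show()   # de-dupe on a subset
```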