From a PySpark SQL dataframe like
name    age    city
abc     20     A
def     30     B
How to get the last row? (By df.limit(1) I can get the first row of the dataframe into a new dataframe.)
And how can I access the dataframe rows by index, like row no. 12 or 200?
In pandas I can do
df.tail(1)               # for the last row
df.ix[rowno or index]    # by index
df.loc[]                 # or df.iloc[]
I am just curious how to access a PySpark dataframe in such ways, or what the alternatives are.
Thanks
Answer
How to get the last row.
Long and ugly way which assumes that all columns are orderable:
from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)

last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    # _id is the first struct field, so the struct maximum is the row
    # with the largest _id, i.e. the last row
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))
If not all columns can be ordered you can try:
with_id = df.withColumn("_id", monotonically_increasing_id())

i = with_id.select(max_("_id")).first()[0]

with_id.where(col("_id") == i).drop("_id")
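A small usage note, reusing the with_id and i variables from the snippet above: the result is still a one-row DataFrame, so you can collect it into a single Row with first() if that is what you need.

last_row = with_id.where(col("_id") == i).drop("_id").first()
# For the sample data in the question this should be the (def, 30, B) row.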
Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions it is not a good choice here.
how can I access the dataframe rows by index
You cannot. A Spark DataFrame is distributed and not addressable by index. You can add indices using zipWithIndex on the underlying RDD and filter later, as sketched below. Just keep in mind that this is an O(N) operation.
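A minimal sketch of that approach, assuming a 0-based position n is wanted; the helper name row_at and the _idx column are made up for illustration:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType, StructField, StructType

def row_at(df, n):
    # Append a positional index to every row via the underlying RDD
    # (zipWithIndex), then filter for the requested position.
    # This touches the whole dataset, i.e. it is an O(N) operation.
    schema = StructType(df.schema.fields + [StructField("_idx", LongType(), False)])
    indexed = (df.rdd
                 .zipWithIndex()                 # (Row, position) pairs
                 .map(lambda x: x[0] + (x[1],))  # flatten each pair into one tuple
                 .toDF(schema))
    return indexed.where(col("_idx") == n).drop("_idx")

row_at(df, 1).show()  # second row of the sample dataframe: def, 30, B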