Selecting only numeric/string column names from a Spark DataFrame in PySpark

I have a Spark DataFrame in PySpark (2.1.0) and I am looking to get the names of the numeric columns only, or the string columns only.

For example, this is the Schema of my DF:

root
 |-- Gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)

This is what I need:

num_cols = ['MonthlyCharges', 'TotalCharges']
str_cols = ['Gender', 'SeniorCitizen', 'Churn']

How can I do this?


Answer

df.dtypes is a list of (columnName, type) tuples, so you can use a simple list comprehension to filter it:

num_cols = [name for name, dtype in df.dtypes if dtype == 'double']
str_cols = [name for name, dtype in df.dtypes if dtype == 'string']
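Since df.dtypes is just a plain Python list of (name, type) tuples, the filtering can be illustrated without a running Spark session. The list below is hard-coded to mirror the schema from the question; in real code it would come from df.dtypes. The set of numeric type names checked here is an assumption covering the common Spark SQL numeric types:

```python
# Hard-coded stand-in for df.dtypes, mirroring the schema in the question.
# In practice this list comes directly from df.dtypes.
dtypes = [
    ('Gender', 'string'),
    ('SeniorCitizen', 'string'),
    ('MonthlyCharges', 'double'),
    ('TotalCharges', 'double'),
    ('Churn', 'string'),
]

# Assumed set of common Spark SQL numeric type names; decimal types
# render as e.g. 'decimal(10,2)', hence the startswith check.
NUMERIC_TYPES = ('tinyint', 'smallint', 'int', 'bigint', 'float', 'double')

num_cols = [name for name, dtype in dtypes
            if dtype in NUMERIC_TYPES or dtype.startswith('decimal')]
str_cols = [name for name, dtype in dtypes if dtype == 'string']

print(num_cols)  # ['MonthlyCharges', 'TotalCharges']
print(str_cols)  # ['Gender', 'SeniorCitizen', 'Churn']
```

Note that SeniorCitizen lands in str_cols here because the schema declares it as a string, even though it looks numeric; the filter goes by the declared type, not the contents.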
User contributions licensed under: CC BY-SA