I have a Spark DataFrame in PySpark (2.1.0) and I want to get the names of only the numeric columns or only the string columns.
For example, this is the schema of my DataFrame:
root
 |-- Gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)
This is what I need:
num_cols = ['MonthlyCharges', 'TotalCharges']
str_cols = ['Gender', 'SeniorCitizen', 'Churn']
How can I do this?
Answer
df.dtypes is a list of (columnName, type) tuples, so you can use a simple filter:
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
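The same filter can be extended to cover the numeric columns as well. The following is a minimal sketch, not a definitive implementation: the helper name split_columns and the NUMERIC_TYPES set are assumptions made for illustration, and the set lists the type strings that df.dtypes is expected to report for Spark's numeric types. It works on a plain list of tuples, so example_dtypes stands in for df.dtypes here.

```python
# Type names expected in df.dtypes for numeric columns
# (assumed set for this sketch; adjust it to your own data).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def split_columns(dtypes):
    """Split (name, type) pairs, such as df.dtypes, into numeric and string column names."""
    # Decimal columns appear as e.g. 'decimal(10,2)', so compare only the part before '('.
    num_cols = [name for name, dtype in dtypes if dtype.split("(")[0] in NUMERIC_TYPES]
    str_cols = [name for name, dtype in dtypes if dtype == "string"]
    return num_cols, str_cols


# Stand-in for df.dtypes, mirroring the schema in the question.
example_dtypes = [
    ("Gender", "string"),
    ("SeniorCitizen", "string"),
    ("MonthlyCharges", "double"),
    ("TotalCharges", "double"),
    ("Churn", "string"),
]

num_cols, str_cols = split_columns(example_dtypes)
print(num_cols)  # ['MonthlyCharges', 'TotalCharges']
print(str_cols)  # ['Gender', 'SeniorCitizen', 'Churn']
```

With a real DataFrame, the call would be split_columns(df.dtypes). Comparing the base type name exactly, rather than using startswith, avoids accidentally matching unrelated types whose names share a prefix.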