Tag: dataframe

Rename nested field in spark dataframe

Given a DataFrame df in Spark, how can the field array_field.a be renamed to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don't know another way to set the schema…

NaN values in pivot_table index causes loss of data

Here is a simple DataFrame: Pivot method 1: The data can be pivoted to this: Downside: data in the 2nd row is lost because df['b'][1] == None. Pivot method 2: Downside: column b is lost. How can the two methods be combined so that column b and the 2nd row are kept like so: More generally: How can I…
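One possible way to keep a row whose index key is missing is to swap the missing value for a placeholder before pivoting and restore it afterwards. The sketch below is only an illustration: the original DataFrame is not shown in the excerpt, so the data, the column names `a`, `b`, `c`, and the `SENTINEL` value are all assumptions.

```python
import pandas as pd

# Hypothetical data: the question's actual DataFrame is not shown in the excerpt.
df = pd.DataFrame({
    "a": ["x", "y", "z"],
    "b": ["p", None, "q"],   # the 2nd row has a missing value in "b"
    "c": [1, 2, 3],
})

# Assumed placeholder; choose a value that cannot occur in the real data.
SENTINEL = "<missing>"

pivoted = (
    df.fillna({"b": SENTINEL})                      # keep the row from being dropped
      .pivot_table(index=["a", "b"], values="c", aggfunc="sum")
      .reset_index()
)

# Turn the placeholder back into a missing value.
pivoted["b"] = pivoted["b"].mask(pivoted["b"] == SENTINEL)
```

The placeholder approach does not depend on how a particular pandas version treats missing group keys, at the cost of one extra pass over the column.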

Select multiple ranges of columns in Pandas DataFrame

I have to read several files, some in Excel format and some in CSV format. Some of the files have hundreds of columns. Is there a way to select several ranges of columns without specifying all the column names or positions? For example, something like selecting columns 1-10, 15, 17 and 50-100: I need to know h…
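A minimal sketch of one way to do this, assuming the ranges in the question are zero-based column positions: `numpy.r_` joins slices and single integers into one index array that `DataFrame.iloc` accepts. The 120-column frame and its `col0`..`col119` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame with 120 columns named col0..col119.
df = pd.DataFrame(np.zeros((2, 120)), columns=[f"col{i}" for i in range(120)])

# np.r_ concatenates slices and scalars into a single integer array.
# Slices are half-open, so 1:11 covers positions 1..10 and 50:101 covers 50..100.
positions = np.r_[1:11, 15, 17, 50:101]

subset = df.iloc[:, positions]
```

Because the selection is positional, the same `positions` array can be reused on frames read from either Excel or CSV, provided the column order matches.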

Dropping Multiple Columns from a dataframe

I know how to drop columns from a data frame using Python. But for my problem the data set is vast, and the columns I want to drop are either grouped together or spread out individually across the column axis. Is there a shorter way to slice or drop all the columns with fewer lines of code rather tha…
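One way this could look, as a sketch under assumed data: build the positions of the unwanted columns with `numpy.r_`, look up their labels through `df.columns`, and pass them to `DataFrame.drop`. The 20-column frame and the chosen positions are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 20 columns named col0..col19.
df = pd.DataFrame(np.zeros((2, 20)), columns=[f"col{i}" for i in range(20)])

# Positions to drop: two contiguous groups (2..5 and 14..17) plus one stray column (9).
to_drop = df.columns[np.r_[2:6, 9, 14:18]]

# drop returns a new frame; the original df is left unchanged.
trimmed = df.drop(columns=to_drop)
```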

pandas concat generates nan values

I am curious why a simple concatenation of two dataframes in pandas: of the same shape and both without NaN values, can result in a lot of NaN values when joined. How can I fix this problem and prevent NaN values being introduced? Trying to reproduce it like failed e.g. worked just fine as no NaN values were int…
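The excerpt does not show the failing call, so the cause here is a guess: one common source of unexpected NaN values is that `pd.concat` with `axis=1` aligns rows on index labels, not on position. The sketch below reproduces that effect with two made-up frames and shows that resetting the index avoids it.

```python
import pandas as pd

# Hypothetical frames: same shape, no NaN values, but different index labels.
left = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"y": [10, 20, 30]}, index=[5, 6, 7])

# Column-wise concat aligns on index labels, so no row matches
# and every row gets a NaN in one of the two columns.
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the rows line up by position instead.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
```

If the index labels carry meaning, discarding them with `reset_index(drop=True)` may not be appropriate, and a merge on an explicit key could be a better fit.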