Tag: dataframe

How to drop column according to NAN percentage for dataframe?

For certain columns of df, 80% or more of the values are NaN. What's the simplest code to drop such columns? Answer You can use isnull with mean to get the NaN fraction per column, then remove columns by boolean indexing with loc (since we are selecting along the column axis); the condition also needs to be inverted, so keeping fractions < 0.8 removes all columns that are >= 0.8 NaN: Sample: If you want to remove columns by a minimal
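A minimal sketch of that approach, with an illustrative 0.8 threshold and made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan, np.nan],  # 80% NaN
    "b": [1, 2, 3, 4, 5],                      # no NaN
})

# isnull().mean() gives the NaN fraction per column; loc along the
# column axis keeps only columns whose fraction is below the threshold.
df = df.loc[:, df.isnull().mean() < 0.8]
```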

Rename nested field in spark dataframe

Having a dataframe df in Spark: How to rename field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don't know another way to set the schema for df. I think I am on the right
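A sketch of a safer alternative, assuming Spark 3.x, a column named array_field holding an array of structs, and a second field b invented for illustration: struct casts match fields by position, so casting to the same types under a new field name renames the nested field without touching private attributes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one row with an array of structs.
df = spark.createDataFrame(
    [([(1, "x")],)],
    "array_field: array<struct<a: int, b: string>>",
)

# The cast matches struct fields by position, so casting to the same
# types with a new field name effectively renames array_field.a.
df = df.withColumn(
    "array_field",
    F.col("array_field").cast("array<struct<a_renamed: int, b: string>>"),
)
df.printSchema()
```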

NaN values in pivot_table index causes loss of data

Here is a simple DataFrame: Pivot method 1 The data can be pivoted to this: Downside: data in the 2nd row is lost because df['b'][1] == None. Pivot method 2 Downside: column b is lost. How can the two methods be combined so that both column b and the 2nd row are kept, like so: More generally: How can information from
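One common workaround, sketched with made-up data: pivot_table drops rows whose index keys are NaN, so fill the index column with a placeholder before pivoting.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["x", "x", "y"],
    "b": ["p", None, "q"],
    "c": [1, 2, 3],
})

# Fill NaN in the index column with a sentinel so the 2nd row survives,
# then pivot on both key columns.
out = df.fillna({"b": "<missing>"}).pivot_table(
    index=["a", "b"], values="c", aggfunc="sum"
)
print(out)
```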

Select multiple ranges of columns in Pandas DataFrame

I have to read several files, some in Excel format and some in CSV format. Some of the files have hundreds of columns. Is there a way to select several ranges of columns without specifying all the column names or positions? For example, something like selecting columns 1-10, 15, 17 and 50-100: I need to know how to do
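A sketch of one way to do this, using np.r_ to stitch several positional ranges into a single indexer for iloc (the ranges mirror the example; note that Python slice end bounds are exclusive):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 120))

# np.r_ concatenates slices and scalars into one index array:
# columns 1-10, 15, 17 and 50-100.
subset = df.iloc[:, np.r_[1:11, 15, 17, 50:101]]
print(subset.shape)  # (3, 63)
```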

Dropping Multiple Columns from a dataframe

I know how to drop columns from a data frame using Python. But for my problem the data set is vast, and the columns I want to drop are either grouped together or scattered individually across the column axis. Is there a shorter way to slice or drop all these columns with fewer lines of code rather than
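A sketch of one compact pattern (column names invented): select a contiguous block by position, add the scattered names, and drop everything in a single call.

```python
import pandas as pd

df = pd.DataFrame(0, index=[0], columns=[f"col{i}" for i in range(20)])

# One drop call: a contiguous block selected by position plus a few
# scattered columns selected by name.
to_drop = list(df.columns[3:8]) + ["col12", "col17"]
df = df.drop(columns=to_drop)
print(df.columns.tolist())
```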

pandas concat generates nan values

I am curious why a simple concatenation of two dataframes in pandas, both of the same shape and both without NaN values, can result in a lot of NaN values when joined. How can I fix this problem and prevent NaN values from being introduced? Trying to reproduce it failed, e.g. a similar case worked just fine, as no NaN values were introduced. Answer
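The usual cause is index misalignment: a column-wise concat aligns rows by index label, so two frames with different indexes produce NaN wherever a label has no match. A sketch (data invented) of the symptom and the common fix of resetting the indexes first:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"y": [3, 4]}, index=[5, 6])

# axis=1 concat aligns on index labels; disjoint indexes yield NaN
# in every cell without a matching label.
print(pd.concat([a, b], axis=1))

# Resetting both indexes aligns the rows positionally instead.
print(pd.concat([a.reset_index(drop=True), b.reset_index(drop=True)], axis=1))
```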
