For certain columns of df, 80% of the values are NaN. What is the simplest code to drop such columns? Answer You can use isnull with mean to get each column's NaN fraction, then keep columns by boolean indexing with loc (since we are removing columns). The condition also needs to be inverted: keeping columns where the fraction is < 0.8 removes every column where it is >= 0.8. Sample: If you want to remove columns by a minimal…
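A minimal sketch of the isnull/mean approach the answer describes; the frame and column names here are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan, np.nan],  # 80% NaN -> dropped
    "b": [1, 2, 3, np.nan, 5],                 # 20% NaN -> kept
})

# isnull().mean() gives the per-column fraction of NaN values;
# keep only the columns where that fraction is below the threshold.
df = df.loc[:, df.isnull().mean() < 0.8]
print(df.columns.tolist())  # ['b']
```

`df.dropna(axis=1, thresh=...)` is a related built-in if you prefer to express the rule as a minimum count of non-NaN values per column.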
Rename nested field in spark dataframe
Given a dataframe df in Spark: how do I rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don't know any other way to set the schema for df. I think I am on the right…
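The question's own code is not shown above. One common non-hacky alternative is to cast the column to the same schema with the struct field renamed; Spark matches struct fields by position during a cast, so the data is untouched. A sketch with an invented toy frame:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with the nested layout from the question (values invented).
df = spark.createDataFrame(
    [([(1, "x")],)],
    "array_field: array<struct<a: long, b: string>>",
)

# Casting to an identical schema with the field renamed changes only
# the field name in the type, not the data.
df2 = df.withColumn(
    "array_field",
    F.col("array_field").cast("array<struct<a_renamed: long, b: string>>"),
)
df2.printSchema()  # array_field: array<struct<a_renamed, b>>
```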
Python/Pandas: If Column has multiple values, convert to single row with multiples values in list
In my DataFrame, I have many instances of the same AutoNumber having different KeyValue_String values. I would like to collapse these instances into a single row where KeyValue_String is a list of the multiple unique values. The desired output would look like this, except I want to keep all of the other columns. Answer If I understand correctly, you could…
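A sketch of the groupby-and-collect approach suggested by the answer, using hypothetical data with the question's column names (OtherCol is invented to stand for the other columns):

```python
import pandas as pd

df = pd.DataFrame({
    "AutoNumber": [1, 1, 2],
    "OtherCol": ["x", "x", "y"],
    "KeyValue_String": ["a", "b", "c"],
})

# Group on every column except KeyValue_String so the other columns
# are kept, and collect the unique values into a list per group.
group_cols = [c for c in df.columns if c != "KeyValue_String"]
out = (
    df.groupby(group_cols, as_index=False)["KeyValue_String"]
      .agg(lambda s: s.unique().tolist())
)
print(out)  # AutoNumber 1 -> ['a', 'b'], AutoNumber 2 -> ['c']
```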
How to import all fields from xls as strings into a Pandas dataframe?
I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting. So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(99). These import files do have a varying…
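A sketch of two equivalent ways to force text, assuming a placeholder file name of input.xlsx; dtype=str is the shorter one, while converters mirrors the question's dict-comprehension idea:

```python
import pandas as pd

# dtype=str keeps every column as text, preserving leading zeros
# and other mixed formatting.
df = pd.read_excel("input.xlsx", dtype=str)

# Equivalent per-column form, in the spirit of the question's
# dict comprehension over column positions:
df2 = pd.read_excel("input.xlsx", converters={i: str for i in range(100)})
```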
NaN values in pivot_table index causes loss of data
Here is a simple DataFrame: Pivot method 1 The data can be pivoted to this: Downside: the data in the 2nd row is lost because df['b'][1] == None. Pivot method 2 Downside: column b is lost. How can the two methods be combined so that both column b and the 2nd row are kept, like so: More generally: how can information from…
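The question's exact frame isn't shown; a sketch of the usual workaround, filling the NaN pivot keys with a placeholder before pivoting so neither the row nor the column is dropped (the data below is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["x", "y"],
    "b": ["p", None],  # a None pivot key silently drops the 2nd row
    "c": [1, 2],
})

# Replace missing keys with a placeholder so pivot_table keeps them.
out = df.fillna("(missing)").pivot_table(
    index=["a", "b"], values="c", aggfunc="sum"
)
print(out)
```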
How to get the index of ith item in pandas.Series or pandas.DataFrame?
I’m trying to get the index of the 6th item in a Series I have. This is what the head looks like: To get the 6th index name (the 6th country after sorting), I usually use s.head(6) and read off the 6th index from there. s.head(6) gives me: and looking at this, I get the index as United Kingdom. So, is there…
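Positional access on the index itself avoids the head() detour. A sketch with an invented stand-in for the sorted country Series:

```python
import pandas as pd

s = pd.Series(
    [120, 95, 80, 60, 45, 30, 10],
    index=["US", "China", "India", "Germany", "France",
           "United Kingdom", "Spain"],
)

# The index supports positional access directly, so the 6th item's
# label is simply position 5:
print(s.index[5])           # United Kingdom
print(s.head(6).index[-1])  # same result via the head() route
```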
Select multiple ranges of columns in Pandas DataFrame
I have to read several files, some in Excel format and some in CSV format. Some of the files have hundreds of columns. Is there a way to select several ranges of columns without specifying all the column names or positions? For example, something like selecting columns 1-10, 15, 17 and 50-100: I need to know how to do…
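One common way to do this is NumPy's np.r_, which concatenates slices and scalars into a single index array for iloc. A sketch with an invented frame wide enough for the ranges in the question:

```python
import numpy as np
import pandas as pd

# Toy frame wide enough for the ranges in the question.
df = pd.DataFrame(np.random.rand(3, 101))

# np.r_ builds one integer array from several positional ranges,
# so they all go into a single iloc call
# (slice end bounds are exclusive, hence 11 and 101):
subset = df.iloc[:, np.r_[1:11, 15, 17, 50:101]]
print(subset.shape)  # (3, 63)
```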
Dropping Multiple Columns from a dataframe
I know how to drop columns from a data frame using Python. But for my problem the data set is vast, and the columns I want to drop are either grouped together or scattered individually across the column axis. Is there a shorter way to slice or drop all of those columns with fewer lines of code rather than…
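A sketch covering both cases with invented column names; contiguous blocks can be addressed by slicing the column index, scattered ones by listing them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 10), columns=list("abcdefghij"))

# A contiguous block can be dropped by slicing the column index...
df = df.drop(columns=df.columns[2:5])  # drops c, d, e

# ...and scattered columns by listing them in the same call.
df = df.drop(columns=["a", "j"])
print(df.columns.tolist())  # ['b', 'f', 'g', 'h', 'i']
```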
pandas concat generates nan values
I am curious why a simple concatenation of two pandas dataframes of the same shape, both without NaN values, can result in a lot of NaN values when joined. How can I fix this problem and prevent NaN values from being introduced? My attempt to reproduce it failed; e.g. a similar concat worked just fine, as no NaN values were introduced. Answer…
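The question's frames aren't shown, but the usual cause is index alignment. A sketch of how mismatched index labels produce the NaNs, and how resetting the indexes avoids it:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"b": [3, 4]}, index=[5, 6])  # different labels

# axis=1 concat aligns rows by index label, so non-matching labels
# turn into rows full of NaN:
print(pd.concat([df1, df2], axis=1))

# Resetting both indexes makes the rows line up positionally:
print(pd.concat([df1.reset_index(drop=True),
                 df2.reset_index(drop=True)], axis=1))
```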
python pandas: filter out records with null or empty string for a given field
I am trying to filter out records whose field_A is null or an empty string in the data frame, like below: This gives me an error: or This one gave no error but didn’t filter out any None values. I also tried: This one doesn’t give an error but doesn’t filter out any None values either. Could anyone please advise how to solve…
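The attempted expressions are elided above; a sketch of a working pandas filter with invented data, since comparisons against None do not match missing values:

```python
import pandas as pd

df = pd.DataFrame({"field_A": ["x", None, "", "y"]})

# != None does not catch missing values; use notna() for the nulls
# and a separate comparison for empty strings, joined with &.
out = df[df["field_A"].notna() & (df["field_A"] != "")]
print(out)  # keeps only the 'x' and 'y' rows
```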