For certain columns of df, 80% of the values are NaN. What is the simplest code to drop such columns? Answer You can use isnull with mean to get each column's NaN fraction, then keep columns by boolean indexing with loc (since we are removing columns). The condition also needs to be inverted: keeping columns where the fraction is < 0.8 removes every column where it is >= 0.8. Sample: If you want to remove columns by a minimal…
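A minimal sketch of the isnull/mean approach the answer describes; the frame and column names here are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan, np.nan],  # 80% NaN -> dropped
    "b": [1, 2, 3, np.nan, 5],                 # 20% NaN -> kept
})

# isnull().mean() gives the per-column fraction of NaN values;
# keep only the columns where that fraction is below the threshold.
df = df.loc[:, df.isnull().mean() < 0.8]
print(df.columns.tolist())  # ['b']
```

`df.dropna(axis=1, thresh=...)` is a related built-in if you prefer to express the rule as a minimum count of non-NaN values per column.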
Rename nested field in spark dataframe
Given a dataframe df in Spark: how do I rename the field array_field.a to array_field.a_renamed? [Update]: .withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method: I know that setting a private attribute is not good practice, but I don't know any other way to set the schema for df. I think I am on the right…
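The question's own code is not shown above. One common non-hacky alternative is to cast the column to the same schema with the struct field renamed; Spark matches struct fields by position during a cast, so the data is untouched. A sketch with an invented toy frame:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with the nested layout from the question (values invented).
df = spark.createDataFrame(
    [([(1, "x")],)],
    "array_field: array<struct<a: long, b: string>>",
)

# Casting to an identical schema with the field renamed changes only
# the field name in the type, not the data.
df2 = df.withColumn(
    "array_field",
    F.col("array_field").cast("array<struct<a_renamed: long, b: string>>"),
)
df2.printSchema()  # array_field: array<struct<a_renamed, b>>
```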
Python/Pandas: If Column has multiple values, convert to single row with multiples values in list
In my DataFrame, I have many instances of the same AutoNumber having different KeyValue_String values. I would like to collapse these instances into a single row where KeyValue_String is a list of the multiple unique values. The desired output would look like this, except I want to keep all of the other columns. Answer If I understand correctly, you could…
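A sketch of the groupby-and-collect approach suggested by the answer, using hypothetical data with the question's column names (OtherCol is invented to stand for the other columns):

```python
import pandas as pd

df = pd.DataFrame({
    "AutoNumber": [1, 1, 2],
    "OtherCol": ["x", "x", "y"],
    "KeyValue_String": ["a", "b", "c"],
})

# Group on every column except KeyValue_String so the other columns
# are kept, and collect the unique values into a list per group.
group_cols = [c for c in df.columns if c != "KeyValue_String"]
out = (
    df.groupby(group_cols, as_index=False)["KeyValue_String"]
      .agg(lambda s: s.unique().tolist())
)
print(out)  # AutoNumber 1 -> ['a', 'b'], AutoNumber 2 -> ['c']
```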
How to import all fields from xls as strings into a Pandas dataframe?
I am trying to import a file from xlsx into a Python Pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting. So for an Excel sheet with 100 columns, I would do the following using a dict comprehension with range(99). These import files do have a varying…
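A sketch of two equivalent ways to force text, assuming a placeholder file name of input.xlsx; dtype=str is the shorter one, while converters mirrors the question's dict-comprehension idea:

```python
import pandas as pd

# dtype=str keeps every column as text, preserving leading zeros
# and other mixed formatting.
df = pd.read_excel("input.xlsx", dtype=str)

# Equivalent per-column form, in the spirit of the question's
# dict comprehension over column positions:
df2 = pd.read_excel("input.xlsx", converters={i: str for i in range(100)})
```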
NaN values in pivot_table index causes loss of data
Here is a simple DataFrame: Pivot method 1 The data can be pivoted to this: Downside: the data in the 2nd row is lost because df['b'][1] == None. Pivot method 2 Downside: column b is lost. How can the two methods be combined so that both column b and the 2nd row are kept, like so: More generally: how can information from…
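The question's exact frame isn't shown; a sketch of the usual workaround, filling the NaN pivot keys with a placeholder before pivoting so neither the row nor the column is dropped (the data below is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["x", "y"],
    "b": ["p", None],  # a None pivot key silently drops the 2nd row
    "c": [1, 2],
})

# Replace missing keys with a placeholder so pivot_table keeps them.
out = df.fillna("(missing)").pivot_table(
    index=["a", "b"], values="c", aggfunc="sum"
)
print(out)
```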
How to get the index of ith item in pandas.Series or pandas.DataFrame?
I’m trying to get the index of the 6th item in a Series I have. This is what the head looks like: To get the 6th index name (the 6th country after sorting), I usually use s.head(6) and read off the 6th index from there. s.head(6) gives me: and looking at this, I get the index as United Kingdom. So, is there…
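Positional access on the index itself avoids the head() detour. A sketch with an invented stand-in for the sorted country Series:

```python
import pandas as pd

s = pd.Series(
    [120, 95, 80, 60, 45, 30, 10],
    index=["US", "China", "India", "Germany", "France",
           "United Kingdom", "Spain"],
)

# The index supports positional access directly, so the 6th item's
# label is simply position 5:
print(s.index[5])           # United Kingdom
print(s.head(6).index[-1])  # same result via the head() route
```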
Select multiple ranges of columns in Pandas DataFrame
I have to read several files, some in Excel format and some in CSV format. Some of the files have hundreds of columns. Is there a way to select several ranges of columns without specifying all the column names or positions? For example, something like selecting columns 1-10, 15, 17 and 50-100: I need to know how to do…
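One common way to do this is NumPy's np.r_, which concatenates slices and scalars into a single index array for iloc. A sketch with an invented frame wide enough for the ranges in the question:

```python
import numpy as np
import pandas as pd

# Toy frame wide enough for the ranges in the question.
df = pd.DataFrame(np.random.rand(3, 101))

# np.r_ builds one integer array from several positional ranges,
# so they all go into a single iloc call
# (slice end bounds are exclusive, hence 11 and 101):
subset = df.iloc[:, np.r_[1:11, 15, 17, 50:101]]
print(subset.shape)  # (3, 63)
```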
Dropping Multiple Columns from a dataframe
I know how to drop columns from a data frame using Python. But for my problem the data set is vast, and the columns I want to drop are either grouped together or scattered individually across the column axis. Is there a shorter way to slice or drop all of those columns with fewer lines of code rather than…
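A sketch covering both cases with invented column names; contiguous blocks can be addressed by slicing the column index, scattered ones by listing them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 10), columns=list("abcdefghij"))

# A contiguous block can be dropped by slicing the column index...
df = df.drop(columns=df.columns[2:5])  # drops c, d, e

# ...and scattered columns by listing them in the same call.
df = df.drop(columns=["a", "j"])
print(df.columns.tolist())  # ['b', 'f', 'g', 'h', 'i']
```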
pandas concat generates nan values
I am curious why a simple concatenation of two pandas dataframes of the same shape, both without NaN values, can result in a lot of NaN values when joined. How can I fix this problem and prevent NaN values from being introduced? My attempt to reproduce it failed; e.g. a similar concat worked just fine, as no NaN values were introduced. Answer…
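The question's frames aren't shown, but the usual cause is index alignment. A sketch of how mismatched index labels produce the NaNs, and how resetting the indexes avoids it:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"b": [3, 4]}, index=[5, 6])  # different labels

# axis=1 concat aligns rows by index label, so non-matching labels
# turn into rows full of NaN:
print(pd.concat([df1, df2], axis=1))

# Resetting both indexes makes the rows line up positionally:
print(pd.concat([df1.reset_index(drop=True),
                 df2.reset_index(drop=True)], axis=1))
```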
python pandas: filter out records with null or empty string for a given field
I am trying to filter out records whose field_A is null or an empty string in the data frame, like below: This gives me an error: or This one gave no error but didn’t filter out any None values. I also tried: This one doesn’t give an error but doesn’t filter out any None values either. Could anyone please advise how to solve…
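The attempted expressions are elided above; a sketch of a working pandas filter with invented data, since comparisons against None do not match missing values:

```python
import pandas as pd

df = pd.DataFrame({"field_A": ["x", None, "", "y"]})

# != None does not catch missing values; use notna() for the nulls
# and a separate comparison for empty strings, joined with &.
out = df[df["field_A"].notna() & (df["field_A"] != "")]
print(out)  # keeps only the 'x' and 'y' rows
```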