I want to create my own transformer for use with the sklearn Pipeline. I am creating a class that implements both fit and transform methods. The purpose of the transformer will be to remove rows from the matrix that have more than a specified number of NaNs. The issue I am facing is how can I change both the X
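A minimal sketch of such a transformer, assuming the input is a NumPy array of floats; the class and parameter names (NaNRowFilter, max_nans) are invented for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NaNRowFilter(BaseEstimator, TransformerMixin):
    """Drop rows containing more than `max_nans` NaN values (hypothetical name)."""

    def __init__(self, max_nans=0):
        self.max_nans = max_nans

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Keep only rows whose NaN count does not exceed the threshold.
        mask = np.isnan(X).sum(axis=1) <= self.max_nans
        return X[mask]
```

Note that transform only returns the filtered X; a standard sklearn Pipeline gives transformers no way to shrink y to match, which is the crux of the question above.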
python pandas flatten a dataframe to a list
I have a df like so: I want to flatten the df so it is one continuous list like so: ['1/2/2014', 'a', '6', 'z1', '1/2/2014', 'a', '3', 'z1', '1/3/2014', 'c', '1', 'x3'] I can loop through the rows and extend to a list, but is there a much easier way to do it? Answer You can use .flatten() on the DataFrame converted to a NumPy array.
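A short sketch of that approach, using a small frame invented to mirror the values in the question (the column names are made up):

```python
import pandas as pd

# Hypothetical frame whose cells match the list shown in the question.
df = pd.DataFrame({
    'date': ['1/2/2014', '1/2/2014', '1/3/2014'],
    'col1': ['a', 'a', 'c'],
    'col2': ['6', '3', '1'],
    'col3': ['z1', 'z1', 'x3'],
})

# .values gives a NumPy array; flatten() walks it row by row (C order).
flat = df.values.flatten().tolist()
print(flat)
# ['1/2/2014', 'a', '6', 'z1', '1/2/2014', 'a', '3', 'z1', '1/3/2014', 'c', '1', 'x3']
```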
How to extract base path from DataFrame column of path strings
There are several questions about string manipulation, but I can't find an answer that allows me to do the following (I thought it would be simple). I have a DataFrame which includes a column containing a filename and path. The following produces a representative example DataFrame: I want to end up with just the 'filename' part of the string. There
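One way to get the filename portion, sketched under the assumption that the column holding the full paths is called 'path' (a made-up name, as are the paths):

```python
import os
import pandas as pd

# Representative frame; the column name 'path' and the paths are invented.
df = pd.DataFrame({'path': ['/data/raw/file_a.csv', '/data/raw/file_b.csv']})

# os.path.basename strips the directory part, leaving just the filename.
df['filename'] = df['path'].apply(os.path.basename)
print(df)
```

A pure-pandas alternative would be df['path'].str.split('/').str[-1], at the cost of assuming a fixed separator.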
Create Python DataFrame from dictionary where keys are the column names and values form the row
I am familiar with python but new to pandas DataFrames. I have a dictionary like this: And I would like to convert it to a DataFrame, where b and c are the column names, and the first row is 100, 300 (100 is underneath b and 300 is underneath c). I would like a solution that can be generalized to a
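A minimal sketch, assuming a dictionary shaped like the one described, with keys 'b' and 'c':

```python
import pandas as pd

d = {'b': 100, 'c': 300}

# Wrapping the dict in a list makes its keys the columns and its values one row.
df = pd.DataFrame([d])
print(df)
#      b    c
# 0  100  300
```

The same pattern generalizes to a list of dictionaries, producing one row per dictionary.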
Pandas versions compatible with specific python and numpy configurations?
Is there a programmatic way to find out which pandas versions are compatible with specific python and numpy configurations? My interest is to get pandas going within ESRI ArcMap 10.1, which runs on 32-bit Windows and is built on Python 2.7 and numpy 1.6. I tried creating a conda environment for Python compatible with ESRI ArcMap 10.1 by opening a 32-bit
Setting plot background colour in Seaborn
I am using Seaborn to plot some data in Pandas. I am making some very large plots (factorplots). To see them, I am using some visualisation facilities at my university: a compound screen made up of 4 by 4 monitors with a small (but nonzero) bezel, the gap between the screens. This gap is black. To minimise
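A hedged sketch of changing the background, assuming the aim is simply a non-white face colour (the colour value and the example dataset are arbitrary):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Override the axes and figure face colours through Seaborn's rc mechanism.
sns.set(rc={'axes.facecolor': 'lightgrey', 'figure.facecolor': 'lightgrey'})

# Any plot drawn afterwards picks up the new background.
tips = sns.load_dataset('tips')
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()
```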
Filtering multiple items in a multi-index Python pandas dataframe
I have the following table: Note: Both NSRCODE and PBL_AWI are indices. How do I search for values in column PBL_AWI? For example, I want to keep the values ['Lake', 'River', 'Upland']. Answer You can use get_level_values in conjunction with Boolean slicing. The same idea can be expressed in many different ways, such as df[df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])] Note that you have
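A self-contained sketch of that filter; the data, the 'area' column, and the index values are invented, but the two index level names come from the question:

```python
import pandas as pd

# Small frame with NSRCODE and PBL_AWI as a MultiIndex.
df = pd.DataFrame(
    {'area': [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [('E1', 'Lake'), ('E1', 'River'), ('E2', 'Bog'), ('E2', 'Upland')],
        names=['NSRCODE', 'PBL_AWI'],
    ),
)

# Boolean mask built from the PBL_AWI level of the index.
keep = df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])
print(df[keep])
```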
How to merge two dataframes in pandas to replace NaN
I want to do this in pandas: I have 2 dataframes, A and B, and I want to replace only the NaN values of A with values from B. Answer The officially promoted way to do exactly this is A.combine_first(B). Further information is in the official documentation. However, it is massively outperformed on large DataFrames by A.fillna(B) (tests performed with 25000 elements).
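A brief sketch of both options, assuming A and B share the same index and columns (the frames here are invented):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})
B = pd.DataFrame({'x': [10.0, 20.0, 30.0], 'y': [40.0, 50.0, 60.0]})

# combine_first: keep A's values, fall back to B where A is NaN.
merged_cf = A.combine_first(B)

# fillna with a DataFrame does the same positional fill for aligned frames.
merged_fn = A.fillna(B)

print(merged_cf.equals(merged_fn))  # True here, since the frames align exactly
```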
Pandas – Compute z-score for all columns
I have a dataframe containing a single column of IDs; all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it: Some of my columns contain NaN values which I do not want to include in the z-score calculations, so I intend to use a solution offered to this question: how to
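A hedged sketch of per-column z-scores that leaves NaN cells untouched; the column names, including the 'ID' column, are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'v1': [1.0, 2.0, np.nan, 4.0],
    'v2': [10.0, 20.0, 30.0, 40.0],
})

# mean() and std() skip NaN by default, so missing cells do not distort
# the statistics and simply remain NaN in the output.
numeric = df.drop(columns=['ID'])
zscores = (numeric - numeric.mean()) / numeric.std()

print(df[['ID']].join(zscores))
```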
Extending numpy.digitize to multi-dimensional data
I have a set of large arrays (about 6 million elements each) on which I basically want to perform np.digitize, but over multiple axes. I am looking for suggestions both on how to do this effectively and on how to store the results. I need all the indices (or all the values, or a mask) of array A
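A minimal sketch of one common workaround: digitize the flattened array and reshape the resulting bin indices, with invented bin edges and a small stand-in array:

```python
import numpy as np

A = np.random.rand(1000, 2000)      # stand-in for one of the large arrays
bins = np.linspace(0.0, 1.0, 11)    # invented bin edges

# np.digitize historically expects 1-D input, so flatten, digitize, then
# restore the original shape; each cell then holds its bin index.
bin_idx = np.digitize(A.ravel(), bins).reshape(A.shape)

# Example: a mask (and the values) of every element of A that fell into bin 3.
mask = bin_idx == 3
values_in_bin_3 = A[mask]
```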