I want to create my own transformer for use with the sklearn Pipeline. I am creating a class that implements both fit and transform methods. The purpose of the transformer will be to remove rows from the matrix that have more than a specified number of NaNs. The issue I am facing is how can I change both the X
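A minimal sketch of such a transformer, assuming the input is a NumPy array of floats; the class and parameter names (NaNRowFilter, max_nans) are invented for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NaNRowFilter(BaseEstimator, TransformerMixin):
    """Drop rows containing more than `max_nans` NaN values (hypothetical name)."""

    def __init__(self, max_nans=0):
        self.max_nans = max_nans

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Keep only rows whose NaN count does not exceed the threshold.
        mask = np.isnan(X).sum(axis=1) <= self.max_nans
        return X[mask]
```

Note that transform only returns the filtered X; a standard sklearn Pipeline gives transformers no way to shrink y to match, which is the crux of the question above.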
python pandas flatten a dataframe to a list
I have a df like so: I want to flatten the df so it is one continuous list like so: ['1/2/2014', 'a', '6', 'z1', '1/2/2014', 'a', '3', 'z1', '1/3/2014', 'c', '1', 'x3'] I can loop through the rows and extend to a list, but is there a much easier way to do it? Answer You can use .flatten() on the DataFrame converted to a NumPy array.
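A short sketch of that approach, using a small frame invented to mirror the values in the question (the column names are made up):

```python
import pandas as pd

# Hypothetical frame whose cells match the list shown in the question.
df = pd.DataFrame({
    'date': ['1/2/2014', '1/2/2014', '1/3/2014'],
    'col1': ['a', 'a', 'c'],
    'col2': ['6', '3', '1'],
    'col3': ['z1', 'z1', 'x3'],
})

# .values gives a NumPy array; flatten() walks it row by row (C order).
flat = df.values.flatten().tolist()
print(flat)
# ['1/2/2014', 'a', '6', 'z1', '1/2/2014', 'a', '3', 'z1', '1/3/2014', 'c', '1', 'x3']
```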
How to extract base path from DataFrame column of path strings
There are several questions about string manipulation, but I can't find an answer that allows me to do the following (I thought it would be simple). I have a DataFrame which includes a column containing a filename and path. The following produces a representative example DataFrame: I want to end up with just the 'filename' part of the string. There
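One way to get the filename portion, sketched under the assumption that the column holding the full paths is called 'path' (a made-up name, as are the paths):

```python
import os
import pandas as pd

# Representative frame; the column name 'path' and the paths are invented.
df = pd.DataFrame({'path': ['/data/raw/file_a.csv', '/data/raw/file_b.csv']})

# os.path.basename strips the directory part, leaving just the filename.
df['filename'] = df['path'].apply(os.path.basename)
print(df)
```

A pure-pandas alternative would be df['path'].str.split('/').str[-1], at the cost of assuming a fixed separator.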
Create Python DataFrame from dictionary where keys are the column names and values form the row
I am familiar with python but new to pandas DataFrames. I have a dictionary like this: And I would like to convert it to a DataFrame, where b and c are the column names, and the first row is 100, 300 (100 is underneath b and 300 is underneath c). I would like a solution that can be generalized to a
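A minimal sketch, assuming a dictionary shaped like the one described, with keys 'b' and 'c':

```python
import pandas as pd

d = {'b': 100, 'c': 300}

# Wrapping the dict in a list makes its keys the columns and its values one row.
df = pd.DataFrame([d])
print(df)
#      b    c
# 0  100  300
```

The same pattern generalizes to a list of dictionaries, producing one row per dictionary.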
Pandas versions compatible with specific python and numpy configurations?
Is there a programmatic way to find out which pandas versions are compatible with specific python and numpy configurations? My interest is to get pandas going within ESRI ArcMap 10.1, which runs on 32-bit Windows and is built on Python 2.7 and numpy 1.6. I tried creating a conda environment for Python compatible with ESRI ArcMap 10.1 by opening a 32-bit
Setting plot background colour in Seaborn
I am using Seaborn to plot some data in Pandas. I am making some very large plots (factorplots). To see them, I am using some visualisation facilities at my university: a compound screen made up of 4 by 4 monitors with a small (but nonzero) bezel, the gap between the screens. This gap is black. To minimise
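A hedged sketch of changing the background, assuming the aim is simply a non-white face colour (the colour value and the example dataset are arbitrary):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Override the axes and figure face colours through Seaborn's rc mechanism.
sns.set(rc={'axes.facecolor': 'lightgrey', 'figure.facecolor': 'lightgrey'})

# Any plot drawn afterwards picks up the new background.
tips = sns.load_dataset('tips')
sns.boxplot(x='day', y='total_bill', data=tips)
plt.show()
```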
Filtering multiple items in a multi-index Python pandas dataframe
I have the following table: Note: Both NSRCODE and PBL_AWI are indices. How do I search for values in column PBL_AWI? For example, I want to keep the values ['Lake', 'River', 'Upland']. Answer You can use get_level_values in conjunction with Boolean slicing. The same idea can be expressed in many different ways, such as df[df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])] Note that you have
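A self-contained sketch of that filter; the data, the 'area' column, and the index values are invented, but the two index level names come from the question:

```python
import pandas as pd

# Small frame with NSRCODE and PBL_AWI as a MultiIndex.
df = pd.DataFrame(
    {'area': [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [('E1', 'Lake'), ('E1', 'River'), ('E2', 'Bog'), ('E2', 'Upland')],
        names=['NSRCODE', 'PBL_AWI'],
    ),
)

# Boolean mask built from the PBL_AWI level of the index.
keep = df.index.get_level_values('PBL_AWI').isin(['Lake', 'River', 'Upland'])
print(df[keep])
```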
How to merge two dataframes in pandas to replace NaN
I want to do this in pandas: I have 2 dataframes, A and B, and I want to replace only the NaN values of A with values from B. Answer The officially promoted way to do exactly this is A.combine_first(B). Further information is in the official documentation. However, it is massively outperformed on large DataFrames by A.fillna(B) (tests performed with 25000 elements).
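A brief sketch of both options, assuming A and B share the same index and columns (the frames here are invented):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, 6.0]})
B = pd.DataFrame({'x': [10.0, 20.0, 30.0], 'y': [40.0, 50.0, 60.0]})

# combine_first: keep A's values, fall back to B where A is NaN.
merged_cf = A.combine_first(B)

# fillna with a DataFrame does the same positional fill for aligned frames.
merged_fn = A.fillna(B)

print(merged_cf.equals(merged_fn))  # True here, since the frames align exactly
```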
Pandas – Compute z-score for all columns
I have a dataframe containing a single column of IDs; all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it: Some of my columns contain NaN values which I do not want to include in the z-score calculations, so I intend to use a solution offered to this question: how to
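A hedged sketch of per-column z-scores that leaves NaN cells untouched; the column names, including the 'ID' column, are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'v1': [1.0, 2.0, np.nan, 4.0],
    'v2': [10.0, 20.0, 30.0, 40.0],
})

# mean() and std() skip NaN by default, so missing cells do not distort
# the statistics and simply remain NaN in the output.
numeric = df.drop(columns=['ID'])
zscores = (numeric - numeric.mean()) / numeric.std()

print(df[['ID']].join(zscores))
```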
Extending numpy.digitize to multi-dimensional data
I have a set of large arrays (about 6 million elements each) on which I basically want to perform np.digitize, but over multiple axes. I am looking for suggestions both on how to do this effectively and on how to store the results. I need all the indices (or all the values, or a mask) of array A
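A minimal sketch of one common workaround: digitize the flattened array and reshape the resulting bin indices, with invented bin edges and a small stand-in array:

```python
import numpy as np

A = np.random.rand(1000, 2000)      # stand-in for one of the large arrays
bins = np.linspace(0.0, 1.0, 11)    # invented bin edges

# np.digitize historically expects 1-D input, so flatten, digitize, then
# restore the original shape; each cell then holds its bin index.
bin_idx = np.digitize(A.ravel(), bins).reshape(A.shape)

# Example: a mask (and the values) of every element of A that fell into bin 3.
mask = bin_idx == 3
values_in_bin_3 = A[mask]
```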