Tag: pandas

Performance tuning: string wordcount in df

I have a df with a column “free text”. I wish to count how many characters and words each cell has. Currently, I do it like this: The problem is that it is pretty slow. I thought about using np.where but I wasn’t sure how. Would appreciate your help here. Answer: IIUC, you can try via str.len() an…
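The truncated answer points at the vectorized string accessor. A minimal sketch of that idea, assuming the column is named "free text" as in the question and that words are separated by whitespace (the sample rows are made up for illustration):

```python
import pandas as pd

# Made-up sample data; only the column name comes from the question.
df = pd.DataFrame({"free text": ["hello world", "one", "a b  c", None]})

# Treat missing cells as empty strings so both counts come out as integers.
text = df["free text"].fillna("")

# Characters per cell.
df["char_count"] = text.str.len()

# Words per cell: split on runs of whitespace, then count the pieces.
df["word_count"] = text.str.split().str.len()

print(df)
```

Both counts are computed over the whole column at once rather than cell by cell, which is usually where the speedup over a Python-level loop or apply comes from.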

Group by Issue with Years Pandas

dataframe pandas pandas-groupby python

I’m following the answer for this StackOverflow post to group a column of years by decades to make it easier for me to visualize later, but I’m not getting the same results. It seems like when DSM did it, it yielded integers for years, while mine is yielding floats for years. I’ve implemente…
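The excerpt is cut off before the asker's code, so the actual cause of the floats is not known. One common way to get integer decades is floor division, sketched here with a hypothetical "year" column. If the column contains missing values, pandas upcasts plain integers to float, and the nullable "Int64" dtype is one way to keep integers in that case:

```python
import pandas as pd

# Hypothetical data; the column name "year" is an assumption.
df = pd.DataFrame({"year": [1991, 1995, 2003, 2008, 2010]})

# Floor-divide by 10 and scale back up: 1995 -> 199 -> 1990.
df["decade"] = (df["year"] // 10) * 10

# Count rows per decade; the group keys stay integers.
counts = df.groupby("decade").size()
print(counts)

# With missing years, the nullable integer dtype avoids the float upcast.
with_gap = pd.Series([1991, None, 2003], dtype="Int64")
decade_nullable = (with_gap // 10) * 10
print(decade_nullable)
```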

Pandas: efficiently inserting a large number of rows

dataframe numpy pandas performance python

I have a large dataframe in this format, call this df:

index val1 val2
0     0.2  0.1
1     0.5  0.7
2     0.3  0.4

I have a row I will be inserting, call this myrow:

index val1 val2
-1    0.9  0.9

I wish to insert this row 3 times after every row in the original dataframe, i.e.:

index val1 val2
0
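The expected output is truncated, so the following is only a sketch of one possible approach: build the interleaved values in a single NumPy array instead of inserting rows one at a time. It assumes all columns are numeric and that the inserted rows keep the index label -1 from myrow.

```python
import numpy as np
import pandas as pd

# Data taken from the question.
df = pd.DataFrame({"val1": [0.2, 0.5, 0.3], "val2": [0.1, 0.7, 0.4]})
myrow = pd.DataFrame({"val1": [0.9], "val2": [0.9]}, index=[-1])
n_copies = 3

n_rows, n_cols = df.shape

# One slab per original row: slot 0 holds the row, slots 1..n_copies hold myrow.
block = np.empty((n_rows, 1 + n_copies, n_cols))
block[:, 0, :] = df.to_numpy()
block[:, 1:, :] = myrow.to_numpy()  # broadcasts over rows and copies

# Build the matching index the same way (assumption: keep -1 for inserted rows).
idx = np.empty((n_rows, 1 + n_copies), dtype=np.int64)
idx[:, 0] = df.index.to_numpy()
idx[:, 1:] = myrow.index[0]

# Flatten the slabs back into a 2-D frame.
result = pd.DataFrame(block.reshape(-1, n_cols), columns=df.columns, index=idx.ravel())
print(result)
```

The whole result is allocated once, so the cost grows linearly with the number of rows, whereas repeated concatenation or row-wise insertion copies the frame each time.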

stacked chart combine with alluvial plot – python

pandas python sankey-diagram

Surprisingly little info out there regarding python and the pyalluvial package. I’m hoping to combine stacked bars and a corresponding alluvial in the same figure. Using the data below, I have three unique groups, which are outlined in Group. I want to display the proportion of each Group for each unique Point. I…

How to sum a sequence in pandas?

pandas python

I need to do some coding in python and I can’t do this code: I need to do something like this as result: For me the sequence matters most in my analysis. It’s a sum of the results in interviews. Thanks guys for the help! Answer Here is another approach using reindex and unstack:

How to make multiple plots with seaborn from a wide dataframe

bar-chart matplotlib pandas python seaborn

I’m currently learning about data visualization using seaborn, and I came across a problem that I couldn’t find a solution to. So I have this data:

index col1 col2 col3 col4 col5 col6 col7 col8
1990  0    4    7    3    7    0    6    6
1991  1    7    5    0    8    1    8    4
1992  0    5    0    1    9    1
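The question is cut off before it says which plots are wanted, so this is only a sketch of a common first step: reshaping the wide table into long form, which is the layout seaborn's plotting functions generally expect. Only the two complete rows from the excerpt are used, and the names "year", "column" and "value" are assumptions.

```python
import pandas as pd

# The two complete rows from the question; the 1992 row is truncated there.
wide = pd.DataFrame(
    [[0, 4, 7, 3, 7, 0, 6, 6], [1, 7, 5, 0, 8, 1, 8, 4]],
    index=pd.Index([1990, 1991], name="year"),  # index name is an assumption
    columns=[f"col{i}" for i in range(1, 9)],
)

# One row per (year, column) pair instead of one column per series.
long = wide.reset_index().melt(id_vars="year", var_name="column", value_name="value")
print(long.head())
```

With the data in this shape, a figure-level call such as sns.catplot(data=long, x="year", y="value", col="column", col_wrap=4, kind="bar") would draw one bar chart per original column.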

Why does matplotlib.pyplot.savefig() mess up image outputs for very large pandas.plotting.scatter_matrix()?

dataframe matplotlib pandas python

I was trying to compute the pandas.plotting.scatter_matrix() values for a very large pandas.DataFrame() (relatively speaking for this specific operation; most libraries either run OOM most of the time or implement a row count check of 50000, see vaex-scatter). The ‘Time series’ DataFrame shape I hav…

Python 3 – How do I extract data from SQL database and process the data and append to pandas dataframe row by row?

dataframe mysql pandas python python-3.x

I have a MySQL database, its columns are: I need to extract data from it and process the data and add the data to a pandas DataFrame. I know how to extract data from SQL database, and I have already implemented a way to pass the data to DataFrame, but it is extremely slow (about 30 seconds), whereas when I
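The column list and the asker's code are missing from the excerpt, so the table and column names below are hypothetical. As a sketch of the general idea, the query result can be loaded into a DataFrame in one call and processed column-wise, rather than appending to the DataFrame row by row. An in-memory SQLite database stands in for MySQL here so the example runs without a server; with MySQL the connection would typically be a SQLAlchemy engine instead.

```python
import sqlite3

import pandas as pd

# Stand-in database: SQLite in memory instead of MySQL (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, value REAL)")  # hypothetical schema
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 0.5), (2, 1.5), (3, 2.5)],
)
conn.commit()

# Load the whole result set at once.
df = pd.read_sql_query("SELECT id, value FROM readings", conn)
conn.close()

# Process whole columns at a time instead of looping over rows.
df["value_doubled"] = df["value"] * 2
print(df)
```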

How to replace any number of special characters with a space in a dataframe column

pandas python

I have a column in Pandas that has a number of @ characters in between words. The number of consecutive @ characters is random, and I can’t replace each one with a single space or with an empty string, since it would create cases such as:

Original string | Replacing with ‘’ | Replacing with ‘_’ or single space
Sun i…
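One way to collapse each run of @ into a single space is a regular expression with the + quantifier. A minimal sketch, with made-up strings and a hypothetical column name "text":

```python
import pandas as pd

# Made-up sample strings; the column name is an assumption.
df = pd.DataFrame({"text": ["Sunday@@@is@good", "no_specials", "a@@b@@@@c"]})

# "@+" matches one or more consecutive @, so each run becomes exactly one space.
df["clean"] = df["text"].str.replace(r"@+", " ", regex=True)
print(df)
```

To treat several special characters the same way, the pattern can use a character class instead, for example r"[@#]+".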

Python Pandas Dataframe enrichment (from another)

dataframe merge numpy pandas python

I would like to enrich a dataframe (df1) from another (df2) by adding a new column in df1 and filling it based on what I find in df2. The two dataframes differ in size and in column names. I would like to do something like the VLOOKUP function in Excel. This is what I’ve done but I
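The asker's attempt is cut off, so this is only a sketch of the usual pandas counterpart to a VLOOKUP: a left merge on the key columns. All column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical frames with different sizes and different key column names.
df1 = pd.DataFrame({"product_id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})

# A left merge keeps every row of df1 and pulls in matching values from df2.
enriched = df1.merge(df2, how="left", left_on="product_id", right_on="id")

# Drop the duplicate key column that came from df2.
enriched = enriched.drop(columns="id")
print(enriched)
```

Keys in df1 with no match in df2 end up with a missing value in the new column. If df2 contains the same key more than once, the merge produces one output row per match, so deduplicating df2 first may be needed.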