I have a dataframe like this (the real one has 7 million records and 345 features). The following image shows only a small fraction of it, indicating whether a customer made an operation in a given month. What I want to do is create a column at the end with the mean difference between operations. For example, in the first record…
Tag: pandas
Update column based on other column condition
I need to update vid, or maybe create a new column, based on the change column: df = [{'vid': 14, 'change': 0}, {'vid': 15, 'change': 1}, {'vid': 16, 'change': 0}, {'vid': 16, 'change': 0}, {'vid': 17, …
How to split parallel corpora while keeping alignment?
I have two text files containing parallel text in two languages (potentially millions of lines). I am trying to generate random train/validate/test files from that single file, as train_test_split does in sklearn. However when I try to import it into pandas using read_csv I get errors from many of the lines b…
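One way to keep the two languages aligned is to avoid read_csv altogether, pair the lines, and shuffle the pairs together. The sketch below is only an illustration under assumptions: the function name split_parallel, the 80/10/10 ratios, and the in-memory sample sentences are hypothetical and do not come from the question. In practice the two lists would be read from the text files with plain open().

```python
import random

def split_parallel(src_lines, tgt_lines, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle aligned sentence pairs together and cut them into three splits.

    The name, the default ratios, and the seed are hypothetical choices.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("parallel files must have the same number of lines")
    # Pairing first means a shuffle can never separate a source line from its translation.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * ratios[0])
    n_valid = int(len(pairs) * ratios[1])
    return {
        "train": pairs[:n_train],
        "valid": pairs[n_train:n_train + n_valid],
        "test": pairs[n_train + n_valid:],
    }

# Made-up sample data standing in for the two files.
src = [f"src sentence {i}" for i in range(10)]
tgt = [f"tgt sentence {i}" for i in range(10)]
splits = split_parallel(src, tgt)
```

Each split is a list of (source, target) tuples, so writing them back out as two files per split preserves the line-by-line correspondence.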
Pandas: Remove Column Based on Threshold Criteria
I have to solve this problem: Objective: Drop columns in which most rows are missing. Inputs: 1. Dataframe df: a pandas dataframe. 2. threshold: determines which columns will be dropped. If threshold is .9, the columns with 90% missing values will be dropped. Outputs: 1. Dataframe df with dropped columns (if no column…
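A minimal sketch of the stated objective, assuming that "90% missing" means a missing fraction greater than or equal to the threshold (the excerpt does not say whether the boundary is inclusive). The function name drop_sparse_columns and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Return df without the columns whose missing fraction is >= threshold.

    Treating the boundary as inclusive is an assumption, not part of the task text.
    """
    # isna() gives a boolean frame; its column-wise mean is the fraction of missing rows.
    missing_fraction = df.isna().mean()
    keep = missing_fraction[missing_fraction < threshold].index
    return df.loc[:, keep]

# Made-up sample: "b" is entirely missing, "c" is 25% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0],
})
result = drop_sparse_columns(df, threshold=0.9)
```

With this sample, only column "b" reaches the 0.9 threshold, so "a" and "c" are kept.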
How to create subplots from each column in a pandas dataframe
I have a dataframe ‘df’ with 36 columns; these columns are plotted onto a single plotly chart and displayed in HTML format using the code below. I want to iterate through each column and create a subplot for each one. I have tried: I created 6 rows and columns, as that would give 36 plots, and tried…
Remove timezone (+01:00) from DateTime
I would like to delete the timezone from my datetime object. Currently I have: 2019-02-21 15:31:37+01:00 Expected output: 2019-02-21 15:31:37 The code I have converts it to: 2019-02-21 14:31:37. Answer In the first line, the parameter utc=True is not necessary as it converts the input to UTC (subtracting one …
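The behaviour described in the answer can be reproduced with a short sketch. It uses tz_localize(None), which drops the timezone while keeping the local wall-clock time; the original code is not shown in the excerpt, so the variable names here are illustrative only.

```python
import pandas as pd

# Parsing without utc=True keeps the +01:00 offset attached to the value.
ts = pd.to_datetime("2019-02-21 15:31:37+01:00")
# tz_localize(None) removes the timezone and keeps the local wall-clock time.
naive = ts.tz_localize(None)

# The same idea for a whole column goes through the .dt accessor.
s = pd.to_datetime(pd.Series(["2019-02-21 15:31:37+01:00"]))
naive_series = s.dt.tz_localize(None)

# With utc=True the value is first converted to UTC, which shifts it by one hour.
shifted = pd.to_datetime("2019-02-21 15:31:37+01:00", utc=True).tz_localize(None)
```

Here naive keeps the 15:31:37 reading, while shifted shows the 14:31:37 value the question complains about.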
Pandas rolling sum with groupby and conditions
I have a dataframe with a timeseries of sales of different items with customer analytics. For each item and a given day, I want to compute: (1) the share of my best customer in the last 2 days' total sales; (2) the share of my top customers (from a list) in the last 2 days' total sales. I’ve tried solutions provided here: for r…
Pandas – Duplicate Rows and Slice String
I’m trying to create duplicate rows in a dataframe based on conditions. For example, I have this Dataframe. And I would like to get the following output: Answer For pandas 0.25+ it is possible to use DataFrame.explode with values split by Series.str.split, and for the remark column a list comprehension with filteri…
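The split-then-explode part of the answer can be sketched as follows. The question's dataframe is not shown in the excerpt, so the columns id and codes and the comma separator are made up for illustration; the remark-column step is omitted because the answer is cut off before describing it.

```python
import pandas as pd

# Hypothetical sample data; the real columns are not visible in the excerpt.
df = pd.DataFrame({"id": [1, 2], "codes": ["a,b", "c"]})

out = (
    # Series.str.split turns each string into a list of parts.
    df.assign(codes=df["codes"].str.split(","))
      # DataFrame.explode (pandas 0.25+) emits one row per list element,
      # repeating the other columns.
      .explode("codes")
      .reset_index(drop=True)
)
```

The row with "a,b" becomes two rows that share the same id, and the single-value row passes through unchanged.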
Is it possible to display pandas styles in the IPython console?
Is it possible to display pandas styles in an IPython console? The following code in a Jupyter notebook correctly produces In the console I only get Is it possible to achieve a similar result here, or is the style engine dependent on an HTML frontend? Thanks in advance for any help. Answer I believe that the …
Drop rows that contains the data between specific dates
The file contains data by date and time: All I want is to drop the rows that fall between these dates, including the start and end dates: Any idea? Answer Sample: Use boolean indexing to filter by condition, chaining with | for bitwise OR: Or filter by Series.between and invert the mask with ~:
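Both approaches named in the answer can be sketched as below. The original sample and dates are not shown in the excerpt, so the column names date and value and the start and end values are hypothetical.

```python
import pandas as pd

# Made-up sample data standing in for the file's contents.
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-10", "2019-01-15"]),
    "value": [1, 2, 3, 4],
})
start = pd.Timestamp("2019-01-05")
end = pd.Timestamp("2019-01-10")

# Option 1: keep rows strictly before start OR strictly after end.
kept_or = df[(df["date"] < start) | (df["date"] > end)]

# Option 2: Series.between includes both endpoints by default; ~ inverts the mask.
kept_between = df[~df["date"].between(start, end)]
```

Both filters drop the rows dated on the start and end dates as well as anything in between, so the two results are identical.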