I’m trying to create a user-defined function that takes a cumulative sum of an array and compares the value to another column. Here is a reproducible example: In Pandas, this is the output: In Spark using temp_sdf.withColumn('len', test_function_udf('x_ary', 'y')), al…
Tag: pandas
How to subtract date and time in Pandas?
I have data from Pandas which was the contents of a CSV file: I aim to convert the column Date from timestamps to time periods in units of minutes, which should result in something like the following: Answer You can subtract the first timestamp to calculate the difference, then get total_seconds() and co…
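The approach described in the answer excerpt above (subtract the first timestamp, then use total_seconds()) can be sketched as follows. The sample timestamps and the Minutes column name are hypothetical, since the original CSV contents are not shown; dividing by 60 is assumed to be the conversion step the truncated answer goes on to describe.

```python
import pandas as pd

# Hypothetical sample data; the original CSV is not shown in the excerpt.
df = pd.DataFrame({"Date": pd.to_datetime([
    "2021-01-01 10:00:00",
    "2021-01-01 10:05:00",
    "2021-01-01 10:12:30",
])})

# Elapsed time since the first timestamp, as a Timedelta Series.
elapsed = df["Date"] - df["Date"].iloc[0]

# total_seconds() gives floats; divide by 60 to express them in minutes.
df["Minutes"] = elapsed.dt.total_seconds() / 60

print(df)
```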
Getting min and max datetime for each date in CSV
I’m kind of new to data science and Python. First of all, do you suggest using any library other than pandas when dealing with a huge dataset (100K+ rows)? Second, let me explain my current problem. I have a dataset with a Datetime column; to make it easy to understand, let’s…
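Going by the question title alone (the excerpt is cut off before the data is shown), one possible way to get the earliest and latest timestamp per calendar date is a groupby on the date part. The column name Datetime comes from the excerpt; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV contents.
df = pd.DataFrame({"Datetime": pd.to_datetime([
    "2021-03-01 08:15:00",
    "2021-03-01 17:40:00",
    "2021-03-02 09:00:00",
    "2021-03-02 12:30:00",
    "2021-03-02 18:05:00",
])})

# Group rows by calendar date, then take the min and max timestamp in each group.
per_day = df.groupby(df["Datetime"].dt.date)["Datetime"].agg(["min", "max"])

print(per_day)
```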
Why does this pandas str.extract pattern work?
I have a dataframe “movies” with column “title”, which contains movie titles and their release year in the following format: The Pirates (2014) I’m testing different ways to extract just the title portion, which in the example above would be “The Pirates”, into a new …
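The pattern the asker tested is not shown in the excerpt, so the following is only one possible sketch of extracting the title portion with str.extract. The regular expression, the second sample title, and the title_only column name are assumptions made for illustration.

```python
import pandas as pd

# "The Pirates (2014)" is from the excerpt; the second row is a made-up example.
movies = pd.DataFrame({"title": ["The Pirates (2014)", "Toy Story (1995)"]})

# Capture everything before a trailing " (YYYY)"; the lazy group stops as soon
# as the rest of the pattern (optional spaces plus a 4-digit year) can match.
pattern = r"^(.*?)\s*\(\d{4}\)\s*$"
movies["title_only"] = movies["title"].str.extract(pattern, expand=False)

print(movies)
```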
How to collapse overlapping intervals [start-end] and keep the smaller?
I have a Pandas dataframe of intervals defined by 2 numerical coordinates, ‘start’ and ‘end’. I am trying to collapse all intervals that are overlapping, and keep the inner coordinates. Output: The same Pandas dataframe with collapsed intervals and inner coordinates. Two intervals over…
Drop rows from dataframe where problematic values are in separate list
I have a list of problematic rows where there is a unique identifier, all of which I want to remove from a dataframe. I’ve tried to use loc to index them, as follows: where df is 5063 rows x 28 cols and toDel['GUID'] is a list of GUIDs that I want to remove from the df. I expected this to
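The asker's loc attempt is not shown, so this is not a fix for that exact code. A common way to drop rows whose identifier appears in a separate list is a boolean mask built with isin. The names df, toDel, and GUID follow the excerpt; the sample values and the value column are hypothetical.

```python
import pandas as pd

# Small hypothetical stand-ins for the 5063 x 28 dataframe and the GUID list.
df = pd.DataFrame({"GUID": ["a1", "b2", "c3", "d4"], "value": [10, 20, 30, 40]})
toDel = pd.DataFrame({"GUID": ["b2", "d4"]})

# Keep only the rows whose GUID is NOT among the problematic ones.
cleaned = df[~df["GUID"].isin(toDel["GUID"])].reset_index(drop=True)

print(cleaned)
```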
Unable to fix “ValueError: DataFrame constructor not properly called!”
I was asked to write a program for Linear Regression with the following steps. Load the R data set mtcars as a pandas dataframe. Build another linear regression model by considering the log of independent variable wt, and log of dependent variable mpg. Fit the model with data, and display the R-squared value …
How to insert today’s date in SQL select statement using python?
I’m trying to pass a today variable into a SQL query, but it is not working. Answer You don’t have to compute today’s date in Python. Just use the PostgreSQL function CURRENT_DATE:
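The original query is not shown. As a rough illustration of letting the database supply today's date, here is a sketch that uses Python's built-in sqlite3 module as a stand-in, because SQLite also understands CURRENT_DATE. With PostgreSQL the same keyword would go into the query sent through whatever driver is in use.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# The database evaluates CURRENT_DATE itself, so no Python date is passed in.
row = conn.execute("SELECT CURRENT_DATE").fetchone()
today_from_db = row[0]  # SQLite returns a 'YYYY-MM-DD' string (UTC)

conn.close()
print(today_from_db)
```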
PyTorch: Dataloader for time series task
I have a Pandas dataframe with n rows and k columns loaded into memory. I would like to get batches for a forecasting task where the first training example of a batch should have shape (q, k) with q referring to the number of rows from the original dataframe (e.g. 0:128). The next example should be (128:256, …
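The windowing described above (rows 0:128, then 128:256, and so on) can be sketched in plain NumPy, without any PyTorch Dataset or DataLoader wrapper. The helper name iter_windows is hypothetical, and dropping a final incomplete window is an assumption, since the excerpt does not say how leftover rows should be handled.

```python
import numpy as np
import pandas as pd

def iter_windows(frame, q):
    """Yield consecutive, non-overlapping (q, k) arrays from a dataframe.

    Hypothetical helper: a trailing window with fewer than q rows is dropped.
    """
    values = frame.to_numpy()
    n_full = len(values) // q
    for i in range(n_full):
        yield values[i * q:(i + 1) * q]

# Tiny example with n=10 rows and k=2 columns, using q=4 instead of 128.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["a", "b"])
windows = list(iter_windows(df, q=4))

print(len(windows), windows[0].shape)
```

Each yielded array could then be converted to a tensor and served through a Dataset, if that is the intended pipeline.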
Function that takes n rows as input and returns column names if sum in column equals n
I have a large DataFrame that is structured as follows: I am trying to build a function that takes n row names as arguments, sums the values in each column over those rows, and returns the column names whose sum equals n. For instance, using label1, label2 and label3 as inputs I would like to obtain the
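The DataFrame itself is not shown, so the sketch below assumes rows are labelled label1, label2, ... in the index and that cells hold 0/1 values. The function name columns_summing_to_n and the sample data are hypothetical.

```python
import pandas as pd

def columns_summing_to_n(frame, *row_labels):
    """Return the columns whose sum over the given rows equals the number of rows."""
    n = len(row_labels)
    totals = frame.loc[list(row_labels)].sum(axis=0)
    return totals.index[totals == n].tolist()

# Hypothetical 0/1 data with row labels in the index.
df = pd.DataFrame(
    {"c1": [1, 1, 1, 0], "c2": [1, 0, 1, 1], "c3": [1, 1, 1, 1]},
    index=["label1", "label2", "label3", "label4"],
)

result = columns_summing_to_n(df, "label1", "label2", "label3")
print(result)
```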