I need help plotting some categorical and numerical values in Python; the code is given below. However, the data size is so huge (big data) that I’m not able to make a meaningful plot. Basically, I just want to take the top 5 or top 10 values and plot only those, as shown below:
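A minimal sketch of that idea with pandas and matplotlib, assuming a hypothetical column named `category` (swap in the real column name): `value_counts()` sorts by frequency, so keeping only its head makes the bar chart readable even on huge data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; "category" stands in for the real column name.
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a", "b", "c", "a"]})

# value_counts() returns counts sorted in descending order,
# so head(10) keeps only the 10 most frequent categories.
top = df["category"].value_counts().head(10)

top.plot(kind="bar")
plt.ylabel("count")
plt.tight_layout()
plt.show()
```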
Dask “Column assignment doesn’t support type numpy.ndarray”
I’m trying to use Dask instead of pandas since the data size I’m analyzing is quite large. I wanted to add a flag column based on several conditions, but then I got the following error message. The above code works perfectly when using np.where with a pandas dataframe, but it doesn’t work with dask.array.where. Answer If numpy works and the operation is
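One common workaround, sketched below under assumed column names: apply np.where per partition via map_partitions, since each partition is a plain pandas DataFrame, and wrap the resulting ndarray back into a pandas Series so Dask can assign it as a column.

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical frame; the column name and threshold are illustrative.
pdf = pd.DataFrame({"a": range(8)})
ddf = dd.from_pandas(pdf, npartitions=2)

# Each partition is a plain pandas DataFrame, so np.where works there;
# wrapping the ndarray in a pandas Series (with the partition's index)
# gives Dask something it can assign as a column.
ddf["flag"] = ddf.map_partitions(
    lambda part: pd.Series(np.where(part["a"] > 3, "big", "small"), index=part.index),
    meta=("flag", "object"),
)

print(ddf.compute())
```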
(python) quicksort working for ordered data, but not for unordered data
I am working on an implementation of recursive quicksort in Python, on very large data sets (10,000 – 1,000,000 elements). When I feed it ordered data (i.e. an array sorted largest -> smallest that should be reordered smallest -> largest), it works fine, but when I give it unordered data it doesn’t seem to work at all. I’m using a
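For reference, a minimal recursive quicksort in Python (not the asker’s code); note the random pivot, which keeps the expected recursion depth near O(log n) and so avoids Python’s recursion limit on large, already-ordered inputs:

```python
import random

def quicksort(arr):
    # A minimal list-based quicksort; a reference sketch, not the asker's code.
    if len(arr) <= 1:
        return arr
    # A random pivot keeps the expected recursion depth at O(log n),
    # avoiding the worst-case O(n^2) depth that a fixed first/last-element
    # pivot hits on already-sorted input.
    pivot = random.choice(arr)
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
```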
pyspark regex extract all
I have a dataframe like below. I am trying to extract the next word after function or var. My code is here. As it captures only one word, the final row returns only AWS and not Twitter, so I would like to capture all matches. My Spark version is less than 3, so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\\s+(\\w+)|(var)\\s+(\\w+)', 4)")).show()
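One possible workaround for Spark < 3, where regexp_extract_all is unavailable: a plain Python UDF with re.finditer can collect every match. The sample data, column name, and pattern below are illustrative:

```python
import re
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample mirroring the question's description.
df = spark.createDataFrame(
    [("function aws() {}; var twitter = 1;",)], ["js"]
)

# A Python UDF with re.finditer collects every match, which works
# on Spark < 3 where regexp_extract_all does not exist.
@f.udf(returnType=ArrayType(StringType()))
def extract_names(s):
    return [m.group(2) for m in re.finditer(r"(function|var)\s+(\w+)", s or "")]

df.withColumn("output", extract_names(f.col("js"))).show(truncate=False)
```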
Pyspark: how to duplicate a row n times in a dataframe?
I’ve got a dataframe like this, and I want to duplicate each row n times if the value in column n is bigger than one: And transform it like this: I think I should use explode, but I don’t understand how it works… Thanks Answer The explode function returns a new row for each element in the given array or map. One way
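A short sketch of the explode approach, with hypothetical column names: array_repeat (Spark >= 2.4) builds an n-element array per row, and exploding it emits one output row per element, i.e. n copies of the original row:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame: column "n" holds each row's repeat count.
df = spark.createDataFrame([("a", 1), ("b", 3)], ["value", "n"])

# array_repeat builds an n-element array (casting n to int, since
# Python ints load as bigint); exploding it emits one output row
# per element, i.e. n copies of the original row.
result = (
    df.withColumn("copies", f.explode(f.expr("array_repeat(1, cast(n as int))")))
      .drop("copies")
)
result.show()
```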
sklearn Clustering: Fastest way to determine the optimal number of clusters on large data sets
I use KMeans and the silhouette_score from sklearn in Python to compute my clusters, but on >10,000 samples with >1,000 clusters, calculating the silhouette_score is very slow. Is there a faster method to determine the optimal number of clusters? Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with
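One faster route, sketched below on synthetic data: swap KMeans for MiniBatchKMeans and pass sample_size to silhouette_score so it scores a random subsample rather than all pairwise distances:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the large data set.
X, _ = make_blobs(n_samples=50_000, centers=20, random_state=0)

for k in (10, 20, 30):
    # MiniBatchKMeans fits on small random batches, which is much
    # faster than full KMeans; silhouette_score's sample_size option
    # scores a random subsample instead of all O(n^2) pairs.
    labels = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, sample_size=5_000, random_state=0)
    print(k, round(score, 3))
```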