I need help plotting some categorical and numerical values in Python; the code is given below. However, the data size is so huge (big data) that I’m not able to make a meaningful plot. Basically, I just want to take the top 5 or top 10 values and plot only those, as shown below:
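A minimal sketch of that idea with pandas and matplotlib, assuming a hypothetical column named `category` (swap in the real column name): `value_counts()` sorts by frequency, so keeping only its head makes the bar chart readable even on huge data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; "category" stands in for the real column name.
df = pd.DataFrame({"category": ["a", "b", "a", "c", "a", "b", "c", "a"]})

# value_counts() returns counts sorted in descending order,
# so head(10) keeps only the 10 most frequent categories.
top = df["category"].value_counts().head(10)

top.plot(kind="bar")
plt.ylabel("count")
plt.tight_layout()
plt.show()
```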
Dask “Column assignment doesn’t support type numpy.ndarray”
I’m trying to use Dask instead of pandas since the data size I’m analyzing is quite large. I wanted to add a flag column based on several conditions, but then I got the following error message. The above code works perfectly when using np.where with a pandas dataframe, but it doesn’t work with dask.array.where. Answer If numpy works and the operation is
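One common workaround, sketched below under assumed column names: apply np.where per partition via map_partitions, since each partition is a plain pandas DataFrame, and wrap the resulting ndarray back into a pandas Series so Dask can assign it as a column.

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical frame; the column name and threshold are illustrative.
pdf = pd.DataFrame({"a": range(8)})
ddf = dd.from_pandas(pdf, npartitions=2)

# Each partition is a plain pandas DataFrame, so np.where works there;
# wrapping the ndarray in a pandas Series (with the partition's index)
# gives Dask something it can assign as a column.
ddf["flag"] = ddf.map_partitions(
    lambda part: pd.Series(np.where(part["a"] > 3, "big", "small"), index=part.index),
    meta=("flag", "object"),
)

print(ddf.compute())
```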
(python) quicksort working for ordered data, but not for unordered data
I am working on an implementation of recursive quicksort in Python, on very large data sets (10,000 – 1,000,000 elements). When I feed it ordered data (i.e. an array sorted largest -> smallest that should be reordered smallest -> largest), it works fine, but when I give it unordered data it doesn’t seem to work at all. I’m using a
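For reference, a minimal recursive quicksort in Python (not the asker’s code); note the random pivot, which keeps the expected recursion depth near O(log n) and so avoids Python’s recursion limit on large, already-ordered inputs:

```python
import random

def quicksort(arr):
    # A minimal list-based quicksort; a reference sketch, not the asker's code.
    if len(arr) <= 1:
        return arr
    # A random pivot keeps the expected recursion depth at O(log n),
    # avoiding the worst-case O(n^2) depth that a fixed first/last-element
    # pivot hits on already-sorted input.
    pivot = random.choice(arr)
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
```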
pyspark regex extract all
I have a dataframe like below. I am trying to extract the next word after function or var. My code is here. As it captures only one word, the final row returns only AWS and not Twitter, so I would like to capture all matches. My Spark version is less than 3, so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\\s+(\\w+)|(var)\\s+(\\w+)', 4)")).show()
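One possible workaround for Spark < 3, where regexp_extract_all is unavailable: a plain Python UDF with re.finditer can collect every match. The sample data, column name, and pattern below are illustrative:

```python
import re
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample mirroring the question's description.
df = spark.createDataFrame(
    [("function aws() {}; var twitter = 1;",)], ["js"]
)

# A Python UDF with re.finditer collects every match, which works
# on Spark < 3 where regexp_extract_all does not exist.
@f.udf(returnType=ArrayType(StringType()))
def extract_names(s):
    return [m.group(2) for m in re.finditer(r"(function|var)\s+(\w+)", s or "")]

df.withColumn("output", extract_names(f.col("js"))).show(truncate=False)
```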
Pyspark: how to duplicate a row n times in a dataframe?
I’ve got a dataframe like this, and I want to duplicate each row n times if the value in column n is bigger than one: And transform it like this: I think I should use explode, but I don’t understand how it works… Thanks Answer The explode function returns a new row for each element in the given array or map. One way
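A short sketch of the explode approach, with hypothetical column names: array_repeat (Spark >= 2.4) builds an n-element array per row, and exploding it emits one output row per element, i.e. n copies of the original row:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame: column "n" holds each row's repeat count.
df = spark.createDataFrame([("a", 1), ("b", 3)], ["value", "n"])

# array_repeat builds an n-element array (casting n to int, since
# Python ints load as bigint); exploding it emits one output row
# per element, i.e. n copies of the original row.
result = (
    df.withColumn("copies", f.explode(f.expr("array_repeat(1, cast(n as int))")))
      .drop("copies")
)
result.show()
```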
sklearn Clustering: Fastest way to determine the optimal number of clusters on large data sets
I use KMeans and the silhouette_score from sklearn in Python to compute my clusters, but on >10,000 samples with >1,000 clusters, calculating the silhouette_score is very slow. Is there a faster method to determine the optimal number of clusters? Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with
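One faster route, sketched below on synthetic data: swap KMeans for MiniBatchKMeans and pass sample_size to silhouette_score so it scores a random subsample rather than all pairwise distances:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the large data set.
X, _ = make_blobs(n_samples=50_000, centers=20, random_state=0)

for k in (10, 20, 30):
    # MiniBatchKMeans fits on small random batches, which is much
    # faster than full KMeans; silhouette_score's sample_size option
    # scores a random subsample instead of all O(n^2) pairs.
    labels = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, sample_size=5_000, random_state=0)
    print(k, round(score, 3))
```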