
Tag: bigdata

pyspark regex extract all

I have a dataframe like below. I am trying to extract the next word after function or var. My code is here. As it captures only one word, the final row returns only AWS and not Twitter, so I would like to capture all matches. My Spark version is less than 3, so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\s+(\w+)|(var)\s+(\w+)', 4)")).show()
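A minimal sketch of one workaround: regexp_extract_all only exists as a Spark SQL function from Spark 3.1 onward, so on a 2.x cluster a Python UDF built on re.finditer can return every match. The sample data and the simplified pattern below are assumptions for illustration, not the asker's exact frame.

```python
import re
from pyspark.sql import SparkSession, functions as f, types as t

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: a JS snippet declaring both a function and a var.
df = spark.createDataFrame([("function AWS() { var Twitter = 1; }",)], ["js"])

# regexp_extract_all is unavailable before Spark 3.1, so a Python UDF
# with re.finditer is a common substitute on Spark 2.x.
@f.udf(t.ArrayType(t.StringType()))
def extract_names(js):
    # Return the word following every "function" or "var" occurrence.
    return [m.group(2) for m in re.finditer(r"(function|var)\s+(\w+)", js or "")]

df.withColumn("output", extract_names("js")).show(truncate=False)
```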

Pyspark: how to duplicate a row n times in a dataframe?

I’ve got a dataframe like this, and I want to duplicate the row n times if the column n is bigger than one: And transform it like this: I think I should use explode, but I don’t understand how it works… Thanks Answer The explode function returns a new row for each element in the given array or map. One way
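A minimal sketch of that idea, assuming Spark 2.4+ (for the sequence SQL function) and a repeat count n of at least 1: build an array of length n per row, then explode it so each element yields one copy of the row.

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame: each row carries its repeat count in column "n".
df = spark.createDataFrame([("a", 1), ("b", 3)], ["value", "n"])

# sequence(1, n) builds [1, 2, ..., n]; explode() then emits one output
# row per array element, duplicating the row n times. The cast keeps
# sequence's arguments the same integral type.
duplicated = (
    df.withColumn("seq", f.expr("sequence(1, cast(n as int))"))
      .withColumn("seq", f.explode("seq"))
      .drop("seq")
)
duplicated.show()
```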

sklearn Clustering: Fastest way to determine the optimal number of clusters on large data sets

I use KMeans and silhouette_score from sklearn in Python to compute my clusters, but on >10,000 samples with >1,000 clusters, calculating the silhouette_score is very slow. Is there a faster method to determine the optimal number of clusters? Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with
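A minimal sketch of two standard speedups, shown on synthetic data as a stand-in for the asker's set: MiniBatchKMeans instead of full KMeans, and silhouette_score's sample_size parameter, which scores a random subsample rather than paying the full pairwise-distance cost. The k range and sizes below are illustrative assumptions.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a large data set.
X, _ = make_blobs(n_samples=20_000, centers=12, random_state=0)

for k in range(2, 16):
    # MiniBatchKMeans trades a little accuracy for a large speedup
    # over full KMeans on big inputs.
    labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(X)
    # sample_size scores a random subsample, avoiding the quadratic
    # cost of the full silhouette computation.
    score = silhouette_score(X, labels, sample_size=2_000, random_state=0)
    print(k, round(score, 3))
```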
