Why is this task faster in Python than Julia?

I ran the following code in RStudio:

exo <- read.csv('exoplanets.csv',TRUE,",")
df <- data.frame(exo)

ranks <- 570
files <- 3198
datas <- vector()

for ( w in 2:files ) {
    listas <-vector()
    for ( i in 1:ranks) {
            name <- as.character(df[i,w])
            listas <- append (listas, name)
    }
    datas <- append (datas, listas)
}

It reads a huge NASA CSV file, converts it to a data frame, converts each element to a string, and appends it to a vector.

RStudio took 4 minutes and 15 seconds.

So I decided to implement the same code in Julia. I ran the following in VS Code:

using CSV, DataFrames

df = CSV.read("exoplanets.csv", DataFrame)

fil, col = 570, 3198
arr = []

for i in 2:fil
        for j in 1:col
            push!(arr, string(df[i, j]))
        end
end

The result was good. The Julia code took only 1 minute and 25 seconds!

Then, out of pure curiosity, I implemented the same code in Python to compare. I ran the following in VS Code:

import numpy as np
import pandas as pd

exo = pd.read_csv("exoplanets.csv")
arr = np.array(exo)

fil, col = 570, 3198
lis = []

for i in range(1, fil):
        for j in range(col):
            lis.append(arr[i][j].astype('str'))

The result shocked me! Only 35 seconds!!! And in Spyder from Anaconda only 26 seconds!!! Almost 2 million floats!!! Is Julia slower than Python in data analysis? Can I improve the Julia code?

Answer

NOTE: I wrote the code below assuming you want the other column order (as in the Python and R examples). It is more efficient in Julia this way; to reproduce your original behaviour, permute the logic or your data at the right places (left as an exercise). Bogumił’s answer already does the right thing.
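
To make the ordering difference concrete, here is a minimal Python sketch (Python because the question already includes a Python version). It uses a made-up 2×3 table rather than the real file, and the names `table`, `flatten_by_rows`, and `flatten_by_columns` are hypothetical, introduced only for this illustration:

```python
# Hypothetical 2x3 table standing in for the exoplanet data.
table = [
    [1.5, 2.5, 3.5],
    [4.5, 5.5, 6.5],
]


def flatten_by_rows(rows):
    """Walk each row left to right, then move on to the next row."""
    return [str(x) for row in rows for x in row]


def flatten_by_columns(rows):
    """Walk each column top to bottom, then move on to the next column."""
    # zip(*rows) regroups the values so that each tuple is one column.
    return [str(x) for col in zip(*rows) for x in col]


row_order = flatten_by_rows(table)
col_order = flatten_by_columns(table)
```

Both lists contain the same strings; only their sequence differs, and that sequence is what changes when the loops or the data are permuted.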

Put stuff into functions, preallocate where possible, iterate in stride order, use views, and use builtin functions and broadcasting:

function tostringvector(d)
    r, c = size(d)
    result = Vector{String}(undef, r*c)
    v = reshape(result, r, c)
    for (rcol, dcol) in zip(eachcol(v), eachcol(d))
        @inbounds rcol .= string.(dcol)
    end
    return result
end

This can certainly be optimized further.

Or shorter, making use of what DataFrames already provides:

tostringvector(d) = vec(Matrix(string.(d)))
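
For readers comparing against the Python loop in the question, the preallocate-and-fill-by-column pattern of the longer Julia function can be sketched in plain Python. This is an illustration under stated assumptions, not a benchmarked replacement: `to_string_vector` and its `columns` argument (a list of equally long columns) are hypothetical and stand in for the columns of a data frame.

```python
def to_string_vector(columns):
    """Flatten equally long columns into one preallocated list of strings.

    `columns` is a hypothetical list of columns (each a list of values),
    standing in for the columns of a data frame.
    """
    n_cols = len(columns)
    n_rows = len(columns[0]) if n_cols else 0
    # Allocate the whole output once instead of growing it element by element.
    result = [""] * (n_rows * n_cols)
    for c, column in enumerate(columns):
        start = c * n_rows
        # Fill one contiguous block per column via slice assignment.
        result[start:start + n_rows] = [str(value) for value in column]
    return result


# Small made-up example: three columns with two rows each.
example = to_string_vector([[1.5, 2.5], [3, 4], ["a", "b"]])
```

The design choice mirrors the Julia version: the output size is known in advance, so it is allocated once and each column is written into its own contiguous slice.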

Answer