I have a MySQL database, its columns are:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| id | int unsigned | NO | PRI | NULL | auto_increment |
| artist | text | YES | | NULL | |
| title | text | YES | | NULL | |
| album | text | YES | | NULL | |
| duration | text | YES | | NULL | |
| artistlink | text | YES | | NULL | |
| songlink | text | YES | | NULL | |
| albumlink | text | YES | | NULL | |
| instrumental | tinyint(1) | NO | | 0 | |
| downloaded | tinyint(1) | NO | | 0 | |
| filepath | text | YES | | NULL | |
| language | json | YES | | NULL | |
| genre | json | YES | | NULL | |
| style | json | YES | | NULL | |
| artistgender | text | YES | | NULL | |
+--------------+--------------+------+-----+---------+----------------+
I need to extract data from it and process the data and add the data to a pandas DataFrame.
I know how to extract data from SQL database, and I have already implemented a way to pass the data to DataFrame, but it is extremely slow (about 30 seconds), whereas when I used a flat list of namedtuples the operation is tremendously faster (under 3 seconds).
Specifically, filepath is default NULL unless the file is downloaded (currently none of the songs are downloaded), and when Python gets filepath the value will be None, and I need that value become ''
.
And because MySQL doesn’t have BOOLEAN type, I need to cast the received int
s to bool
.
And the language, genre, style fields are tags stored as JSON lists, and they are all currently NULL, when Python gets them they are strings and I need to make them list
s using json.loads
unless they are None, and if they are None I need to append empty lists instead.
This is my inefficient solution to the problem:
import json
import mysql.connector
from pandas import *
fields = {
"artist": str(),
"album": str(),
"title": str(),
"id": int(),
"duration": str(),
"instrumental": bool(),
"downloaded": bool(),
"filepath": str(),
"language": list(),
"genre": list(),
"style": list(),
"artistgender": str(),
"artistlink": str(),
"albumlink": str(),
"songlink": str(),
}
conn = mysql.connector.connect(
user="Estranger", password=PWD, host="127.0.0.1", port=3306, database="Music"
)
cursor = conn.cursor()
def proper(x):
return x[0].upper() + x[1:]
def fetchdata():
cursor.execute("select {} from songs".format(', '.join(list(fields))))
data = cursor.fetchall()
dataframes = list()
for item in data:
entry = list(map(proper, item[0:3]))
entry += [item[3]]
for j in range(4, 7):
cell = item[j]
if isinstance(cell, int):
entry.append(bool(cell))
elif isinstance(cell, str):
entry.append(cell)
if item[7] is not None:
entry.append(item[7])
else:
entry.append('')
for j in range(8, 11):
entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
entry.append(item[11])
entry += item[12:15]
df = DataFrame(fields, index=[])
row = Series(entry, index = df.columns)
df = df.append(row, ignore_index=True)
dataframes.append(df)
songs = concat(dataframes, axis=0, ignore_index=True)
songs.sort_values(['artist', 'album', 'title'], inplace=True)
return songs
Currently there are 4464 songs in the database and the code takes about 30 seconds to finish.
I sorted my SQL database by artist and title and I need to resort the entries by artist, album and title for QTreeWidget, and MySQL sorts data differently from Python and I prefer Python sorting.
In my testing, df.loc
and df = df.append()
methods are slow, pd.concat
is fast, but I really don’t know how to create dataframes with only one row and pass flat lists to dataframe instead of a dictionary, and if there is a faster way than pd.concat
, or if operations in the for loop can be vectorized.
How can my code be improved?
I figured out how to create a DataFrame with a list of lists and specify column names, and it is tremendously faster, but I still don’t know how to also specify the data types elegantly without the code throwing errors…
def fetchdata():
cursor.execute("select {} from songs".format(', '.join(list(fields))))
data = cursor.fetchall()
for i, item in enumerate(data):
entry = list(map(proper, item[0:3]))
entry += [item[3]]
for j in range(4, 7):
cell = item[j]
if isinstance(cell, int):
entry.append(bool(cell))
elif isinstance(cell, str):
entry.append(cell)
if item[7] is not None:
entry.append(item[7])
else:
entry.append('')
for j in range(8, 11):
entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
entry.append(item[11])
entry += item[12:15]
data[i] = entry
songs = DataFrame(data, columns=list(fields), index=range(len(data)))
songs.sort_values(['artist', 'album', 'title'], inplace=True)
return songs
And I still need the type conversions, they are already pretty fast, but they don’t look elegant.
Advertisement
Answer
You could make a list of conversion functions for each column:
funcs = [
str.capitalize,
str.capitalize,
str.capitalize,
int,
str,
bool,
bool,
lambda v: v if v is not None else '',
lambda v: json.loads(v) if v is not None else [],
lambda v: json.loads(v) if v is not None else [],
lambda v: json.loads(v) if v is not None else [],
str,
str,
str,
str,
]
Now you can apply the function that converts the value for each field
for i, item in enumerate(data):
row = [func(field) for field, func in zip(item, funcs)]
data[i] = row