I have a MySQL table whose columns are:
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int unsigned | NO   | PRI | NULL    | auto_increment |
| artist       | text         | YES  |     | NULL    |                |
| title        | text         | YES  |     | NULL    |                |
| album        | text         | YES  |     | NULL    |                |
| duration     | text         | YES  |     | NULL    |                |
| artistlink   | text         | YES  |     | NULL    |                |
| songlink     | text         | YES  |     | NULL    |                |
| albumlink    | text         | YES  |     | NULL    |                |
| instrumental | tinyint(1)   | NO   |     | 0       |                |
| downloaded   | tinyint(1)   | NO   |     | 0       |                |
| filepath     | text         | YES  |     | NULL    |                |
| language     | json         | YES  |     | NULL    |                |
| genre        | json         | YES  |     | NULL    |                |
| style        | json         | YES  |     | NULL    |                |
| artistgender | text         | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
I need to extract data from it and process the data and add the data to a pandas DataFrame.
I know how to extract data from SQL database, and I have already implemented a way to pass the data to DataFrame, but it is extremely slow (about 30 seconds), whereas when I used a flat list of namedtuples the operation is tremendously faster (under 3 seconds).
Specifically, filepath defaults to NULL unless the file has been downloaded (currently none of the songs are downloaded), so Python receives None for it, and I need that value to become ''.
Also, because MySQL's BOOLEAN is just an alias for TINYINT(1), the instrumental and downloaded values arrive in Python as ints, which I need to cast to bool.
Finally, the language, genre, and style fields are tags stored as JSON lists (all currently NULL). When Python receives them they are strings, which I need to turn into lists with json.loads, unless they are None, in which case I need to append empty lists instead.
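For reference, each of the three conversions above can be written as a tiny helper. This is only a sketch, and the function names are mine, not from any library:

```python
import json

def to_bool(v):
    # MySQL TINYINT(1) arrives in Python as an int 0/1
    return bool(v)

def to_text(v):
    # a NULL filepath arrives as None; map it to ''
    return v if v is not None else ''

def to_tags(v):
    # JSON columns arrive as str, or None when the column is NULL
    return json.loads(v) if v is not None else []
```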
This is my inefficient solution to the problem:
import json
import mysql.connector
from pandas import *

fields = {
    "artist": str(),
    "album": str(),
    "title": str(),
    "id": int(),
    "duration": str(),
    "instrumental": bool(),
    "downloaded": bool(),
    "filepath": str(),
    "language": list(),
    "genre": list(),
    "style": list(),
    "artistgender": str(),
    "artistlink": str(),
    "albumlink": str(),
    "songlink": str(),
}

conn = mysql.connector.connect(
    user="Estranger", password=PWD, host="127.0.0.1", port=3306, database="Music"
)
cursor = conn.cursor()

def proper(x):
    return x[0].upper() + x[1:]

def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    dataframes = list()
    for item in data:
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j]) if item[j] is not None else [])
        entry.append(item[11])
        entry += item[12:15]
        df = DataFrame(fields, index=[])
        row = Series(entry, index=df.columns)
        df = df.append(row, ignore_index=True)
        dataframes.append(df)
    songs = concat(dataframes, axis=0, ignore_index=True)
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs
Currently there are 4464 songs in the database and the code takes about 30 seconds to finish.
My SQL table is sorted by artist and title, but I need to re-sort the entries by artist, album, and title for a QTreeWidget; MySQL collates data differently from Python, and I prefer Python's sorting.
In my testing, the df.loc and df = df.append() approaches are slow, while pd.concat is fast, but I don't know how to create one-row DataFrames from flat lists instead of a dictionary, whether there is a faster way than pd.concat, or whether the operations in the for loop can be vectorized.
How can my code be improved?
I figured out how to create a DataFrame with a list of lists and specify column names, and it is tremendously faster, but I still don’t know how to also specify the data types elegantly without the code throwing errors…
def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    for i, item in enumerate(data):
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j]) if item[j] is not None else [])
        entry.append(item[11])
        entry += item[12:15]
        data[i] = entry
    songs = DataFrame(data, columns=list(fields), index=range(len(data)))
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs
And I still need the type conversions, they are already pretty fast, but they don’t look elegant.
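One way to pin down the dtypes after building the frame from a list of lists is DataFrame.astype with a column-to-dtype mapping. This is only a sketch with a single made-up sample row; the JSON-tag columns necessarily stay object dtype because they hold Python lists:

```python
from pandas import DataFrame

columns = ["artist", "album", "title", "id", "duration", "instrumental",
           "downloaded", "filepath", "language", "genre", "style",
           "artistgender", "artistlink", "albumlink", "songlink"]

# one made-up, already-converted row
rows = [["Artist", "Album", "Title", 1, "3:14", True, False, "",
         [], [], [], "", "", "", ""]]

songs = DataFrame(rows, columns=columns).astype({
    "id": "int64",
    "instrumental": "bool",
    "downloaded": "bool",
})
```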
Answer
You could make a list of conversion functions, one for each column:
funcs = [
    str.capitalize,
    str.capitalize,
    str.capitalize,
    int,
    str,
    bool,
    bool,
    lambda v: v if v is not None else '',
    lambda v: json.loads(v) if v is not None else [],
    lambda v: json.loads(v) if v is not None else [],
    lambda v: json.loads(v) if v is not None else [],
    str,
    str,
    str,
    str,
]
Now you can apply the matching conversion function to each field of every row:
for i, item in enumerate(data):
    row = [func(field) for field, func in zip(item, funcs)]
    data[i] = row
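Putting it together, here is a minimal end-to-end sketch; the raw tuple below is made up to stand in for cursor.fetchall() output. One caveat: str.capitalize also lowercases the rest of the string, unlike the question's proper() helper, which only uppercases the first character.

```python
import json
from pandas import DataFrame

maybe_str = lambda v: v if v is not None else ''
maybe_json = lambda v: json.loads(v) if v is not None else []

funcs = [
    str.capitalize, str.capitalize, str.capitalize,  # artist, album, title
    int,                                             # id
    str,                                             # duration
    bool, bool,                                      # instrumental, downloaded
    maybe_str,                                       # filepath
    maybe_json, maybe_json, maybe_json,              # language, genre, style
    str, str, str, str,  # artistgender, artistlink, albumlink, songlink
]

# made-up raw row standing in for one tuple from cursor.fetchall()
data = [("the beatles", "abbey road", "something", 1, "3:03",
         0, 0, None, None, None, None, "male", "al", "sl", "abl")]

rows = [[func(field) for field, func in zip(item, funcs)] for item in data]

columns = ["artist", "album", "title", "id", "duration", "instrumental",
           "downloaded", "filepath", "language", "genre", "style",
           "artistgender", "artistlink", "albumlink", "songlink"]
songs = DataFrame(rows, columns=columns)
```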