Skip to content
Advertisement

Python 3 – How do I extract data from SQL database and process the data and append to pandas dataframe row by row?

I have a MySQL database, its columns are:

+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int unsigned | NO   | PRI | NULL    | auto_increment |
| artist       | text         | YES  |     | NULL    |                |
| title        | text         | YES  |     | NULL    |                |
| album        | text         | YES  |     | NULL    |                |
| duration     | text         | YES  |     | NULL    |                |
| artistlink   | text         | YES  |     | NULL    |                |
| songlink     | text         | YES  |     | NULL    |                |
| albumlink    | text         | YES  |     | NULL    |                |
| instrumental | tinyint(1)   | NO   |     | 0       |                |
| downloaded   | tinyint(1)   | NO   |     | 0       |                |
| filepath     | text         | YES  |     | NULL    |                |
| language     | json         | YES  |     | NULL    |                |
| genre        | json         | YES  |     | NULL    |                |
| style        | json         | YES  |     | NULL    |                |
| artistgender | text         | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+

I need to extract data from it and process the data and add the data to a pandas DataFrame.

I know how to extract data from SQL database, and I have already implemented a way to pass the data to DataFrame, but it is extremely slow (about 30 seconds), whereas when I used a flat list of namedtuples the operation is tremendously faster (under 3 seconds).

Specifically, filepath is default NULL unless the file is downloaded (currently none of the songs are downloaded), and when Python gets filepath the value will be None, and I need that value become ''.

And because MySQL doesn’t have BOOLEAN type, I need to cast the received ints to bool.

And the language, genre, style fields are tags stored as JSON lists, and they are all currently NULL, when Python gets them they are strings and I need to make them lists using json.loads unless they are None, and if they are None I need to append empty lists instead.

This is my inefficient solution to the problem:

import json
import mysql.connector
from pandas import *

fields = {
    "artist": str(),
    "album": str(),
    "title": str(),
    "id": int(),
    "duration": str(),
    "instrumental": bool(),
    "downloaded": bool(),
    "filepath": str(),
    "language": list(),
    "genre": list(),
    "style": list(),
    "artistgender": str(),
    "artistlink": str(),
    "albumlink": str(),
    "songlink": str(),
}

conn = mysql.connector.connect(
    user="Estranger", password=PWD, host="127.0.0.1", port=3306, database="Music"
)
cursor = conn.cursor()

def proper(x):
    return x[0].upper() + x[1:]

def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    dataframes = list()
    for item in data:
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
        entry.append(item[11])
        entry += item[12:15]
        df = DataFrame(fields, index=[])
        row = Series(entry, index = df.columns)
        df = df.append(row, ignore_index=True)
        dataframes.append(df)
    songs = concat(dataframes, axis=0, ignore_index=True)
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs

Currently there are 4464 songs in the database and the code takes about 30 seconds to finish.

I sorted my SQL database by artist and title and I need to resort the entries by artist, album and title for QTreeWidget, and MySQL sorts data differently from Python and I prefer Python sorting.

In my testing, df.loc and df = df.append() methods are slow, pd.concat is fast, but I really don’t know how to create dataframes with only one row and pass flat lists to dataframe instead of a dictionary, and if there is a faster way than pd.concat, or if operations in the for loop can be vectorized.

How can my code be improved?


I figured out how to create a DataFrame with a list of lists and specify column names, and it is tremendously faster, but I still don’t know how to also specify the data types elegantly without the code throwing errors…

def fetchdata():                                                                          
    cursor.execute("select {} from songs".format(', '.join(list(fields))))                
    data = cursor.fetchall()                                                              
    for i, item in enumerate(data):                                                       
        entry = list(map(proper, item[0:3]))                                              
        entry += [item[3]]                                                                
        for j in range(4, 7):                                                             
            cell = item[j]                                                                
            if isinstance(cell, int):                                                     
                entry.append(bool(cell))                                                  
            elif isinstance(cell, str):                                                   
                entry.append(cell)                                                        
        if item[7] is not None:                                                           
            entry.append(item[7])                                                         
        else:                                                                             
            entry.append('')                                                              
        for j in range(8, 11):                                                            
            entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
        entry.append(item[11])                                                            
        entry += item[12:15]                                                              
        data[i] = entry                                                                   
    songs = DataFrame(data, columns=list(fields), index=range(len(data)))               
    songs.sort_values(['artist', 'album', 'title'], inplace=True)                         
    return songs

And I still need the type conversions, they are already pretty fast, but they don’t look elegant.

Advertisement

Answer

You could make a list of conversion functions for each column:

funcs = [
    str.capitalize,
    str.capitalize,
    str.capitalize,
    int,
    str,
    bool,
    bool,
    lambda v: v if v is not None else '',
    lambda v: json.loads(v) if v is not None else [],
    lambda v: json.loads(v) if v is not None else [],
    lambda v: json.loads(v) if v is not None else [],
    str,
    str,
    str,
    str,
]

Now you can apply the function that converts the value for each field

for i, item in enumerate(data):
    row = [func(field) for field, func in zip(item, funcs)]
    data[i] = row
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement