‘numpy.ndarray’ object has no attribute ‘find’ while trying to generate boxplot?

I am trying to generate a box plot. Here is my code, data below:

def loadData(fileName):
    data = pd.read_csv(fileName, quotechar='"')
    cols = data.columns.tolist()

    cols = cols[1:] + [ cols[0] ]
    data = data[cols]
    return data.values

cols={}
cols['close/last']=0
cols['volume']=1
cols['open']=2
cols['high']=3
cols['low']=4
cols['date']=5

fileName = 'microsoft.csv'

def boxplot():
    data1 = loadData(fileName)
    ithattr1 = cols['high']
    ithattr2 = cols['close/last']
    dataset1 = data1[:,ithattr1]
    dataset2 = data1[:,ithattr2]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.boxplot([dataset1,dataset2])
    plt.show()


boxplot()

The data is float, which I verified with a print command: its output is <type 'float'>. Running the code produces the following error (full stack trace below):

AttributeError: 'numpy.ndarray' object has no attribute 'find'

My data (e.g. in dataset1) looks like this

[52.21 52.2 52.44 52.65 52.33 51.58 51.38 51.68 51.97 53.4163 54.07 53.1
 52.85 53.28 53.485 54.4001 55.39 54.8 56.19 56.78 56.85 55.95 55.96 55.88
 55.48 55.35 56.0 56.79 56.245 55.9 55.21 55.1 55.655 55.87 56.1 55.97
.........................................
 27.54 27.66 28.02 28.05 27.97 28.19 28.13]

output of data.shape = (756,)

Data file format:

2016/01/29,97.3400,64332440.0000,94.7900,97.3400,94.3500
2016/01/28,94.0900,55622370.0000,93.7900,94.5200,92.3900
2016/01/27,93.4200,133059000.0000,96.0400,96.6289,93.3400
2016/01/26,99.9900,71937310.0000,99.9300,100.8800,98.0700
2016/01/25,99.4400,51529980.0000,101.5200,101.5300,99.2100

Stacktrace

Traceback (most recent call last):
  File "plot_curves.py", line 100, in <module>
    boxplot()
  File "plot_curves.py", line 96, in boxplot
    ax.boxplot([dataset1,dataset2])
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3118, in boxplot
    manage_xticks=manage_xticks)
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3480, in bxp
    flier_x, flier_y, **final_flierprops
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3361, in doplot
    return self.plot(*args, **kwargs)
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 1373, in plot
    for line in self._get_lines(*args, **kwargs):
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_base.py", line 304, in _grab_next_args
    for seg in self._plot_args(remaining, kwargs):
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_base.py", line 263, in _plot_args
    linestyle, marker, color = _process_plot_format(tup[-1])
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_base.py", line 85, in _process_plot_format
    if fmt.find('--') >= 0:
AttributeError: 'numpy.ndarray' object has no attribute 'find'

Does anybody have any idea how to resolve it?

Answer

The immediate cause of your problem is that dataset1 and dataset2 are ndarrays with dtype == object.

Although pandas reads your values in as floats, the array you return from loadData (data.values) has dtype object: the date column holds strings, and a single NumPy array needs one dtype that can hold every column, so everything is coerced to the most general common type, object. Slicing out a column (at the line dataset1 = data1[:,ithattr1]) keeps that dtype, even though each individual element is still a float.
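To see the coercion for yourself, here is a minimal sketch. It assumes two rows copied from the question's data snippet and the column names from the question's cols dictionary; the in-memory csv_text is just a stand-in for microsoft.csv.

```python
import io

import numpy as np
import pandas as pd

# Two rows in the same layout as the question's file (date first, then floats).
csv_text = (
    "2016/01/29,97.3400,64332440.0000,94.7900,97.3400,94.3500\n"
    "2016/01/28,94.0900,55622370.0000,93.7900,94.5200,92.3900\n"
)
names = ["date", "close/last", "volume", "open", "high", "low"]
df = pd.read_csv(io.StringIO(csv_text), names=names, header=None)

# Per column, pandas has already parsed the numbers as floats.
high_dtype = df["high"].dtype

# .values has to fit every column into one array, so the date strings
# force dtype=object for the whole array.
values = df.values
column = values[:, 4]  # the 'high' column, sliced by position

print(high_dtype)        # dtype of the single column
print(values.dtype)      # dtype of the combined array
print(type(column[0]))   # the individual elements are still floats
```

So the print in the question is telling the truth about each element; it is the container's dtype that trips up matplotlib.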

You can get around this in several ways. One is simply to convert the arrays to lists, so that the values are handled as plain Python floats rather than as an object array, i.e. change

ax.boxplot([dataset1,dataset2])

to

ax.boxplot([list(dataset1),list(dataset2)])

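Another is to add lines explicitly setting the type (use the builtin float here; np.float was only an alias for it and has been removed from recent NumPy releases):

dataset1 = dataset1.astype(float)
dataset2 = dataset2.astype(float)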
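As a quick check of what the conversion does, here is a small sketch. The arrays are hypothetical stand-ins for columns sliced out of the object array, not your actual data.

```python
import numpy as np

# Hypothetical stand-in for a numeric column sliced out of an object array.
dataset = np.array([52.21, 52.2, 52.44], dtype=object)
converted = dataset.astype(float)

# A column of date strings cannot be converted the same way.
dates = np.array(["2016/01/29", "2016/01/28"], dtype=object)
try:
    dates.astype(float)
    dates_converted = True
except ValueError:
    dates_converted = False

print(dataset.dtype, converted.dtype, dates_converted)
```

The conversion only works column by column on the numeric columns, which is why it has to happen after slicing rather than on the whole array.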

This is a gotcha when you index by position into pandas DataFrames or NumPy arrays whose columns hold different data types. It’s pretty hard to debug (it took me a while to spot it for this question, and I’ve seen it before; see the edit history).


However, handling your data via numerical indices also means you end up reordering your columns for convenience in your loadData function. A better way would be to let pandas do the heavy lifting on types and column lookup.

As an example, I’ve rewritten your code in what (I think) is a more conventional pandas/Python style. It’s a bit shorter and doesn’t need the list-conversion hack I gave above. The code is below, followed by the output plot (using the input data snippet from your question).

import matplotlib.pyplot as plt
import pandas as pd

def loadData(filename, cols):
    # The file has no header row, so supply the column names explicitly
    data = pd.read_csv(filename, quotechar='"', names=cols, header=None)
    return data

def boxplot(filename, cols):
    data1 = loadData(filename, cols)

    fig = plt.figure()
    ax = fig.add_subplot(111)

    # Select columns by name; each one keeps its own float dtype
    ax.boxplot([data1['high'], data1['close/last']])
    plt.show()

# Column names in the order they appear in the file
cols = ['date', 'close/last', 'volume', 'open', 'high', 'low']
filename = 'microsoft.csv'

boxplot(filename, cols)

Output

boxplot of provided data
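If you still want a NumPy array at the end, one option is to pick the numeric columns by name first and convert only those. This is a minimal sketch, assuming three rows copied from the question's snippet and an in-memory csv_text standing in for microsoft.csv.

```python
import io

import pandas as pd

# Rows copied from the question's data snippet.
csv_text = (
    "2016/01/29,97.3400,64332440.0000,94.7900,97.3400,94.3500\n"
    "2016/01/28,94.0900,55622370.0000,93.7900,94.5200,92.3900\n"
    "2016/01/27,93.4200,133059000.0000,96.0400,96.6289,93.3400\n"
)
cols = ["date", "close/last", "volume", "open", "high", "low"]
df = pd.read_csv(io.StringIO(csv_text), names=cols, header=None)

# Select only the numeric columns by name before converting to an array.
numeric = df[["high", "close/last"]].to_numpy()

print(df.dtypes)       # one dtype per column
print(numeric.dtype)   # dtype of the array built from numeric columns only
print(numeric.shape)
```

Because the date column never enters the array, there is nothing to force the result to object dtype.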