I am trying to generate a box plot. Here is my code, data below:
def loadData(fileName):
    data = pd.read_csv(fileName, quotechar='"')
    cols = data.columns.tolist()
    cols = cols[1:] + [ cols[0] ]
    data = data[cols]
    return data.values

cols={}
cols['close/last']=0
cols['volumne']=1
cols['open']=2
cols['high']=3
cols['low']=4
cols['date']=5
fileName = 'microsoft.csv'

def boxplot():
    data1 = loadData(fileName)
    ithattr1 = cols['high']
    ithattr2 = cols['close/last']
    dataset1 = data1[:,ithattr1]
    dataset2 = data1[:,ithattr2]
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.boxplot([dataset1,dataset2])
    plt.show()

boxplot()
The data is float; I verified this with the print command, whose output is <type 'float'>. On running the code, I get the following error (full stack trace below):
AttributeError: 'numpy.ndarray' object has no attribute 'find'
My data (e.g. in dataset1) looks like this:
[52.21 52.2 52.44 52.65 52.33 51.58 51.38 51.68 51.97 53.4163 54.07 53.1 52.85 53.28 53.485 54.4001 55.39 54.8 56.19 56.78 56.85 55.95 55.96 55.88 55.48 55.35 56.0 56.79 56.245 55.9 55.21 55.1 55.655 55.87 56.1 55.97 ......................................... 27.54 27.66 28.02 28.05 27.97 28.19 28.13]
Output of data.shape = (756,)
Data file format:
2016/01/29,97.3400,64332440.0000,94.7900,97.3400,94.3500
2016/01/28,94.0900,55622370.0000,93.7900,94.5200,92.3900
2016/01/27,93.4200,133059000.0000,96.0400,96.6289,93.3400
2016/01/26,99.9900,71937310.0000,99.9300,100.8800,98.0700
2016/01/25,99.4400,51529980.0000,101.5200,101.5300,99.2100
Stacktrace
Traceback (most recent call last):
  File "plot_curves.py", line 100, in <module>
    boxplot()
  File "plot_curves.py", line 96, in boxplot
    ax.boxplot([dataset1,dataset2])
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3118, in boxplot
    manage_xticks=manage_xticks)
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3480, in bxp
    flier_x, flier_y, **final_flierprops
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3361, in doplot
    return self.plot(*args, **kwargs)
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 1373, in plot
    for line in self._get_lines(*args, **kwargs):
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_base.py", line 304, in _grab_next_args
    for seg in self._plot_args(remaining, kwargs):
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_base.py", line 263, in _plot_args
    linestyle, marker, color = _process_plot_format(tup[-1])
  File "/home/rohit/anaconda/lib/python2.7/site-packages/matplotlib/axes/_base.py", line 85, in _process_plot_format
    if fmt.find('--') >= 0:
AttributeError: 'numpy.ndarray' object has no attribute 'find'
Does anybody have any idea how to resolve it?
Answer
The immediate cause of your problem is that dataset1 and dataset2 are ndarray type, with dtype == object.
Although your values are read in as float type, the values array you return from loadData has to hold every column in a single ndarray. Each row contains both floats and a date string, so numpy falls back to the one dtype that can hold both: object. When you then access a column of that array (at the line dataset1 = data1[:,ithattr1]), the slice keeps dtype object, even though each individual element is still a float. That is why print reports <type 'float'> for an element while the array as a whole is not a float array.
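The dtype change can be reproduced without the full file. The following is a minimal sketch, assuming a reasonably recent pandas and NumPy; it reads two rows from the question's data snippet out of an in-memory buffer (the names `sample`, `names`, `frame`, `values` and `high` are illustrative, not from the original code).

```python
import io

import numpy as np
import pandas as pd

# Two rows copied from the question's data file (which has no header row).
sample = io.StringIO(
    "2016/01/29,97.3400,64332440.0000,94.7900,97.3400,94.3500\n"
    "2016/01/28,94.0900,55622370.0000,93.7900,94.5200,92.3900\n"
)
names = ["date", "close/last", "volume", "open", "high", "low"]
frame = pd.read_csv(sample, names=names, header=None)

values = frame.values   # a single array holding the string and float columns
high = values[:, 4]     # the 'high' column, sliced by position

print(values.dtype)         # dtype of the whole mixed array
print(high.dtype)           # the slice inherits that dtype
print(type(high[0]))        # each element is still a float
print(frame["high"].dtype)  # selecting by name keeps the column's own dtype
```

Comparing the second and fourth printed dtypes shows the difference between slicing the mixed array by position and selecting the column by name from the DataFrame.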
You can get around this in several ways. One is simply to make the arrays into lists: when a list of Python floats is converted back to an array, numpy infers a float dtype rather than object. That is, change
ax.boxplot([dataset1,dataset2])
to
ax.boxplot([list(dataset1),list(dataset2)])
Another is to add lines explicitly setting the type (use the builtin float; the np.float alias was removed in NumPy 1.24):
dataset1 = dataset1.astype(float)
dataset2 = dataset2.astype(float)
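To see what either workaround does to the dtype, here is a small sketch using NumPy only. The array below is a hypothetical stand-in for dataset1, built from the first few values shown in the question; `as_float` and `via_list` are illustrative names.

```python
import numpy as np

# Hypothetical stand-in for dataset1: floats stored in an object array.
dataset1 = np.array([52.21, 52.2, 52.44, 52.65], dtype=object)

# Workaround 1: explicit conversion with the builtin float.
as_float = dataset1.astype(float)

# Workaround 2: go through a list; converting the list back to an array
# lets numpy infer the dtype from the Python floats it contains.
via_list = np.asarray(list(dataset1))

print(dataset1.dtype, as_float.dtype, via_list.dtype)
```

Both routes should end in a float64 array holding the same numbers, so the choice between them is mostly one of readability.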
This is a gotcha when you access pandas dataframes or numpy arrays containing different data types in columns by index. It's pretty hard to debug (it took me a while to get it for this question, and I've seen it before; see the edit history).
However, the way you're handling your data via numerical indices also means you end up having to reorder your columns etc. for convenience in your loadData function. A better way would be to let pandas do all the heavy lifting on types etc.
As an example, I've put your code into what (I think) is a more conventional pandas / Python style. It's a bit shorter and doesn't require the hack of changing the data to a list that I gave above. The code is below, with the output plot after that (using the input data snippet from your question).
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

def loadData(filename,cols):
    data = pd.read_csv(filename, quotechar='"',names=cols,header=None)
    return data

def boxplot(filename,cols):
    data1 = loadData(filename,cols)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.boxplot([data1['high'],data1['close/last']])
    plt.show()

cols=['date','close/last','volume','open','high','low']
filename = 'microsoft.csv'
boxplot(filename,cols)
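If you want to try the name-based approach without a display or the real microsoft.csv, the variant below may help. It is only a sketch under a few assumptions that are not in the original answer: the non-interactive Agg backend is selected, the data comes from an in-memory buffer holding three rows of the question's snippet, and the two columns are passed as one 2D array (matplotlib draws one box per column) instead of a list of Series.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Three rows copied from the question's data snippet (no header row).
sample = io.StringIO(
    "2016/01/29,97.3400,64332440.0000,94.7900,97.3400,94.3500\n"
    "2016/01/28,94.0900,55622370.0000,93.7900,94.5200,92.3900\n"
    "2016/01/27,93.4200,133059000.0000,96.0400,96.6289,93.3400\n"
)
cols = ["date", "close/last", "volume", "open", "high", "low"]
data = pd.read_csv(sample, names=cols, header=None)

# Selecting only numeric columns by name gives a plain float array,
# because the date strings are never mixed in.
values = data[["high", "close/last"]].to_numpy()
print(values.dtype)

fig, ax = plt.subplots()
result = ax.boxplot(values)                 # one box per column of the 2D array
ax.set_xticklabels(["high", "close/last"])  # label the two boxes
fig.savefig(io.BytesIO(), format="png")     # render in memory instead of plt.show()
plt.close(fig)
```

Because the date column is left out of the selection, no type conversion is needed before plotting.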