Short explanation:
If you have duplicate column names in your data, be sure to rename one column when you read the file.
If you have NaN
etc in your data, remove those.
Then merge using correct answer below.
Probably a pretty simple question.
I have two datasets that I read in using pandas.read_csv()
.
My data is in two separate csv.
With the following code:
import mibian import pandas as pd underlying = pd.read_csv("txt1.csv", names=['dt1','price']); options = pd.read_csv("txt2.txt", names=['dt2','ticker','maturity','strike','cP','px','strike','yield','rF','T','rlzd10']); merged = underlying.merge(options, left_on='dt1', right_on='dt2');
My two data heads look like this:
>>> underlying.head(); 0 1 0 20040326 3.579987 1 20040329 3.690494 2 20040330 3.755247 3 20040331 3.719373 4 20040401 3.728671
and
>>> options.head(); 0 1 2 3 4 5 6 7 8 9 10 0 20130628 SVXY 20130817 32.5 call 39.22 32.5 0 0.005 0.136986 0.411224
So my column 0 on either data set is the key i want to merge on, and i want to keep all data from both result sets.
How would I go about doing this? All the examples I am finding online require keys, but I do not have that in my results.
But on the join I get the following errors:
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Applications/Spyder.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile execfile(filename, namespace) File "/Users/jasonmellone/.spyder2/.temp.py", line 12, in <module> merged = underlying.merge(options, left_on='dt1', right_on='dt2',how='outer'); File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/frame.py", line 3723, in merge suffixes=suffixes, copy=copy) File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 40, in merge return op.get_result() File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 197, in get_result result_data = join_op.get_result() File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 722, in get_result return BlockManager(result_blocks, self.result_axes) File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/internals.py", line 1954, in __init__ self._set_ref_locs(do_refs=True) File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/internals.py", line 2091, in _set_ref_locs 'have _ref_locs set' % (block, labels)) AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [dt1], 1 x 372145, dtype: int64] with duplicate items [Index([u'dt1', u'price', u'dt2', u'ticker', u'maturity', u'strike', u'cP', u'px', u'strike', u'yield', u'rF', u'T', u'rlzd10'], dtype='object')] does not have _ref_locs set
I searched through my data sets and there are no duplicates.
Thank you!
Advertisement
Answer
You should still be able to merge on the columns:
merged = underlying.merge(options, left_on='0', right_on='0')
This will perform an inner merge so only the intersection of both datasets, i.e. where the values in column 0
exist in both, if you want all values, then specifcy outer
:
merged = underlying.merge(options, left_on='0', right_on='0', how='outer') In [10]: merged = underlying.merge(options, left_on='0', right_on='0', how='outer') merged Out[10]: 0 1_x 1_y 2 3 4 5 6 7 8 0 20040326 3.579987 NaN NaN NaN NaN NaN NaN NaN NaN 1 20040329 3.690494 NaN NaN NaN NaN NaN NaN NaN NaN 2 20040330 3.755247 NaN NaN NaN NaN NaN NaN NaN NaN 3 20040331 3.719373 NaN NaN NaN NaN NaN NaN NaN NaN 4 20040401 3.728671 NaN NaN NaN NaN NaN NaN NaN NaN 5 20130628 NaN SVXY 20130817 32.5 call 39.22 32.5 0 0.005 9 10 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN 5 0.136986 0.411224 [6 rows x 12 columns]
You would have to rename or move the columns that clashed 1_x
and 1_y
above.
It is probably better to rename the columns to something sensible before hand. When reading the csv you can pass a list of column names:
df = pd.read_csv('data.csv', names=['Id', 'Price'])