Skip to content
Advertisement

Trouble subtracting two column values correctly/precisely in pandas dataframe in Python

I’m trying to create a new column in my pandas dataframe which will be the difference of two other columns, but the new column has values that are significantly different what what the differences between the values of the columns are. I have heard that ‘float’ values often don’t subtract precisely, so I have tried to convert the decimal values here to integers by changing the columns’ dtypes to ‘int64’ (as suggested here Pandas Subtract Two Columns Not Working Correctly) and then multiplying each value by 100000:

# Read in data
data = pd.read_csv('/Users/aaron/Downloads/treatment_vs_control.csv')

# Cleaning and preprocessing
data.rename({"names": "Gene"}, axis=1, inplace=True)
columns = data.columns
garbage_filter = columns.str.startswith('names') | columns.str.startswith('0') | columns.str.startswith('1') | 
columns.str.startswith('Gene')
data = data.loc[:,garbage_filter]
scores_filter = columns.str.endswith('scores')
columns = data.columns

scores_filter = columns.str.endswith('scores')
data = data.iloc[:,~scores_filter]

## To create Diff columns correctly, change logFC columns to integer dtype
data = data.astype({'1_logfoldchanges': 'int64', '0_logfoldchanges': 'int64'})
data['1_logfoldchanges'] = data['1_logfoldchanges'] * 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] * 100000
data["diff_logfoldchanges0"] = data['0_logfoldchanges'] - data['1_logfoldchanges']
data["diff_logfoldchanges1"] = data['1_logfoldchanges'] - data['0_logfoldchanges']

data['1_logfoldchanges'] = data['1_logfoldchanges'] / 100000
data['0_logfoldchanges'] = data['0_logfoldchanges'] / 100000
data['diff_logfoldchanges0'] = data['diff_logfoldchanges0'] / 100000
data['diff_logfoldchanges1'] = data['diff_logfoldchanges1'] / 100000
data = data.astype({'1_logfoldchanges': 'float64', '0_logfoldchanges': 'float64'})

data.sort_values('diff_logfoldchanges0', ascending=False, inplace=True)

The values in the new column still do not equal the difference in the two original columns and I haven’t been able to find any questions on this site or others that have been able to help me resolve this. Could someone point out how I could fix this? I would be extremely grateful for any help.

For reference, here is a snapshot of my data with the incorrect difference-column values: enter image description here

EDIT: Here is a a bit of my CSV data too:

names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991

Ideally, I’d like to create a ‘diff_logfoldchanges0’ column that is equal to the values from the ‘0_logfoldchanges’ column minus the values from the ‘1_logfoldchanges’ column. In the CSV data below, I believe that might be “-1.3434665 – -2.952315”, “-15.159286 – -15.13933”, and “-0.015067068 – 0.18173021”.

Advertisement

Answer

pd.read_csv by default uses a fast but less precise way of reading floating point numbers

import pandas as pd
import io

csv = """names,0_scores,0_logfoldchanges,0_pvals,0_pvals_adj,1_scores,1_logfoldchanges,1_pvals,1_pvals_adj,2_scores,2_logfoldchanges,2_pvals,2_pvals_adj,3_scores,3_logfoldchanges,3_pvals,3_pvals_adj,4_scores,4_logfoldchanges,4_pvals,4_pvals_adj,5_scores,5_logfoldchanges,5_pvals,5_pvals_adj,6_scores,6_logfoldchanges,6_pvals,6_pvals_adj,7_scores,7_logfoldchanges,7_pvals,7_pvals_adj,8_scores,8_logfoldchanges,8_pvals,8_pvals_adj
0610005C13Rik,-0.06806567,-1.3434665,0.9457333570044608,0.9996994148075796,-0.06571575,-2.952315,0.9476041278614572,0.9998906553041256,0.17985639,1.9209933,0.8572653106998014,0.9994124851941415,-0.0023527155,0.85980946,0.9981228063933416,0.9993920957240323,0.0021153346,0.08053488,0.9983122084427253,0.9993417421686092,0.07239167,2.6473796,0.9422902189641795,0.9998255096296015,-0.029918168,-18.44805,0.9761323166853361,0.998901292435457,-0.021452557,-18.417543,0.9828846479876278,0.9994515175269552,-0.011279659,-18.393742,0.9910003250967939,0.9994694916208285
0610006L08Rik,-0.015597747,-15.159286,0.9875553033428832,0.9996994148075796,-0.015243248,-15.13933,0.9878381189626457,0.9998906553041256,-0.008116434,-14.795435,0.9935240935555751,0.9994124851941415,-0.0073064035,-14.765995,0.9941703851753109,0.9993920957240323,-0.0068988753,-14.752146,0.9944955375479378,0.9993417421686092,0.100005075,18.888618,0.9203402935001026,0.9998255096296015,-0.004986361,-14.696446,0.9960214758176429,0.998901292435457,-0.0035754263,-14.665947,0.9971472286106732,0.9994515175269552,-0.0018799432,-14.64215,0.9985000232597367,0.9994694916208285
0610009B22Rik,0.7292792,-0.015067068,0.46583086269639506,0.9070814087688549,0.42489842,0.18173021,0.67091072915639,0.9998906553041256,17.370018,1.0877438,1.3918130408174961e-67,6.801929840389262e-67,-6.5684495,-1.237505,5.084194721546539e-11,3.930798968247645e-10,-5.6669636,-0.42557448,1.4535041077956595e-08,5.6533712043729706e-08,-3.5668032,-0.5939982,0.0003613625764821466,0.001766427013499565,-7.15373,-1.7427195,8.445118618740649e-13,4.689924532441606e-12,-2.6011736,-0.66274893,0.009290541915973735,0.05767076032846401,1.7334439,1.2316034,0.08301681426158236,0.3860271115408991"""

data = pd.read_csv(io.StringIO(csv))
print(data["0_logfoldchanges"][0])   # -1.3434665000000001 instead of -1.3434665

This difference is tiny (less than a quadrillionth of the original value), and usually not visible because the display rounds it, so I would not in most contexts call this ‘significant’ (it’s likely to be insignificant in relation to the precision / accuracy of the input data), but does show up if you check your calculation by manually typing the same numbers into the Python interpreter.

To read the values more precisely, use float_precision="round_trip":

data = pd.read_csv(io.StringIO(csv), float_precision="round_trip")

Subtracting now produces the expected values (the same as doing a conventional python subtraction):

difference = data["0_logfoldchanges"] - data["1_logfoldchanges"]
print(difference[0] == -1.3434665 - -2.952315)  # checking first row - True

This is not due to floating point being imprecise as such, but is specific to the way pandas reads CSV files. This is a good guide to floating point rounding. In general, converting to integers will not help, except sometimes when dealing with money or other quantities that have a precise decimal representation.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement