I have a dataframe that follows this format:
df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'], 'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'], 'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'], 'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'], 'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'], 'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'], 'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'], 'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})
It is much larger (it has about 1000 genes, i.e., columns). Each number corresponds to an mRNA abundance value.
I need to compare the AC and SCC subtypes for each gene using the Wilcoxon rank sum test. I need to do this for every gene in my dataset, so I essentially need to run the test 1000 times, where group1 holds the mRNA values of the AC subtype for a gene and group2 holds the mRNA values of the SCC subtype for the same gene:
from scipy.stats import ranksums
ranksums(group1, group2)
I need a for loop that runs the rank sum test between the two subtypes (AC and SCC) for each gene (each column is a gene) and collects the results in a list of p-values, one per gene.
How can I achieve this in Python? This is what I have tried, with no luck:
p_vals = []
for i in range(1000):
    new_data = subset.copy()
    permuted_labels = list(subset['subtype'].sample(n=subset.shape[0], replace=False))
    new_data['subtype'] = permuted_labels
    group1 = new_data.loc[new_data.subtype == 'AC']
    group2 = new_data.loc[new_data.subtype == 'SCC']
    ranksums = ranksums(group1, group2)
    p_vals.append(ranksums)
print(p_vals)
I also need to do something similar, but instead of a p-value I need to calculate the fold change (FC) of the mean mRNA abundances between the AC and SCC subtypes for every gene (with the AC value in the numerator of the FC). I need to combine the gene FCs and the p-values from the rank sum test into a single table. I also need to add a column to this table for the corrected p-values, using:
from statsmodels.stats.multitest import fdrcorrection
fdrcorrection(list_of_pvalues, alpha=0.05, method='indep', is_sorted=False)
def geneFC(df, geneColumnName):
    # function to return fold change for every gene in the matrix
    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]
    acGene = ac[geneColumnName]
    sccGene = scc[geneColumnName]
    return acGene.mean()/sccGene.mean()

genes = list(df.columns)  # list of genes from df columns
genes.remove('subtype')  # removes "subtype" from list
fc_values = []  # list of FC values to fill
for gene in genes:  # loops through list of genes
    fc_values.append(geneFC(df, gene))  # adds FC value of gene to list
Answer
I think I have a working solution, though I’m not sure why the p-values it returns are all exactly the same. Is that a property of the data you provided?
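On the identical p-values: the rank sum test only uses the ranks of the pooled values, not the values themselves, so two genes whose AC samples land on the same ranks will give the same statistic and p-value. A quick check along these lines suggests that this is what happens with the sample data. This is only a sketch; the names `ranks` and `ac_rank_sums` are mine and not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
    'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
    'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
    'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
    'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
    'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
    'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
    'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64'],
})

genes = [c for c in df.columns if c != 'subtype']

# Rank the samples within each gene (column), after converting strings to floats
ranks = df[genes].astype(float).rank()

# Sum of the ranks held by the AC samples, one value per gene
ac_rank_sums = ranks[df['subtype'] == 'AC'].sum()
print(ac_rank_sums)
```

If every gene shows the same AC rank sum, as appears to be the case here, identical p-values are expected and say nothing about the code being wrong. Real data with 1000 genes should give a spread of p-values.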
import pandas as pd
from scipy.stats import ranksums

df = pd.DataFrame({
    'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
    'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
    'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
    'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
    'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
    'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
    'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
    'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64'],
})

genes = list(df.columns)  # list of genes from df columns
genes.remove('subtype')  # removes "subtype" from list

# the values are stored as strings, so convert them to floats before testing
data = df[genes].astype(float)
data['subtype'] = df['subtype']

def geneRankSum(df, geneColumnName):
    # function to return the rank sum p-value for a given gene
    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]
    acGene = ac[geneColumnName]
    sccGene = scc[geneColumnName]
    return ranksums(acGene, sccGene).pvalue

pvalues = []  # list of pvalues to fill
for gene in genes:  # loops through list of genes
    pvalues.append(geneRankSum(data, gene))  # adds pvalue of gene to list

def geneFC(df, geneColumnName):
    # function to return fold change (AC mean / SCC mean) for a given gene
    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]
    acGene = ac[geneColumnName]
    sccGene = scc[geneColumnName]
    return acGene.mean()/sccGene.mean()

fc_values = []  # list of FC values to fill
for gene in genes:  # loops through list of genes
    fc_values.append(geneFC(data, gene))  # adds FC value of gene to list

print(pvalues)
print(fc_values)
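To get the single table asked for in the question, the FC values, the raw p-values and the FDR-corrected p-values can be collected in one DataFrame. The following is a minimal sketch of one way to do it, not the only one. The column names (`FC`, `pvalue`, `pvalue_fdr`, `significant`) and the variable `results` are my own choices. It relies on `fdrcorrection` returning a pair: a boolean array of rejected hypotheses and the array of corrected p-values.

```python
import pandas as pd
from scipy.stats import ranksums
from statsmodels.stats.multitest import fdrcorrection

df = pd.DataFrame({
    'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
    'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
    'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
    'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
    'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
    'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
    'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
    'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64'],
})

genes = [c for c in df.columns if c != 'subtype']
data = df[genes].astype(float)  # values arrive as strings in the sample data
is_ac = df['subtype'] == 'AC'
is_scc = df['subtype'] == 'SCC'

rows = []
for gene in genes:
    ac = data.loc[is_ac, gene]
    scc = data.loc[is_scc, gene]
    rows.append({
        'gene': gene,
        'FC': ac.mean() / scc.mean(),         # AC mean in the numerator
        'pvalue': ranksums(ac, scc).pvalue,   # Wilcoxon rank sum test
    })

results = pd.DataFrame(rows).set_index('gene')

# Benjamini-Hochberg correction across all genes
rejected, corrected = fdrcorrection(
    results['pvalue'].to_numpy(), alpha=0.05, method='indep', is_sorted=False
)
results['pvalue_fdr'] = corrected
results['significant'] = rejected

print(results)
```

The correction is applied once to the full list of p-values rather than inside the loop, because the adjustment depends on how many tests were run in total.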