I am trying to create a matrix from a data frame and a list. The list and column 1 of the data frame contain the same strings, however, not all of the strings in the list are in the column 1 and are not in the same order (see example below). I would like to search through the data frame, and print the data in the second column if the string in column 1 matches a string in the list, else print the string in seqList and 0
, NaN
or missing
etc. I thought that pandas
would be good for this as I can compare columns in a data frame using df.equals
, but it reports false
even when the strings are present and should match.
I think this may be because I have more strings in the seqList than in the data frame and they’re not in the same order. I therefore, tried to index the data frame, but my data in column 2 is lost/replaced with NaN
.
List
seqList = ['Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1', 'Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1', 'Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1', 'Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1', 'Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1']
Infile for dataframe
#Infile2: Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1 1 Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1 3 Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1 3
I want to create a new matrix which contains all of the sequences in the list (seqList) and the number of occurrences identified in infile2.
Desired output
#outfile: sequence hits Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1 3 Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1 1 Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1 NaN Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1 3 Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1 NaN
I have loaded infile2 as a dataframe and named columns:
#Create the dataframe from the sequnce hits in the genomes (identified in the occurances file). Occurences=pd.read_csv(infile2, delimiter='t', index_col=False) #Read the input file as a tab separated dataframe. pd.set_option("display.max_colwidth", None) #Ensure that the sequence names are not cut off. Occurences.rename(columns = {list(Occurences)[0]: 'sequence'}, inplace = True) #Name the sequences column Occurences.rename(columns = {list(Occurences)[1]: 'hits'}, inplace = True) #Name the occurences column
I have tried to convert seqList to a data frame and then use .equals (as shown here) but this still reports the match as false:
SeqDataFrame= pd.DataFrame (seqList, columns = ['sequence']) #Load seqList as df result = SeqDataFrame['sequence'].equals(Occurences['sequence']) #Use .equals to compare the sequence columns and report matching print(result) False
I think that the issue is that the order of strings in the sequence column in the occurrences df is not in the same order as seqList. I have therefore tried to index the occurrences data frame using seqList, but this seems to lose all of the data in the hits column.
Occurences.set_index('sequence', inplace=True) Occurences = Occurences.reindex(seqList) print(Occurences) hits sequence Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1 NaN Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1 NaN Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1 NaN Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1 NaN Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1 NaN
I have looked for similar questions, but none seem to have an issue with the order of the columns not matching. And if it is a question specifically about columns not matching, they have reindexed as I have and haven’t lost data. How do I create my desired matrix which contains all of the sequences in seqList and the number of hits identified in the Occurences data frame?
Many thanks in advance
n.b. I have also tried to use pd.merge to merge the list and data frame, but for some reason this creates an empty data frame:
MergedFrames = pd.merge(SeqDataFrame, Occurences, left_on=["sequence"], right_on=['sequence']) print("MergedFrames") print(MergedFrames) MergedFrames Empty DataFrame Columns: [sequence, hits] Index: []
Advertisement
Answer
You can use DataFrame.reindex
:
Occurences.set_index('sequence').reindex(seqList).reset_index()
sequence hits 0 Cand_Eff_1_MLAELSVAFTLAAFALA_rc_1 3.0 1 Cand_Eff_2_MTRFHLILLPLLFSWFSYCFG_1 1.0 2 Cand_Eff_3_MAMSRFVVTLGLCVSASA_rc_1 NaN 3 Cand_Eff_4_MAPYSMVLLGALSILGFGAYA_rc_1 3.0 4 Cand_Eff_5_MPVLQVVVVVVAMAVVKVVMV_rc_1 NaN
If your list can have duplicates just use list(set(seqList))
.