I would like to convert a file that contains a few DNA sequences into binary values, using the following mapping:
A=1000 C=0100 G=0010 T=0001
FileA.txt
CCGAT
GCTTA
Desired output
01000100001010000001
00100100000100011000
I have tried using the code below to solve my problem, but the .bin output file does not contain my desired output. Can anyone help me?
Code
import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: {} <nucleotide file>\n'.format(sys.argv[0]))
    sys.exit()

# assumes the file only contains dna and newlines
sequence = ''
for line in open(sys.argv[1]):
    sequence += line.strip().upper()

sequence = sequence.replace('A', chr(0b1000))
sequence = sequence.replace('C', chr(0b0100))
sequence = sequence.replace('G', chr(0b0010))
sequence = sequence.replace('T', chr(0b0001))

outfile = open(sys.argv[1] + '.bin', 'wb')
outfile.write(bytearray(sequence, encoding = 'utf-8'))
Answer
Do you want ASCII output or binary? The code below gives you what you show in your post, though on a single line; it needs to be modified to keep newlines.
import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: {} <nucleotide file>\n'.format(sys.argv[0]))
    sys.exit()

# assumes the file only contains dna and newlines
sequence = ''
with open(sys.argv[1]) as infile:
    for line in infile:
        sequence += line.strip().upper()

# replace each nucleotide with its 4-character ASCII code
sequence = sequence.replace('A', '1000')
sequence = sequence.replace('C', '0100')
sequence = sequence.replace('G', '0010')
sequence = sequence.replace('T', '0001')

# the output is text ('0' and '1' characters), so open the file in text mode
with open(sys.argv[1] + '.bin', 'w') as outfile:
    outfile.write(sequence)
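As noted above, that code joins everything onto one line. A minimal sketch of one way to keep one output line per input line, assuming the same ASCII mapping from the question; `CODES` and `encode_lines` are hypothetical names used only for this illustration:

```python
# Mapping taken from the question; str.translate swaps each base in one pass.
CODES = str.maketrans({'A': '1000', 'C': '0100', 'G': '0010', 'T': '0001'})


def encode_lines(text):
    """Encode each non-empty line separately so line breaks are preserved."""
    return '\n'.join(
        line.strip().upper().translate(CODES)
        for line in text.splitlines()
        if line.strip()
    )


print(encode_lines('CCGAT\nGCTTA\n'))
```

For the sample file this prints the two lines shown under "Desired output". Characters outside A, C, G and T pass through unchanged here, so input validation would have to be added separately.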
EDIT: This creates a binary file in which each nucleotide is stored as one byte and the newlines are preserved in binary form.
import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: {} <nucleotide file>\n'.format(sys.argv[0]))
    sys.exit()

# assumes the file only contains dna and newlines
encoding = {'A': 0b1000, 'C': 0b0100, 'G': 0b0010, 'T': 0b0001, '\n': 0b1010}
newbytearray = bytearray()
with open(sys.argv[1]) as infile:
    while True:
        char = infile.read(1)
        if not char:
            break
        newbytearray.append(encoding[char])

binname = sys.argv[1] + '.bin'
with open(binname, 'wb') as outfile:
    outfile.write(newbytearray)

# Converts the binary file back to text and prints the resulting sequence.
sequence = ''
with open(binname, 'rb') as testBin:
    for line in testBin:
        line = line.replace(bytes([0b1000]), b'1000')
        line = line.replace(bytes([0b0100]), b'0100')
        line = line.replace(bytes([0b0010]), b'0010')
        line = line.replace(bytes([0b0001]), b'0001')
        line = line.replace(bytes([0b1010]), b'\n')
        sequence += line.decode('ascii')
print(sequence)

# Shows the data of the binary file. Note that byte 6 is the newline character 0b1010.
with open(binname, 'rb') as testBin:
    i = 0
    while True:
        b = testBin.read(1)
        if not b:
            break  # end of file
        i += 1
        print('byte: ' + str(i) + ' is ' + '{0:04b}'.format(ord(b)) +
              ' and has decimal representation: ' + str(ord(b)))
Edit 2 (moving my comment to the answer body): If this isn’t for an assignment or to interface with someone’s software, I recommend encoding your nucleotides as 0b00, 0b01, 0b10, and 0b11 to save time and space. You can still use the 4-bit 0b1010 newline character to separate nucleotide sequences.
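A rough sketch of what the 2-bit idea could look like, packing four nucleotides into each byte. The assignment of codes to bases (A=00, C=01, G=10, T=11) and the names `pack` and `unpack` are assumptions made for illustration. This sketch also passes the sequence length separately instead of using the 0b1010 separator, because a short final chunk has to be padded and the padding would otherwise decode as extra bases.

```python
# Assumed assignment; any one-to-one mapping of bases to 2-bit codes would work.
TWO_BIT = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
BASES = 'ACGT'  # index in this string == 2-bit code


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | TWO_BIT[base]
        # Left-align a short final chunk; the padding bits are zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return ''.join(bases[:length])


packed = pack('CCGAT')
print(len(packed), unpack(packed, 5))
```

Here the five-base sequence fits in two bytes, compared with five bytes in the one-byte-per-nucleotide format above.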