Python: How to encode DNA sequence using binary values?

Question

I would like to convert a file containing a few DNA sequences into binary values, as follows:

FileA.txt

Desired output

I tried the code below, but the resulting bin output file does not contain the output I want. Can anyone help me?

Code

Accepted Answer

Do you want ASCII output or binary? The code below gives you what you show in your post (though on a single line; the code needs to be modified to keep newlines).

    import sys

    if len(sys.argv) != 2:
        sys.stderr.write('Usage: {} <file>\n'.format(sys.argv[0]))
        sys.exit(1)

    # Assumes the file contains only DNA letters and newlines.
    sequence = ''
    with open(sys.argv[1]) as infile:
        for line in infile:
            sequence += line.strip().upper()

    sequence = sequence.replace('A', '1000')
    sequence = sequence.replace('C', '0100')
    sequence = sequence.replace('G', '0010')
    sequence = sequence.replace('T', '0001')

    # The output is ASCII text ('0' and '1' characters), so write in text mode.
    with open(sys.argv[1] + '.bin', 'w') as outfile:
        outfile.write(sequence)

EDIT: This creates a binary file where each nucleotide is a byte and the newlines are preserved in binary format.

    import sys

    if len(sys.argv) != 2:
        sys.stderr.write('Usage: {} <file>\n'.format(sys.argv[0]))
        sys.exit(1)

    # Assumes the file contains only uppercase A, C, G, T and newlines.
    codes = {'A': 0b1000, 'C': 0b0100, 'G': 0b0010, 'T': 0b0001, '\n': 0b1010}
    encoded = bytearray()
    with open(sys.argv[1]) as infile:
        while True:
            char = infile.read(1)
            if not char:
                break  # end of file
            encoded.append(codes[char])

    with open(sys.argv[1] + '.bin', 'wb') as outfile:
        outfile.write(encoded)

    # Convert the binary file back to text and print the resulting sequence.
    sequence = b''
    with open('fileA.txt.bin', 'rb') as test_bin:
        for line in test_bin:
            line = line.replace(bytes([0b1000]), b'1000')
            line = line.replace(bytes([0b0100]), b'0100')
            line = line.replace(bytes([0b0010]), b'0010')
            line = line.replace(bytes([0b0001]), b'0001')
            line = line.replace(bytes([0b1010]), b'\n')
            sequence += line
    # with open('outputVerify.txt', 'wb') as output_verify:
    #     output_verify.write(sequence)
    print(sequence.decode('ascii'))

    # Show the contents of the binary file byte by byte.
    # Note that byte 6 is the newline character 0b1010.
    with open('fileA.txt.bin', 'rb') as test_bin:
        i = 0
        while True:
            b = test_bin.read(1)
            if not b:
                break  # end of file
            i += 1
            print('byte: {} is {:04b} and has decimal representation: {}'.format(i, ord(b), ord(b)))

Edit 2 (moving my comment to the answer body): If this isn’t for an assignment or to interface with someone’s software, I recommend encoding your nucleotides as 0b00, 0b01, 0b10, and 0b11 to save time and space. You can still use the 4-bit 0b1010 newline character to separate nucleotide sequences.
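To make the 2-bit suggestion concrete, here is a minimal sketch that packs four nucleotides into each byte. Several things in it are assumptions rather than part of the answer: the mapping A=0b00, C=0b01, G=0b10, T=0b11 (the answer lists the codes but not which base gets which), and the helper names `pack_dna` and `unpack_dna`. It also handles a single sequence and expects the caller to remember its length. In a tightly packed stream, with this mapping, the bit pattern 1010 would read the same as two consecutive G's, so a separator needs extra bookkeeping that is left out here.

```python
# Hypothetical mapping: the answer gives the four 2-bit codes but does not
# say which nucleotide gets which one.
CODES = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
BASES = 'ACGT'  # index in this string == 2-bit code


def pack_dna(seq):
    """Pack a DNA string into bytes, four nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so the unused bits are trailing zeros.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_dna(data, length):
    """Recover `length` nucleotides from bytes produced by pack_dna."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    # Drop the padding that pack_dna added to the last byte.
    return ''.join(bases[:length])


packed = pack_dna('ACGTAC')
print(packed.hex())
print(unpack_dna(packed, 6))
```

Packed this way, six nucleotides take two bytes instead of six. The price is that the sequence length has to be stored somewhere, because the padding bits in the last byte are indistinguishable from trailing A's.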

Answer