I am trying to read one chromosome sequence from a genome file in Python. The format of the genome file is like the following, but with more lines of sequence for each chromosome:
Chr1
ATCGTGTGATGGTGCGTAGATGCTGAT
GCTGATGTGTCGAGCGATGCTGAGTCG
Chr2
TGCGTGATGCTGAGCGATGCTGATGCT
TAGCTGACCACACACCTGTTTTGTAGG
Chr3
CAGTCGTAGCGATGCTGATGATGCTGA
GGTTGGTTGGCGGACCACCATTACTAT
I use the following code to read the whole genome sequence. However, I just want the sequence of one chromosome (e.g. the whole sequence of Chr2). Rather than reading the whole genome and then searching it for Chr2, is there any other way I could do this?
Thank you
genome = []
with open("genome.txt") as f:
    for line in f:
        genome.append(line.rstrip())
Answer
Open the file and read line by line until you find ‘Chr2’.
Then collect all non-empty lines until you reach EOF or the next line beginning with ‘Chr’.
def getgenomes(gfile):
    # Collect sequence lines until the next chromosome header or EOF.
    g = []
    for line in gfile:
        if line.startswith('Chr'):
            break
        if (line := line.strip()):
            g.append(line)
    return g

with open('genome.txt', encoding='utf-8') as gfile:
    genomes = None
    for line in gfile:
        # Exact match, so that 'Chr2' does not also match 'Chr20'.
        if line.strip() == 'Chr2':
            genomes = getgenomes(gfile)
            break

print(genomes)
output:
['TGCGTGATGCTGAGCGATGCTGATGCT', 'TAGCTGACCACACACCTGTTTTGTAGG']
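If you would rather get the chromosome back as one string instead of a list of lines, the same idea can be wrapped in a single function. This is only a minimal sketch: the name `read_chromosome` and its `header_prefix` parameter are made up for illustration, and it assumes that every header line starts with ‘Chr’ and that no sequence line does. It accepts any iterable of lines, so an open file object works as well as the in-memory sample used here.

```python
import io


def read_chromosome(lines, name, header_prefix="Chr"):
    """Return the sequence of chromosome `name` as one string, or None if absent.

    `lines` is any iterable of text lines (for example an open file).
    Header lines are assumed to start with `header_prefix` (an assumption
    about the file layout, taken from the sample in the question).
    """
    parts = []
    found = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(header_prefix):
            if found:
                break  # next header reached: stop reading
            found = (line == name)  # exact match, so 'Chr2' != 'Chr20'
            continue
        if found:
            parts.append(line)
    return "".join(parts) if found else None


# Small in-memory stand-in for genome.txt
sample = io.StringIO("Chr1\nATCG\nChr2\nTGCG\nTAGC\nChr3\nCAGT\n")
print(read_chromosome(sample, "Chr2"))  # TGCGTAGC
```

With a real file you would pass the file object directly, e.g. open genome.txt in a `with` block and call `read_chromosome(f, "Chr2")`. Reading stops at the header after the requested chromosome, so the rest of the file is never read.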