Is there any way to make this script faster? I’m using one file to compare another file to print lines, if second column are equal.
JavaScript
x
21
21
1
import csv
2
output =[]
3
a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
4
list1 = a.readlines()
5
reader1 = a.read()
6
b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
7
list2 = b.readlines()
8
reader2 = b.read()
9
10
f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')
11
12
for line1 in list1:
13
separar = line1.split("t")
14
gene = separar[2]
15
for line2 in list2:
16
separar2 = line2.split("t")
17
gene2 = separar2[2]
18
if gene == gene2:
19
print line1
20
f3.write(line1)
21
Input example (for both files):
JavaScript
1
6
1
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
2
3
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
4
5
1 14107321 rs187821037 C T 100 PASS AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout
6
The command line below works equally for same purpose in bash:
JavaScript
1
2
1
awk 'FNR==NR {a[$3]; next} $3 in a' Neandertais.vcf Phase1_missing.vcf > teste.vcf
2
How can I improve this Python script?
Advertisement
Answer
If you store your lines in dictionaries that are keyed by the column that you are interested in, you can easily use Python’s built-in set functions (which run at C speed) to find the matching lines. I tested a slightly modified version of this (filenames changed, and changed split('t')
to split()
because of stackoverflow formatting) and it seems to work fine:
JavaScript
1
35
35
1
import collections
2
3
# Use 'rb' to open files
4
5
infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
6
infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
7
outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'
8
9
def readfile(fname):
10
'''
11
Read in a file and return a dictionary of lines, keyed by the item in the second column
12
'''
13
results = collections.defaultdict(list)
14
# Read in binary mode -- it's quicker
15
with open(fname, 'rb') as f:
16
for line in f:
17
parts = line.split("t")
18
if not parts:
19
continue
20
gene = parts[2]
21
results[gene].append(line)
22
return results
23
24
dict1 = readfile(infn1)
25
dict2 = readfile(infn2)
26
27
with open(outfn, 'wb') as outf:
28
# Find keys that appear in both files
29
for key in set(dict1) & set(dict2):
30
# For these keys, print all the matching
31
# lines in the first file
32
for line in dict1[key]:
33
print(line.rstrip())
34
outf.write(line)
35