I have a file (~50,000 lines), text.txt, shown below, which contains some gene info from five individuals (AB, BB, CA, DD, GG). The \t in the file is a tab separator. The file also contains a lot of info that is not useful, and I would like to clean it up. What I need is to extract the species name with its ‘transcript=’ ID, if it has one, and also extract the ‘DD:’ and ‘GG:’ parts.
$ head text.txt
GeneA\tAB:xrbyk | jdnif | otherText\tBB:abdf | jdhfkc | otherDifferentText\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneB\tBB:dfsfg | dfsfvdf | otherDifferent\tCA:zdcfsdf | xfgdfgs | transcript=sdfs.1\tDD:sdfsw.1 type=cds\tGG:fghfg.1 type=cds
GeneC\tAB:dsfsdf | xdvv | otherText1\tBB:xdsd | sdfsdf | otherDifferentText2\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
GeneD\tAB:dfsdf | Asda | transcript=asdasd.2\tCA:bdmf | nfjvks | transcript=aaabb.1\tDD:hudnf.1 type=cds\tGG:jdubf.1 type=cds
and I would like the output to be:
GeneA\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1
GeneB\tCA:transcript=sdfs.1\tDD:sdfsw.1\tGG:fghfg.1
GeneC\tDD:hudnf.1\tGG:jdubf.1
GeneD\tAB:transcript=asdasd.2\tCA:transcript=aaabb.1\tDD:hudnf.1\tGG:jdubf.1
I have been searching and thinking for a whole afternoon already, and my only idea is to tear this file apart by species, keeping the gene name as the first column, then modify each file separately, and finally merge the files back together by gene name. But as you can see, the lines of the file do not necessarily have the same number of fields, so I can’t simply use awk to print a certain column. I’m not sure how I can split them up by species.
I tried to adapt the approach from “How to use sed/grep to extract text between two words?”, but without success. I also read a bit about how to split text in Python (as I’m starting to learn this language), but still can’t figure it out. Could anyone please help? Thanks a lot!
UPDATE TO CLARIFY THE INPUT DATA: In the example shown above, the gene info for each individual is separated by a tab (\t), which means that all the text after the individual name plus colon (e.g. AB:) belongs to that individual (e.g. “xrbyk | jdnif | otherText” for AB). Whether to keep an individual in the final output depends on whether there is “transcript=” information for that individual, except for DD and GG, which are always kept. This is why, in the final output, the 1st line starts with CA and not with AB.
Answer
This solution is a bit long, but should be easy to work with:
#!/usr/bin/env python3
# main.py
import csv
import fileinput
import re


def filter_fields(row):
    output = []
    for field_number, field in enumerate(row, 1):
        if field_number == 1:
            # The first field is the gene name; always keep it
            output.append(field)
        elif "DD:" in field or "GG:" in field:
            # Keep only the ID, dropping the trailing "type=cds"
            output.append(field.split()[0])
        elif "transcript=" in field:
            # Remove stuff from after the colon to the last space
            output.append(re.sub(r":.* ", ":", field))
    return "\t".join(output)


# Read from the files named on the command line, or from stdin if none
reader = csv.reader(fileinput.input(), delimiter="\t")
for row in reader:
    print(filter_fields(row))
How to run it:
# Output to the screen
python3 main.py text.txt

# Output to a file
python3 main.py text.txt > out.txt

# Use as a filter
cat text.txt | python3 main.py
Notes
- In this solution, each line of text is broken into a row of fields.
- The function filter_fields takes each row, decides which fields to keep, and reformats them. It then returns those fields, tab-separated.
- The re.sub(...) call says: delete everything after the colon, up to the last space.
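To see what the two per-field transformations do in isolation, here is a small sketch that applies them to two fields copied from the GeneA row in the question. The variable names are only for illustration and are not part of main.py.

```python
import re

# Two sample fields taken from the GeneA row in the question.
transcript_field = "CA:bdmf | nfjvks | transcript=aaabb.1"
dd_field = "DD:hudnf.1 type=cds"

# The greedy ".*" runs from the first colon to the LAST space in the field,
# so only the individual's prefix and the final token are left.
kept_transcript = re.sub(r":.* ", ":", transcript_field)

# split() with no arguments splits on whitespace; the first token is the ID.
kept_dd = dd_field.split()[0]

print(kept_transcript)  # CA:transcript=aaabb.1
print(kept_dd)          # DD:hudnf.1
```

Note that the regular expression relies on "transcript=..." being the last space-separated token of the field, as it is in the sample data. If other text could follow it, the pattern would need adjusting.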