
Pandas dataframe manipulation/re-sizing of a single-column count file

I have a file that looks like this:

gRNA_A
gene_a
140626
gene_b
227598
gene_c
115781
gRNA_B
gene_a
125003
gene_b
102000
gene_c
200300

I want to read this into a pandas dataframe and re-shape it so that it looks like this:

        gene_a gene_b gene_c
gRNA_A  140626 227598 115781
gRNA_B  125003 102000 200300

Is this possible? If so, how?

Notes: the file will not always be this size, so the solution needs to be size-independent. The input will be at most ~200 gRNAs x 20 genes. The gRNA lines will always look like gRNA_somelettercombo, but the genes will not be named gene_lettercombo; each will be the name of an actual gene (like GAPDH, ACTB, etc.).

Answer

You need to write a small parser for your custom format: a line starting with gRNA_ opens a new group, and within a group the lines alternate between a gene name (the key) and its count (the value):

import pandas as pd

d = {}
current_gRNA = None
gene = None

with open('gRNA.txt') as f:
    for line in f:                    # iterate over lines
        line = line.strip()
        if not line:                  # skip blank lines
            continue
        if line.startswith('gRNA_'):  # start new group
            current_gRNA = line
            d[current_gRNA] = {}
        else:
            if gene is not None:      # even line of a group = count
                d[current_gRNA][gene] = int(line)
                gene = None
            else:                     # odd line of a group = gene name
                gene = line

# one row per gRNA, one column per gene
df = pd.DataFrame.from_dict(d, orient='index')

Output:

        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300
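
If you want to reuse or test the parsing step, the same logic can be wrapped in a function that accepts any iterable of lines (an open file, a list, a StringIO). This is only a sketch: the name parse_counts, the prefix parameter and the error messages are made up for illustration, and it assumes no gene name starts with gRNA_. It also raises an error on a truncated file instead of silently dropping a gene that has no count.

```python
from io import StringIO


def parse_counts(lines, prefix="gRNA_"):
    """Parse the one-value-per-line format into {gRNA: {gene: count}}."""
    counts = {}
    current = None  # gRNA group currently being filled
    gene = None     # gene name still waiting for its count
    for raw in lines:
        line = raw.strip()
        if not line:                      # skip blank lines
            continue
        if line.startswith(prefix):       # header line starts a new group
            if gene is not None:
                raise ValueError(f"gene {gene!r} has no count")
            current = line
            counts[current] = {}
        elif current is None:             # gene/count before any header
            raise ValueError(f"data before first header: {line!r}")
        elif gene is None:                # odd line of a group = gene name
            gene = line
        else:                             # even line of a group = count
            counts[current][gene] = int(line)
            gene = None
    if gene is not None:                  # file ended right after a gene name
        raise ValueError(f"gene {gene!r} has no count")
    return counts


# Small in-memory sample using real gene names, as described in the question
sample = StringIO(
    "gRNA_A\nGAPDH\n140626\nACTB\n227598\n"
    "gRNA_B\nGAPDH\n125003\nACTB\n102000\n"
)
table = parse_counts(sample)
print(table["gRNA_B"]["ACTB"])  # prints 102000
```

The returned dict has the same shape as d above, so pd.DataFrame.from_dict(table, orient='index') gives the gRNA x gene table. For a file on disk, pass the open file object: parse_counts(open('gRNA.txt')).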
User contributions licensed under: CC BY-SA