
Is there any way to convert specific text data to CSV format and set header names in Python?

I have a dataset in a text file in the format shown below.

The dataset is available at https://drive.google.com/file/d/1RqU2s0dqjd60dcYlxEJ8vnw9_z2fWixd/view?usp=sharing

PMID- 20301691
STAT- Publisher
DA  - 20100320
DRDT- 20210311
CTDT- 20000204
PB  - University of Washington, Seattle
DP  - 1993
TI  - Classic Galactosemia and Clinical Variant Galactosemia
BTI - GeneReviews((R))
AB  - CLINICAL CHARACTERISTICS: The term "galactosemia" refers to disorders of
      galactose metabolism that include classic galactosemia, clinical variant
      galactosemia, and biochemical variant galactosemia (not covered in this chapter).
      This GeneReview focuses on: Classic galactosemia, which can result in
      life-threatening complications including feeding problems, failure to thrive,
      hepatocellular damage, bleeding, and E coli sepsis in untreated infants. If a
      lactose-restricted diet is provided during the first ten days of life, the
      neonatal signs usually quickly resolve and the complications of liver failure,
      sepsis, and neonatal death are prevented; however, despite adequate treatment
      from an early age, children with classic galactosemia remain at increased risk
      for developmental delays, speech problems (termed childhood apraxia of speech and
      dysarthria), and abnormalities of motor function. Almost all females with classic
      galactosemia manifest hypergonadatropic hypogonadism or premature ovarian
      insufficiency (POI). Clinical variant galactosemia, which can result in
      life-threatening complications including feeding problems, failure to thrive,
      hepatocellular damage including cirrhosis, and bleeding in untreated infants.
      This is exemplified by the disease that occurs in African Americans and native
      Africans in South Africa. Persons with clinical variant galactosemia may be
      missed with newborn screening as the hypergalactosemia is not as marked as in
      classic galactosemia and breath testing is normal. If a lactose-restricted diet
      is provided during the first ten days of life, the severe acute neonatal
      complications are usually prevented. African Americans with clinical variant
      galactosemia and adequate early treatment do not appear to be at risk for
      long-term complications, including POI. DIAGNOSIS/TESTING: The diagnosis of
      classic galactosemia and clinical variant galactosemia is established by
      detection of elevated erythrocyte galactose-1-phosphate concentration, reduced
      erythrocyte galactose-1-phosphate uridylyltranserase (GALT) enzyme activity,
      and/or biallelic pathogenic variants in GALT. In classic galactosemia,
      erythrocyte galactose-1-phosphate is usually >10 mg/dL and erythrocyte GALT
      enzyme activity is absent or barely detectable. In clinical variant galactosemia,
      erythrocyte GALT enzyme activity is close to or above 1% of control values but
      probably never >10%-15%. However, in African Americans with clinical variant
      galactosemia, the erythrocyte GALT enzyme activity may be absent or barely
      detectable but is often much higher in liver and in intestinal tissue (e.g., 10% 
      of control values). Virtually 100% of infants with classic galactosemia or
      clinical variant galactosemia can be detected in newborn screening programs that 
      include testing for galactosemia in their panel. However, infants with clinical
      variant galactosemia may be missed if the program only measures blood total
      galactose level and not erythrocyte GALT enzyme activity. MANAGEMENT: Treatment
      of manifestations: Standard of care in any newborn who is "screen-positive" for
      galactosemia is immediate dietary intervention while diagnostic testing is under 
      way. Once a diagnosis is confirmed, restriction of galactose intake is continued 
      and all milk products are replaced with lactose-free formulas (e.g., Isomil((R)) 
      or Prosobee((R))) containing non-galactose carbohydrates; dietary restrictions on
      all lactose-containing foods and other dairy products should continue throughout 
      life, although management of the diet becomes less important after infancy and
      early childhood. In rare instances, cataract surgery may be needed in the first
      year of life. Childhood apraxia of speech and dysarthria require expert speech
      therapy. Developmental assessment at age one year by a psychologist and/or
      developmental pediatrician is recommended in order to formulate a treatment plan 
      with the speech therapist and treating physician. For school-age children, an
      individual education plan and/or professional help with learning skills and
      special classrooms as needed. Hormone replacement therapy as needed for delayed
      pubertal development and/or primary or secondary amenorrhea. Stimulation with
      follicle-stimulating hormone may be useful in producing ovulation in some women. 
      Prevention of secondary complications: Recommended calcium, vitamin D, and
      vitamin K intake to help prevent decreased bone mineralization; standard
      treatment for gastrointestinal dysfunction. Surveillance: Biochemical genetics
      clinic visits every three months for the first year of life or as needed
      depending on the nature of the potential acute complications; every six months
      during the second year of life; yearly thereafter. Routine monitoring for: the
      accumulation of toxic analytes (e.g., erythrocyte galactose-1-phosphate and
      urinary galactitol); cataracts; speech and development; movement disorder; POI;
      nutritional deficiency; and osteoporosis. Agents/circumstances to avoid: Breast
      milk, proprietary infant formulas containing lactose, cow's milk, dairy products,
      and casein or whey-containing foods; medications with lactose and galactose.
      Evaluation of relatives at risk: To allow for earliest possible diagnosis and
      treatment of at-risk sibs: Perform prenatal diagnosis when the GALT pathogenic
      variants in the family are known; or If prenatal testing has not been performed, 
      test the newborn for either the family-specific GALT pathogenic variants or
      erythrocyte GALT enzyme activity. Pregnancy management: Women with classic
      galactosemia should maintain a lactose-restricted diet during pregnancy. GENETIC 
      COUNSELING: Classic galactosemia and clinical variant galactosemia are inherited 
      in an autosomal recessive manner. Couples who have had one affected child have a 
      25% chance of having an affected child in each subsequent pregnancy. Molecular
      genetic carrier testing for at-risk sibs and prenatal testing for pregnancies at 
      increased risk are an option if the GALT pathogenic variants in the family are
      known. If the GALT pathogenic variants in a family are not known, prenatal
      testing can rely on assay of GALT enzyme activity in cultured amniotic fluid
      cells.
CI  - Copyright (c) 1993-2021, University of Washington, Seattle. GeneReviews is a
      registered trademark of the University of Washington, Seattle. All rights
      reserved.
FED - Adam, Margaret P
ED  - Adam MP
FED - Ardinger, Holly H
ED  - Ardinger HH
FED - Pagon, Roberta A

I want the tag on the left side of each line to become a column name, and the value on the right side to go into the row for that record.

The output should be:

PMID       STAT        DA         CTDT
33237688   Publisher   20201126   20201125

I tried reading the text file as CSV, but it does not work:

  import pandas as pd

  medical = pd.read_csv("sepsis2015.txt",
                         sep="\n")
  print(medical)


Answer

The simplest way I know is:

  • read the data file and split it into records at blank lines:

    with open("sepsis2015.txt") as file:
        lines = file.readlines()
    lines = ''.join(lines).split('\n\n')
    

    this will give you a list of your records:

    ['PMID- 20301691 \nSTAT- Publisher\nDA  - 20100320\nDRDT- 20210311\nCTDT- 20000204\nPB  - University of Washington, Seattle\nDP  - 1993\nTI  - Classic Galactosemia and Clinical Variant Galactosemia\nBTI - GeneReviews((R))', '\nPMID- 33237688\nSTAT- Publisher\nDA  - 20201126\nCTDT- 20201125\nPB  - University of Washington, Seattle\nDP  - 1993\nTI  - MIRAGE Syndrome\nBTI - GeneReviews((R))']
    
  • convert the records stored in the lines list into a dictionary of dictionaries:

    data = {i: {item.split('-')[0].replace(' ', ''): item.split('-')[1][1:] for item in row.split('\n') if '-' in item} for i, row in enumerate(lines)}
    

    so you have:

    {0: {'PMID': '20301691', 'STAT': 'Publisher', 'DA': '20100320', 'DRDT': '20210311', 'CTDT': '20000204', 'PB': 'University of Washington, Seattle', 'DP': '1993', 'TI': 'Classic Galactosemia and Clinical Variant Galactosemia', 'BTI': 'GeneReviews((R))'}, 1: {'PMID': '33237688', 'STAT': 'Publisher', 'DA': '20201126', 'CTDT': '20201125', 'PB': 'University of Washington, Seattle', 'DP': '1993', 'TI': 'MIRAGE Syndrome', 'BTI': 'GeneReviews((R))'}}
    
  • finally, convert this dictionary into a pandas.DataFrame with:

    df = pd.DataFrame.from_dict(data, orient = 'index')
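
The dictionary comprehension above splits on every '-' and treats each line on its own, so with the full file from the question, wrapped values such as the multi-line AB abstract and repeated tags such as FED would probably not come through intact. Below is a minimal sketch of a more tolerant parser. It assumes that a tag is two to four uppercase letters followed by "- ", that continuation lines start with whitespace, and that records are separated by blank lines. The name parse_records and the choice to join repeated tags with "; " are illustrative and not part of the original answer.

```python
import re

# Assumed tag-line shape: "PMID- 20301691" or "DA  - 20100320"
# (a short uppercase tag, optional padding, a hyphen, then the value).
TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*- ?(.*)$")


def parse_records(text):
    """Parse blank-line-separated tagged records into a list of dicts."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {}
        tag = None
        for line in block.splitlines():
            match = TAG_LINE.match(line)
            if match:
                tag, value = match.group(1), match.group(2).strip()
                # Assumption: repeated tags (e.g. FED, ED) are joined with "; ".
                record[tag] = f"{record[tag]}; {value}" if tag in record else value
            elif tag is not None and line.strip():
                # Indented continuation line: append it to the previous field.
                record[tag] += " " + line.strip()
        if record:
            records.append(record)
    return records


sample = (
    "PMID- 20301691\n"
    "STAT- Publisher\n"
    "AB  - CLINICAL CHARACTERISTICS: The term \"galactosemia\" refers to disorders of\n"
    "      galactose metabolism that include classic galactosemia.\n"
    "FED - Adam, Margaret P\n"
    "FED - Ardinger, Holly H\n"
    "\n"
    "PMID- 33237688\n"
    "STAT- Publisher\n"
)
records = parse_records(sample)
print(records[0]["AB"])
print(records[0]["FED"])
```

The resulting list of dictionaries can be passed to pd.DataFrame(records) in place of the from_dict call; tags missing from a record then show up as NaN.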
    

Complete Code

import pandas as pd

with open(r'data/data.csv') as file:
    lines = file.readlines()
lines = ''.join(lines).split('\n\n')

data = {i: {item.split('-')[0].replace(' ', ''): item.split('-')[1][1:] for item in row.split('\n') if '-' in item} for i, row in enumerate(lines)}
print(data)
df = pd.DataFrame.from_dict(data, orient = 'index')
print(df)

Output:

       PMID       STAT        DA      DRDT      CTDT                                 PB    DP                                                      TI               BTI
0  20301691  Publisher  20100320  20210311  20000204  University of Washington, Seattle  1993  Classic Galactosemia and Clinical Variant Galactosemia  GeneReviews((R))
1  33237688  Publisher  20201126       NaN  20201125  University of Washington, Seattle  1993                                         MIRAGE Syndrome  GeneReviews((R))
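
The question asks for a CSV file with header names, which the code above stops short of writing. With pandas, df.to_csv("records.csv", index=False) should do it. As a dependency-free alternative, here is a small sketch using the standard csv module. The records list is sample data copied from the output above, and write_csv and the file name records.csv are hypothetical names chosen for illustration.

```python
import csv
import io

# Sample records with the tags from the question; the second one lacks DRDT.
records = [
    {"PMID": "20301691", "STAT": "Publisher", "DA": "20100320",
     "DRDT": "20210311", "CTDT": "20000204"},
    {"PMID": "33237688", "STAT": "Publisher", "DA": "20201126",
     "CTDT": "20201125"},
]

# Header row: every tag seen in any record, in first-seen order.
fieldnames = []
for record in records:
    for tag in record:
        if tag not in fieldnames:
            fieldnames.append(tag)


def write_csv(rows, header, target):
    """Write rows to an open text target; missing tags become empty cells."""
    writer = csv.DictWriter(target, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here so the result can be inspected.
buffer = io.StringIO()
write_csv(records, fieldnames, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, open it with open("records.csv", "w", newline="") and pass the file object as target.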
User contributions licensed under: CC BY-SA