Python csv: Split column to columns and then to rows by delimiter

Tags: ,



I have a column in a csv file which contains person’s details in this format:

+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  Team  |                                                                                                Members                                                                                                 |
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Team 1 | OK-10:Jason:Jones:ID No:00000000:male:my notes                                                                                                                                                         |
| Team 2 | OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes                                                                                                             |
| Team 3 | OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46 |
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Actual csv format:

"Team", "Members"                                                                                                 
 Team 1, OK-10:Jason:Jones:ID No:00000000:male:my notes                                                                                                                                                         
 Team 2, OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes                                                                                                             
 Team 3, OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46

I want to split them in a new csv file like this:

+-------+-------------+-------------+----------------+------------------+---------------+---------------+--------------+
| Team  | Member_Rank | Member_Name | Member_Surname | Member_ID_Method | Member_ID_Num | Member_Gender | Member_Notes |
+-------+-------------+-------------+----------------+------------------+---------------+---------------+--------------+
| Team1 | OK-10       | Jason       | Jones          | ID No            |      00000000 | male          | my notes     |
| Team2 | OK-10       | Mike        | James          | ID No            |      00000001 | male          | my notes     |
| Team2 | OZ-09       | John        | Rick           | ID No            |      00000002 | male          | my notes     |
+-------+-------------+-------------+----------------+------------------+---------------+---------------+--------------+

Splitting details:

Split Row Delimiter : ' O&-' where & can be only 'K' or 'Z'

Split Column Delimiter : ':' ,columns number in new csv file is fixed

(One Team can contain many members, there is no upper limit)

UPDATE

By using this code provided by @Adirio I get only the last member from fields with multiple members:

import csv
import re


members_split_regex = re.compile(r'(O[KZ]-d+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+)(?= O[KZ]|$)')

with open('test.csv') as input_file, open('output_csv.csv', 'w', newline='') as output_file:
    csv_reader = csv.DictReader(input_file)
    fieldnames = csv_reader.fieldnames.copy()
    fieldnames.remove('Members')
    csv_writer = csv.DictWriter(output_file, extrasaction='ignore', fieldnames=fieldnames + ['Member_Rank', 'Member_Name', 'Member_Surname', 'Member_ID_Method', 'Member_ID_Num', 'Member_Gender', 'Member_Notes'])
    csv_writer.writeheader()
    for row in csv_reader:
        for member_tuple in members_split_regex.findall(row['Members']):
                member_dict = {}
                (
                    member_dict['Member_Rank'],
                    member_dict['Member_Name'],
                    member_dict['Member_Surname'],
                    member_dict['Member_ID_Method'],
                    member_dict['Member_ID_Num'],
                    member_dict['Member_Gender'],
                    member_dict['Member_Notes']
                ) = member_tuple
                print(row['Members'])
                print(member_tuple)
                member_dict.update(row)
                csv_writer.writerow(member_dict)

print results:

row[‘Members’] ->

OK-1:name1:sunrmae2:ID No:id1233123:male:note12 OK-10:name2:sunrame2:Passport No:asda3243242:female:note2 OZ-1:nma3:surname3:Passport No:asd213131:other:note 56

print(member_tuple) ->

(‘OZ-1’, ‘nma3’, ‘surname3’, ‘Passport No’, ‘asd213131’, ‘other’, ‘note 56’)

Answer

Based on @DeepSpace answer but with a fixed regex and new requirements added:

import csv
import re


members_split_regex = re.compile(r'(O[KZ]-d+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+)(?= O[KZ]|$)')

with open('test.csv') as input_file, open('output_csv', 'w', newline='') as output_file:
    csv_reader = csv.DictReader(input_file)
    fieldnames = csv_reader.fieldnames.copy()
    fieldnames.remove('Members')
    csv_writer = csv.DictWriter(output_file, extrasaction='ignore', fieldnames=fieldnames + ['Member_Rank', 'Member_Name', 'Member_Surname', 'Member_ID_Method', 'Member_ID_Num', 'Member_Gender', 'Member_Notes'])
    csv_writer.writeheader()
    for row in csv_reader:
        for member_tuple in members_split_regex.findall(row['Members']):
            member_dict = {}
            (
                member_dict['Member_Rank'],
                member_dict['Member_Name'],
                member_dict['Member_Surname'],
                member_dict['Member_ID_Method'],
                member_dict['Member_ID_Num'],
                member_dict['Member_Gender'],
                member_dict['Member_Notes']
            ) = member_tuple
            member_dict.update(row)
            csv_writer.writerow(member_dict)

The main difference is that I’m deleting the column from the dictionary so that we can use it to update our new dictionary. This way we do not only copy the “Team” column but any other column that is not “Members”. To do so the fieldnames of the reader are also copied, the “Members” item removed, and the new ones added to the fieldnames of the writter.

The used regex doesn’t hardcode any field, allows spaces in names and surnames, capital Os in the notes, and ID fields that are not just 8-digit numbers.



Source: stackoverflow