I have a column in a csv file which contains person’s details in this format:
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Team | Members | +--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Team 1 | OK-10:Jason:Jones:ID No:00000000:male:my notes | | Team 2 | OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes | | Team 3 | OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46 | +--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Actual csv format:
"Team", "Members" Team 1, OK-10:Jason:Jones:ID No:00000000:male:my notes Team 2, OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes Team 3, OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46
I want to split them in a new csv file like this:
+-------+-------------+-------------+----------------+------------------+---------------+---------------+--------------+ | Team | Member_Rank | Member_Name | Member_Surname | Member_ID_Method | Member_ID_Num | Member_Gender | Member_Notes | +-------+-------------+-------------+----------------+------------------+---------------+---------------+--------------+ | Team1 | OK-10 | Jason | Jones | ID No | 00000000 | male | my notes | | Team2 | OK-10 | Mike | James | ID No | 00000001 | male | my notes | | Team2 | OZ-09 | John | Rick | ID No | 00000002 | male | my notes | +-------+-------------+-------------+----------------+------------------+---------------+---------------+--------------+
Splitting details:
Split Row Delimiter : ' O&-'
where & can be only 'K'
or 'Z'
Split Column Delimiter : ':'
,columns number in new csv file is fixed
(One Team can contain many members, there is no upper limit)
UPDATE
By using this code provided by @Adirio I get only the last member from fields with multiple members:
import csv import re members_split_regex = re.compile(r'(O[KZ]-d+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+)(?= O[KZ]|$)') with open('test.csv') as input_file, open('output_csv.csv', 'w', newline='') as output_file: csv_reader = csv.DictReader(input_file) fieldnames = csv_reader.fieldnames.copy() fieldnames.remove('Members') csv_writer = csv.DictWriter(output_file, extrasaction='ignore', fieldnames=fieldnames + ['Member_Rank', 'Member_Name', 'Member_Surname', 'Member_ID_Method', 'Member_ID_Num', 'Member_Gender', 'Member_Notes']) csv_writer.writeheader() for row in csv_reader: for member_tuple in members_split_regex.findall(row['Members']): member_dict = {} ( member_dict['Member_Rank'], member_dict['Member_Name'], member_dict['Member_Surname'], member_dict['Member_ID_Method'], member_dict['Member_ID_Num'], member_dict['Member_Gender'], member_dict['Member_Notes'] ) = member_tuple print(row['Members']) print(member_tuple) member_dict.update(row) csv_writer.writerow(member_dict)
print results:
row[‘Members’] ->
OK-1:name1:sunrmae2:ID No:id1233123:male:note12 OK-10:name2:sunrame2:Passport No:asda3243242:female:note2 OZ-1:nma3:surname3:Passport No:asd213131:other:note 56
print(member_tuple) ->
(‘OZ-1’, ‘nma3’, ‘surname3’, ‘Passport No’, ‘asd213131’, ‘other’, ‘note 56’)
Advertisement
Answer
Based on @DeepSpace answer but with a fixed regex and new requirements added:
import csv import re members_split_regex = re.compile(r'(O[KZ]-d+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+):([a-zA-Z0-9 ]+)(?= O[KZ]|$)') with open('test.csv') as input_file, open('output_csv', 'w', newline='') as output_file: csv_reader = csv.DictReader(input_file) fieldnames = csv_reader.fieldnames.copy() fieldnames.remove('Members') csv_writer = csv.DictWriter(output_file, extrasaction='ignore', fieldnames=fieldnames + ['Member_Rank', 'Member_Name', 'Member_Surname', 'Member_ID_Method', 'Member_ID_Num', 'Member_Gender', 'Member_Notes']) csv_writer.writeheader() for row in csv_reader: for member_tuple in members_split_regex.findall(row['Members']): member_dict = {} ( member_dict['Member_Rank'], member_dict['Member_Name'], member_dict['Member_Surname'], member_dict['Member_ID_Method'], member_dict['Member_ID_Num'], member_dict['Member_Gender'], member_dict['Member_Notes'] ) = member_tuple member_dict.update(row) csv_writer.writerow(member_dict)
The main difference is that I’m deleting the column from the dictionary so that we can use it to update our new dictionary. This way we do not only copy the “Team” column but any other column that is not “Members”. To do so the fieldnames of the reader are also copied, the “Members” item removed, and the new ones added to the fieldnames of the writter.
The used regex doesn’t hardcode any field, allows spaces in names and surnames, capital Os in the notes, and ID fields that are not just 8-digit numbers.