How to split a string in Python on separators while keeping each separator as part of a chunk?




Looking for an elegant way to:

  1. Split a string based on a separator
  2. Instead of discarding the separator, make it part of the resulting chunks.

For instance, I have date and time data like:

D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30

Sometimes there is a leading D and sometimes not (but I always want it to be part of the first chunk), the time has no leading or trailing zeros, and the timezone only sometimes contains a ':'. The point is that I have to split on the 'D', 'T' and '+' characters because the segments do not all have the same length; if they did, it would be easier to split by index. I want to split on multiple characters such as T and +, and keep them as part of the data, like:

['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']

I know a nicer way would be to clean the data first and normalize all rows to follow the same pattern, but I am curious how to do it with the data as it is.

For now, my ugly solution looks like:

[i+j for _, i in enumerate(['D','T','TZ']) for __, j in enumerate('D2018-4-21T3:55+6'.replace('T',' ').replace('D', ' ').replace('+', ' +').split()) if _ == __]

Answer

Use a regular expression

Reference: https://docs.python.org/3/library/re.html

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].
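
As a rough illustration of how groups relate to the original question (a sketch, not part of the quoted documentation): when the pattern passed to re.split contains a capturing group, the matched separators are returned in the result list as well, so each one can be glued back onto the chunk that follows it. The names `parts` and `chunks` are only illustrative, and this version does not add the 'TZ' prefix.

```python
import re

stamp = 'D2018-4-21T3:55+6'

# The parentheses form a capturing group, so re.split keeps the
# separators ('T' and '+') as their own items in the result.
parts = re.split(r'([T+])', stamp)
print(parts)   # ['D2018-4-21', 'T', '3:55', '+', '6']

# Odd positions hold separators, the following even positions hold the
# text after them; join each separator with the text that follows it.
chunks = [parts[0]] + [sep + body for sep, body in zip(parts[1::2], parts[2::2])]
print(chunks)  # ['D2018-4-21', 'T3:55', '+6']
```

This keeps every separator attached to its chunk without listing the fields by hand, but it only splits on '+', so a negative offset would need a more specific pattern.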

import re
a = '''D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30'''

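b = a.splitlines()
for i in b:
    # Optional leading 'D', then the date, the 'T'-prefixed time,
    # and the signed timezone offset, each in its own group.
    m = re.search(r'^D?(.*)(T.*?)([-+].*)$', i)
    if m:
        # Re-attach the 'D' and 'TZ' prefixes so every row has the same shape.
        print(["D%s" % m.group(1), m.group(2), "TZ%s" % m.group(3)])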

Result:

['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
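
If the same split is needed in more than one place, the approach could be wrapped in a small helper. The sketch below is one possible variation rather than the code from the answer: the function name `split_stamp` and the group names `date`, `time` and `tz` are made up for illustration, it uses re.fullmatch with named groups, and it assumes the date never contains a 'T' and the time never contains '+' or '-'.

```python
import re

# Hypothetical pattern: optional 'D', date up to 'T', time up to the sign,
# then the signed offset. The group names are chosen for this example only.
STAMP_RE = re.compile(r'D?(?P<date>[^T]+)T(?P<time>[^+-]+)(?P<tz>[+-].+)')

def split_stamp(stamp):
    """Return [date, time, timezone] chunks with 'D', 'T' and 'TZ' prefixes."""
    m = STAMP_RE.fullmatch(stamp)
    if m is None:
        # Fail loudly on rows that do not fit the assumed layout.
        raise ValueError('unexpected timestamp format: %r' % stamp)
    return ['D' + m.group('date'), 'T' + m.group('time'), 'TZ' + m.group('tz')]

for row in ['D2018-4-21T3:55+6', '2018-4-4T3:15+6', 'D2018-11-21T12:45+6:30']:
    print(split_stamp(row))
```

Raising an error for rows that do not match is a design choice made for this sketch; the loop in the answer silently skips such rows instead.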


Source: stackoverflow