Reformatting a string to simulate JSON using Python regex

What I want to do is essentially reformat a string and make it pass the jsonschema validate function.

Simple enough at face value. However, the tricky part is that the string is read in from a file and can vary in its appearance and formatting.

For example:

{
    key:"value",
    ...
}

OR

{
    "key":'value'
    ,... 
}

Or any other combination of single quotes, double quotes, no quotes, arrays, strings and so on, and the values may contain an apostrophe.

I want to be liberal with the regex rules, treat the input data as unstructured, and normalise whatever is passed in at run time into a uniform format so that I can check its quality.

Using Python, my approach so far has been to iterate over the most common transforms I want to perform. I suspect that what I want is achievable, probably in an obvious way, and although I haven’t finished it yet, I thought I’d save myself the headache and ask for help.

A more thorough example is here (I have tried to demonstrate most of the issues I’ve found so far):

{
    key_a:'my value',
    key_b:"my other value"
    ,nested_value_a:{
        sub_a: "value a",
        'sub_b':'value b',
        "sub_c":'isn't value b very interesting'
    }
    ,key_c:{ 
        sub_d: ["value"],
        sub_e:   ["value"]
        },
}

Output I want:

{
    "key_a":"my value",
    "key_b":"my other value"
    ,"nested_value_a":{
        "sub_a": "value a",
        "sub_b":"value b",
        "sub_c":"isn't value b very interesting"
    }
    ,"key_c":{ 
        "sub_d": ["value"],
        "sub_e":   ["value"]
        }
}

I have tried this as my first step, but I’m convinced I’m going about it the hard way. I wanted to combine negative lookaheads and lookbehinds so that I could globally swap all single quotes for double quotes, skipping any single quote sandwiched between two letters, but I couldn’t get it to work on my own. All help is much appreciated.

Thanks

Answer

The only real problem is the apostrophe in the word “isn’t”. Consider the string

f = """{
    key_a:'my value',
    key_b:"my other value"
    ,nested_value_a:{
        sub_a: "value a",
        'sub_b':'value b',
        "sub_c":'isn't value b very interesting'
    }
    ,key_c:{ 
        sub_d: ["value"],
        sub_e:   ["value"]
        }
}"""

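One way to deal with it is to replace every single quote with a double quote and then turn the quote that belongs to the apostrophe back into a single quote:

# Turn every single quote into a double quote.
f = f.replace("'", '"')
# Restore the apostrophe in "isn't". This fits the sample only: it would
# also alter any value that starts with the letter t.
f = f.replace('"t', "'t")
print(f)

which will print this string

{
    key_a:"my value",
    key_b:"my other value"
    ,nested_value_a:{
        sub_a: "value a",
        "sub_b":"value b",
        "sub_c":"isn't value b very interesting"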
    }
    ,key_c:{ 
        sub_d: ["value"],
        sub_e:   ["value"]
        }
}

Not the most beautiful thing. After this, you simply do the following (this needs the third-party dirtyjson package in addition to the standard json module):

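import json

import dirtyjson

# dirtyjson tolerates the unquoted keys that are still in f.
d = dirtyjson.loads(f)
# Use a new name so the json module is not shadowed.
result = json.dumps(d, sort_keys=False)
print(result)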

which gives you

{"key_a": "my value", "key_b": "my other value", "nested_value_a": {"sub_a": "value a", "sub_b": "value b", "sub_c": "isn't value b very interesting"}, "key_c": {"sub_d": ["value"], "sub_e": ["value"]}}
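
The lookaround idea from the question can also be sketched with the standard library alone. The following is a minimal sketch, not a general parser: the helper name `normalize_loose_json` and the variable `sample` are made up for illustration, and the three rules assume that an apostrophe always sits between two letters, that values never contain a bare word followed by a colon, and that single-quoted values contain no double quotes.

```python
import json
import re

# The sample from the question, including the trailing comma after key_c.
sample = """{
    key_a:'my value',
    key_b:"my other value"
    ,nested_value_a:{
        sub_a: "value a",
        'sub_b':'value b',
        "sub_c":'isn't value b very interesting'
    }
    ,key_c:{
        sub_d: ["value"],
        sub_e:   ["value"]
        },
}"""


def normalize_loose_json(text: str) -> str:
    """Best-effort cleanup of JSON-like text (hypothetical helper)."""
    # 1. Swap a single quote for a double quote unless it sits between
    #    two letters, which is treated as an apostrophe (as in "isn't").
    text = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", '"', text)
    # 2. Quote bare keys: an identifier preceded by '{', ',' or whitespace
    #    and followed by a colon.
    text = re.sub(r'(?<=[{,\s])([A-Za-z_]\w*)(\s*:)', r'"\1"\2', text)
    # 3. Drop a trailing comma before a closing brace or bracket.
    text = re.sub(r',(\s*[}\]])', r'\1', text)
    return text


data = json.loads(normalize_loose_json(sample))
print(data["nested_value_a"]["sub_c"])  # isn't value b very interesting
```

Once `json.loads` succeeds, the resulting dict can be passed on to the jsonschema validate function. Input that breaks the assumptions above, such as a possessive like `dogs'` at the end of a word, will still need a tolerant parser such as dirtyjson.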