
How to distinguish between floats, ints and scientific notation

I’m writing a custom JSON compressor. It will be reading numbers in all formats. How do I print the values of the JSON in the format in which they are given, using json.load()? I would also like to preserve the type.

Example of a file it would have to read would be:

{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}

I would also want it to be able to distinguish between 1, 1.0 and true.

When I do a basic for loop to print the values and their types with json.load(), it prints out

301 int
301.0 float
301.0 float
301 str
301.0 str
3.01E2 str
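
For reference, a loop along these lines reproduces the output above. This is a minimal sketch: the io.StringIO wrapper stands in for the real file, and describe is a hypothetical helper name.

```python
import io
import json

# Sample document from the question, wrapped in a file-like object so that
# json.load() can be used on it as on a real file.
sample = io.StringIO(
    '{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}'
)

def describe(fp):
    """Return one 'value type' line per top-level entry."""
    return [f"{value} {type(value).__name__}" for value in json.load(fp).values()]

lines = describe(sample)
print("\n".join(lines))
```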

And yes, I understand that scientific notations are floats

Expected output would be

301 int
301.0 float
3.01E2 float
301 str
301.0 str
3.01E2 str

Answer

So IIUC, you want to keep the original formatting of each number in the JSON, even when the value is parsed as a float. One way to do this is to change the type in the JSON text itself, i.e. to add quotes around the numeric elements so that they are read as strings.

This can be done with regex:

import json
import re

data = """{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2", "g": true, "h":"hello"}"""

# the crucial part: enclosing float/int in quotes:
pattern = re.compile(r'(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b')
data_str = pattern.sub(r'"\1"', data)

val_dict = json.loads(data) # the values as normally read by the json module
type_dict = {k: type(v) for k,v in val_dict.items()} # their types
repr_dict = json.loads(data_str) # the representations (every number is a string there)

# using Pandas for pretty formatting
import pandas as pd
df = pd.DataFrame([val_dict, type_dict, repr_dict], index=["Value", "Type", "Repr."]).T
print(df)

Output:

    Value             Type   Repr.
a     301    <class 'int'>     301
b   301.0  <class 'float'>   301.0
c   301.0  <class 'float'>  3.01E2
d     301    <class 'str'>     301
e   301.0    <class 'str'>   301.0
f  3.01E2    <class 'str'>  3.01E2
g    True   <class 'bool'>    True
h   hello    <class 'str'>   hello
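
To get the plain listing asked for in the question rather than a table, the quoted representation can be paired with the type reported by a normal parse. The following is a minimal sketch, assuming a flat top-level object as in the example; describe_numbers is a hypothetical helper name, not part of the answer above.

```python
import json
import re

# Same idea as above: quote every bare number that directly follows a colon.
NUMBER = re.compile(r'(?<=:)\s*([+-]?\d+(?:\.\d*(?:E-?\d+)?)?)\b')

def describe_numbers(text):
    """Return 'representation type' lines for a flat JSON object."""
    values = json.loads(text)                      # normal parse: real types
    reprs = json.loads(NUMBER.sub(r'"\1"', text))  # numbers kept as written
    lines = []
    for key, value in values.items():
        # Strings and numbers use the original text; other values (true,
        # false, null) were not quoted, so fall back to their JSON encoding.
        shown = reprs[key] if isinstance(reprs[key], str) else json.dumps(value)
        lines.append(f"{shown} {type(value).__name__}")
    return lines

data = '{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}'
print("\n".join(describe_numbers(data)))
```

With this sketch, 1, 1.0 and true in an input such as {"x":1, "y":1.0, "z":true} come out as "1 int", "1.0 float" and "true bool".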

Here are the details of the regex:

  • ([+-]?\d+(?:\.\d*(?:E-?\d+)?)?) this is our capturing group, consisting of:
    • [+-]? an optional leading + or -
    • \d+ one or more digits, optionally followed by:
    • (?:\.\d*(?:E-?\d+)?)? a non-capturing group made of
      • \. a literal dot
      • \d* zero or more digits
      • (optionally) an E with an optional minus sign -, followed by one or more digits \d+
  • \b specifies a word boundary (so that the match doesn’t cut a series of digits short)
  • (?<=:) is a lookbehind, ensuring the match starts directly after a : (so we don’t add quotes around existing strings)
  • \s* any whitespace before the number is matched and dropped by the replacement

\1 is a backreference to our (1st) group. So we replace the whole match with "\1", i.e. the captured number wrapped in double quotes.

Edit: slightly changed the regex to replace numbers directly following : and to take a leading +/- into account
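
As an alternative sketch that leaves the JSON text untouched (a different technique from the regex above): json.loads accepts parse_int and parse_float hooks, which are called with the original text of each number. The RawInt and RawFloat classes and the kind helper below are hypothetical names introduced for illustration.

```python
import json

class RawInt(int):
    """An int that remembers the text it was parsed from."""
    def __new__(cls, text):
        obj = super().__new__(cls, text)
        obj.raw = text
        return obj

class RawFloat(float):
    """A float that remembers the text it was parsed from."""
    def __new__(cls, text):
        obj = super().__new__(cls, text)
        obj.raw = text
        return obj

def kind(value):
    """Report the underlying type name, hiding the wrapper classes."""
    if isinstance(value, RawInt):
        return "int"
    if isinstance(value, RawFloat):
        return "float"
    return type(value).__name__

data = '{"a":301, "b":301.0, "c":3.01E2, "d":"301", "e":"301.0", "f":"3.01E2"}'
parsed = json.loads(data, parse_int=RawInt, parse_float=RawFloat)

for key, value in parsed.items():
    # Numbers carry their original text in .raw; strings are printed as-is.
    print(getattr(value, "raw", value), kind(value))
```

The parsed values still behave as ordinary numbers (parsed["c"] == 301.0), while .raw keeps the spelling from the file. This also works for numbers nested in arrays or sub-objects, since the hooks are called for every number in the document.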

User contributions licensed under: CC BY-SA