Extract specific symbols from pandas cells, then replace them with values from a dict where they are keys

My data looks like this (these are two of the columns):

Index  MSC Primary   MSC Secondary
0       05C25      (05C20 20F05)
1       20-04      (20F55 62Qxx)
2       13F20      (17B20 22E40 22F30 65Fxx)
3       05Exx      (20-04 20H20)
4       20G40      (05C25)

These are MSC codes, corresponding to different areas of science. I need to replace each code with the corresponding subject from this dict here: https://mathscinet.ams.org/msnhtml/msc2020.pdf. Some of the entries are:

00 General and overarching topics; collections
01 History and biography
03 Mathematical logic and foundations
05 Combinatorics

First I need to isolate the first two digits of each code: for instance, 05C25 should become 05, and a cell in the second column such as (05E15 14H50) should become 05, 14.

Then I need each number replaced by the corresponding subject, for example 05, 14 should become Combinatorics, Algebraic geometry. This is all tricky for me because I am new to Python, and the second column has a different number of MSC codes in each cell, so I cannot use indexing there.

I know for the first column I can use indexing:

df['MSC Primary'] = [x[:2] for x in df['MSC Primary']]

But this is not working for the other column, because there are several secondary MSC codes, different for each cell.

Thank you for your help, much appreciated.

Answer

Here are a couple of options:

Using a regex to get the first two chars of each “word”:

df["MSC Secondary"] = (
    df["MSC Secondary"]
    .str.extractall(r"[ (](\w{2})")[0]
    .map(d)
    .groupby(level=0).agg(list)
)

Using:

  • str.extractall to apply the regex [ (](\w{2}) and capture the first two characters of every word in each row
  • map to look up each captured value in the dict d, applied to column 0 (the first capture group) of the extractall result
  • groupby(level=0).agg(list) to group the Series by the original index (level=0) and collect the values back into lists (.agg(list))
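
To see what each step produces, here is a small self-contained sketch of the regex approach. The dict d below is a hypothetical three-entry excerpt of the MSC list, and the sample frame is a shortened version of the data in the question; neither is the full data.

```python
import pandas as pd

# Hypothetical excerpt of the lookup dict `d` (only the codes used below).
d = {
    "05": "Combinatorics",
    "20": "Group theory and generalizations",
    "62": "Statistics",
}

df = pd.DataFrame({"MSC Secondary": ["(05C20 20F05)", "(20F55 62Qxx)", "(05C25)"]})

# One row per match, indexed by (original row, match number).
# Column 0 holds the single capture group: the two characters after "(" or " ".
prefixes = df["MSC Secondary"].str.extractall(r"[ (](\w{2})")[0]

# Look each prefix up, then collect the matches back into one list per original row.
subjects = prefixes.map(d).groupby(level=0).agg(list)
print(subjects)
```

The intermediate prefixes Series has a two-level index, which is why groupby(level=0) can reassemble the matches by their original row.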

Through a few chained pandas str methods:

d = dict(...)
df["MSC Secondary"] = (
    df["MSC Secondary"]
    .str.strip("()")
    .str.split()
    .explode()
    .str[:2]
    .map(d)
    .groupby(level=0)
    .agg(list)
)

#   MSC Primary                                      MSC Secondary
# 0       05C25  [Combinatorics, Group theory and generalizations]
# 1       20-04     [Group theory and generalizations, Statistics]
# 2       13F20  [Nonassociative rings and algebras, Topologica...
# 3       05Exx  [Group theory and generalizations, Group theor...
# 4       20G40                                    [Combinatorics]

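Here we use:

  • str.strip("()") to remove the surrounding parentheses
  • str.split() to split each cell on whitespace into a list of codes
  • explode() to give each code its own row while keeping the original index
  • str[:2] to keep the first two characters of each code
  • map(d) to look up each two-character prefix in the dict d
  • groupby(level=0).agg(list) to collect the subjects back into one list per original row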
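
A minimal end-to-end sketch of the chained approach, covering both columns and joining the subjects into a single comma-separated string as the question describes. The dict d is again a hypothetical excerpt and the frame is a shortened sample; swapping agg(list) for agg(", ".join) is a variation, not part of the original answer.

```python
import pandas as pd

# Hypothetical excerpt of the lookup dict `d` (only the codes used below).
d = {
    "05": "Combinatorics",
    "20": "Group theory and generalizations",
    "62": "Statistics",
}

df = pd.DataFrame(
    {
        "MSC Primary": ["05C25", "20-04", "20G40"],
        "MSC Secondary": ["(05C20 20F05)", "(20F55 62Qxx)", "(05C25)"],
    }
)

# Primary column: one code per cell, so slice and map directly.
df["MSC Primary"] = df["MSC Primary"].str[:2].map(d)

# Secondary column: split into one code per row, map, then rejoin per original row.
# Note: a prefix missing from `d` maps to NaN, which ", ".join cannot handle,
# so this variant assumes every prefix has an entry in the dict.
df["MSC Secondary"] = (
    df["MSC Secondary"]
    .str.strip("()")
    .str.split()
    .explode()
    .str[:2]
    .map(d)
    .groupby(level=0)
    .agg(", ".join)
)
print(df)
```

Keeping agg(list) instead gives lists that are easier to process further; the joined string is closer to the output requested in the question.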

User contributions licensed under: CC BY-SA