My data looks like this (these are two of the columns):
Index  MSC Primary  MSC Secondary
0      05C25        (05C20 20F05)
1      20-04        (20F55 62Qxx)
2      13F20        (17B20 22E40 22F30 65Fxx)
3      05Exx        (20-04 20H20)
4      20G40        (05C25)
These are MSC codes, corresponding to different areas of mathematics. I need to replace each code with the corresponding subject from this dict here: https://mathscinet.ams.org/msnhtml/msc2020.pdf. Some of them are:
"""
00 General and overarching topics; collections
01 History and biography
03 Mathematical logic and foundations
05 Combinatorics
"""
First I need to isolate the first two digits of each code: for instance, 05C25 should become 05, and in the second column (05E15 14H50) should become 05, 14.
Then I need each number replaced by the corresponding subject, for example 05, 14 should become Combinatorics, Algebraic geometry.
This is all tricky for me because I am new to Python, and the second column has a different number of MSC codes in each cell, so I cannot use indexing there.
I know for the first column I can use indexing:
df['MSC Primary'] = [x[:2] for x in df['MSC Primary']]
But this is not working for the other column, because there are several secondary MSC codes, different for each cell.
Thank you for your help, much appreciated.
Answer
Here are a couple of options:
Using a regex to get the first two chars of each “word”:
df["MSC Secondary"] = (
    df["MSC Secondary"]
    .str.extractall(r"[ (](\w{2})")[0]
    .map(d)
    .groupby(level=0).agg(list)
)
Using:
- str.extractall to apply the regex [ (](\w{2}) and get the first two characters of every word in each row
- map to map the dict d over the [0] (zeroth) match group
- groupby(level=0).agg(list) to group the Series by index (level=0) and put the values back into lists (.agg(list))
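For reference, here is a minimal self-contained sketch of the regex approach, run against the sample rows from the question. The dict d below is only a hand-typed excerpt of the MSC list (an assumption for illustration); in practice it would need an entry for every two-digit code.

```python
import pandas as pd

# Hypothetical excerpt of the MSC lookup; the real dict should cover all codes.
d = {
    "05": "Combinatorics",
    "17": "Nonassociative rings and algebras",
    "20": "Group theory and generalizations",
    "22": "Topological groups, Lie groups",
    "62": "Statistics",
    "65": "Numerical analysis",
}

# Sample data copied from the question.
df = pd.DataFrame({
    "MSC Primary": ["05C25", "20-04", "13F20", "05Exx", "20G40"],
    "MSC Secondary": [
        "(05C20 20F05)",
        "(20F55 62Qxx)",
        "(17B20 22E40 22F30 65Fxx)",
        "(20-04 20H20)",
        "(05C25)",
    ],
})

# One row per match, indexed by (original row, match number);
# column 0 holds the first capture group.
matches = df["MSC Secondary"].str.extractall(r"[ (](\w{2})")[0]

# Translate each two-character code, then collect the results per original row.
subjects = matches.map(d).groupby(level=0).agg(list)
print(subjects)
```

The character class [ (] requires a space or an opening parenthesis before the captured pair, which is why the 04 in 20-04 is not picked up as a separate code.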
Through a few chained pandas str methods:
d = dict(...)
df["MSC Secondary"] = (
    df["MSC Secondary"]
    .str.strip("()")
    .str.split()
    .explode()
    .str[:2]
    .map(d)
    .groupby(level=0)
    .agg(list)
)
#   MSC Primary                                      MSC Secondary
# 0       05C25  [Combinatorics, Group theory and generalizations]
# 1       20-04     [Group theory and generalizations, Statistics]
# 2       13F20  [Nonassociative rings and algebras, Topologica...
# 3       05Exx  [Group theory and generalizations, Group theor...
# 4       20G40                                    [Combinatorics]
Here we use:
- pandas.Series.str.strip to remove the parentheses
- pandas.Series.str.split to split the substrings into lists
- pandas.Series.explode to turn every element in each list into its own row
- str[:2] to slice off the first two characters
- map to map your linked dict
- groupby(level=0).agg(list) to group the Series by index (level=0) and put the values back into lists (.agg(list))
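If you would rather end up with a comma-separated string per cell (as in the question's "Combinatorics, Algebraic geometry" example) and also translate the primary column, a variation along these lines might work. This is a sketch, not a tested drop-in: d is again a hand-typed excerpt of the MSC list, and the join-based aggregation assumes every code is present in d (a missing code would map to NaN and break the join).

```python
import pandas as pd

# Hypothetical excerpt of the MSC lookup; the real dict should cover all codes.
d = {
    "05": "Combinatorics",
    "13": "Commutative algebra",
    "17": "Nonassociative rings and algebras",
    "20": "Group theory and generalizations",
    "22": "Topological groups, Lie groups",
    "62": "Statistics",
    "65": "Numerical analysis",
}

# Sample data copied from the question.
df = pd.DataFrame({
    "MSC Primary": ["05C25", "20-04", "13F20", "05Exx", "20G40"],
    "MSC Secondary": [
        "(05C20 20F05)",
        "(20F55 62Qxx)",
        "(17B20 22E40 22F30 65Fxx)",
        "(20-04 20H20)",
        "(05C25)",
    ],
})

# Same chain as above, but join the subjects into one string per row.
secondary = (
    df["MSC Secondary"]
    .str.strip("()")
    .str.split()
    .explode()
    .str[:2]
    .map(d)
    .groupby(level=0)
    .agg(lambda s: ", ".join(s))
)

# The primary column holds a single code per cell, so slice and map directly.
primary = df["MSC Primary"].str[:2].map(d)

df["MSC Secondary"] = secondary
df["MSC Primary"] = primary
print(df)
```

Some subject names contain commas themselves (for example "Topological groups, Lie groups"), so a different separator such as "; " may be easier to read back later.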