My data looks like this (these are two of the columns):
Index  MSC Primary  MSC Secondary
0      05C25        (05C20 20F05)
1      20-04        (20F55 62Qxx)
2      13F20        (17B20 22E40 22F30 65Fxx)
3      05Exx        (20-04 20H20)
4      20G40        (05C25)
These are MSC codes, corresponding to different areas of science. I need to replace each code with the corresponding subject from this dict here: https://mathscinet.ams.org/msnhtml/msc2020.pdf , some of them are:
"""
00 General and overarching topics; collections
01 History and biography
03 Mathematical logic and foundations
05 Combinatorics
"""
First I need to isolate the first two digits from each code, for instance 05C25 to transform to 05, or from the second column (05E15 14H50) to transform to 05, 14.
Then I need each number replaced by the corresponding science, for example 05, 14 to become Combinatorics, Algebraic geometry.
This is all tricky for me because I am new to Python, and the second column has a different number of MSC codes in each cell, so I cannot use indexing there.
I know for the first column I can use indexing:
df['MSC Primary'] = [x[:2] for x in df['MSC Primary']]
But this is not working for the other column, because there are several secondary MSC codes, different for each cell.
Thank you for your help, much appreciated.
Answer
Here are a couple of options:
Using a regex to get the first two chars of each “word”:
    df["MSC Secondary"] = (
        df["MSC Secondary"]
        .str.extractall(r"[ (](\w{2})")[0]
        .map(d)
        .groupby(level=0).agg(list)
    )
Using:
str.extractall to apply the regex [ (](\w{2}) and get the first two characters from all words in each row
map to map the dict, d, over the [0] (zeroth) match group
groupby(level=0).agg(list) to group the Series by index (level=0) and put them back into lists (.agg(list))
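To make the regex option easier to try out, here is a minimal self-contained sketch built on the sample rows from the question. The dict d below is only a hand-typed excerpt of the MSC list (an assumption for illustration); the real one would need an entry for every two-digit code. The names prefixes and subjects are hypothetical helpers, not part of the original answer.

```python
import pandas as pd

# Assumed excerpt of the MSC lookup; the full dict would cover every two-digit code.
d = {
    "05": "Combinatorics",
    "17": "Nonassociative rings and algebras",
    "20": "Group theory and generalizations",
    "22": "Topological groups, Lie groups",
    "62": "Statistics",
    "65": "Numerical analysis",
}

df = pd.DataFrame({
    "MSC Primary": ["05C25", "20-04", "13F20", "05Exx", "20G40"],
    "MSC Secondary": [
        "(05C20 20F05)",
        "(20F55 62Qxx)",
        "(17B20 22E40 22F30 65Fxx)",
        "(20-04 20H20)",
        "(05C25)",
    ],
})

# extractall returns one row per match, indexed by (original row, match number);
# [0] selects the single capture group, i.e. the two-character prefix.
prefixes = df["MSC Secondary"].str.extractall(r"[ (](\w{2})")[0]

# Translate each prefix, then collect the subjects back per original row.
subjects = prefixes.map(d).groupby(level=0).agg(list)

print(subjects)
```

Note that any prefix missing from d comes out of map as NaN, so it is worth checking the result for missing values before relying on it.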
Through a few chained pandas str methods:
    d = dict(...)
    df["MSC Secondary"] = (
        df["MSC Secondary"]
        .str.strip("()")
        .str.split()
        .explode()
        .str[:2]
        .map(d)
        .groupby(level=0)
        .agg(list)
    )

    #   MSC Primary                                      MSC Secondary
    # 0       05C25  [Combinatorics, Group theory and generalizations]
    # 1       20-04     [Group theory and generalizations, Statistics]
    # 2       13F20  [Nonassociative rings and algebras, Topologica...
    # 3       05Exx  [Group theory and generalizations, Group theor...
    # 4       20G40                                    [Combinatorics]
Here we use:
pandas.Series.str.strip to remove the parentheses
pandas.Series.str.split to split the substrings into lists
pandas.Series.explode to turn every element in each list into its own row
str[:2] to slice off the first two characters
map to map your linked dict
groupby(level=0).agg(list) to group the Series by index (level=0) and put them back into lists (.agg(list))
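If you would rather end up with a plain string such as "Combinatorics, Algebraic geometry" instead of a list, one possible variation is to aggregate with ", ".join, and to handle the primary column with the same slice-and-map idea. This is only a sketch: the dict d is an assumed four-entry excerpt, and the two-row frame is made up from the examples in the question.

```python
import pandas as pd

# Assumed excerpt of the MSC lookup, just enough for the two sample rows.
d = {
    "05": "Combinatorics",
    "13": "Commutative algebra",
    "14": "Algebraic geometry",
    "20": "Group theory and generalizations",
}

df = pd.DataFrame({
    "MSC Primary": ["05C25", "13F20"],
    "MSC Secondary": ["(05E15 14H50)", "(20-04 20H20)"],
})

# Primary column: one code per cell, so slice the prefix and map it directly.
df["MSC Primary"] = df["MSC Primary"].str[:2].map(d)

# Secondary column: same chain as above, but join the subjects into one string.
df["MSC Secondary"] = (
    df["MSC Secondary"]
    .str.strip("()")   # drop the surrounding parentheses
    .str.split()       # one list of codes per row
    .explode()         # one code per row, original index repeated
    .str[:2]           # keep the two-character prefix
    .map(d)            # prefix -> subject name
    .groupby(level=0)  # regroup by original row
    .agg(", ".join)    # "subject, subject, ..."
)

print(df)
```

The join only works when every prefix is present in d, because a missing key becomes NaN and cannot be joined as a string; with the full MSC dict that should not come up.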