
Extract specific symbols from pandas cells, then replace them with values from a dict where they are keys

My data looks like this (these are two of the columns):


These are MSC codes, which correspond to different areas of science. I need to replace each code with the corresponding subject from the MSC 2020 classification here: https://mathscinet.ams.org/msnhtml/msc2020.pdf. Some of the entries are:

  • 00 General and overarching topics; collections
  • 01 History and biography
  • 03 Mathematical logic and foundations
  • 05 Combinatorics

First I need to isolate the first two digits of each code: for instance, 05C25 should become 05, and in the second column (05E15 14H50) should become 05, 14.

Then I need each number replaced by the corresponding subject, for example 05, 14 should become Combinatorics, Algebraic geometry. This is all tricky for me because I am new to Python, and the second column has a different number of MSC codes in each cell, so I cannot use indexing there.

I know for the first column I can use indexing:

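A minimal sketch of that slicing, assuming a hypothetical column named `primary_MSC` and a dict `msc` holding a small excerpt of the classification (neither name comes from the original data):

```python
import pandas as pd

# Hypothetical sample data; the column name and values are assumptions.
df = pd.DataFrame({"primary_MSC": ["05C25", "14H50", "03E15"]})

# Small excerpt of the two-digit MSC classification, keyed by code prefix.
msc = {
    "03": "Mathematical logic and foundations",
    "05": "Combinatorics",
    "14": "Algebraic geometry",
}

# Slice the first two characters of each code, then look them up in the dict.
df["primary_subject"] = df["primary_MSC"].str[:2].map(msc)
print(df)
```

Any prefix that is missing from the dict comes back as NaN, which makes gaps in the mapping easy to spot.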

But this does not work for the other column, because it holds several secondary MSC codes and their number differs from cell to cell.

Thank you for your help, much appreciated.


Answer

Here are a couple of options:

Using a regex to get the first two chars of each “word”:


Using:

  • str.extractall: apply the regex [ (](\w{2}) to capture the first two characters of every word in each row
  • map: map the dict d over column [0] of the result, which holds the first (zero-eth) capture group
  • groupby(level=0).agg(list): group the Series by its original index (level=0) and collect each group's values back into a list (.agg(list))
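The three steps above can be sketched as follows. The column name `secondary_MSC`, the sample rows, and the contents of `d` are assumptions made for illustration, with the codes wrapped in parentheses as in the question's example:

```python
import pandas as pd

# Hypothetical sample data; each cell holds a varying number of codes.
df = pd.DataFrame(
    {"secondary_MSC": ["(05E15 14H50)", "(03E15)", "(05C25 14H50 03E15)"]}
)

# Small excerpt of the two-digit MSC classification.
d = {
    "03": "Mathematical logic and foundations",
    "05": "Combinatorics",
    "14": "Algebraic geometry",
}

subjects = (
    df["secondary_MSC"]
    .str.extractall(r"[ (](\w{2})")[0]  # one row per match; column 0 is the capture group
    .map(d)                             # two-character prefix -> subject name
    .groupby(level=0)                   # level 0 is the original row index
    .agg(list)                          # collect each row's subjects into a list
)

# Assignment aligns on the original index.
df["secondary_subjects"] = subjects
print(df)
```

Rows in which the regex finds no match do not appear in `subjects`, so they end up as NaN after the assignment.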

Through a few chained pandas str methods:

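One possible chain of this kind is sketched below. The particular methods chosen (`str.strip`, `str.split`, `explode`, `str[:2]`) are an assumption about how such a chain could look rather than the answer's exact code, and the column name `secondary_MSC` and the dict `d` are again hypothetical:

```python
import pandas as pd

# Hypothetical sample data; each cell holds a varying number of codes.
df = pd.DataFrame(
    {"secondary_MSC": ["(05E15 14H50)", "(03E15)", "(05C25 14H50 03E15)"]}
)

# Small excerpt of the two-digit MSC classification.
d = {
    "03": "Mathematical logic and foundations",
    "05": "Combinatorics",
    "14": "Algebraic geometry",
}

subjects = (
    df["secondary_MSC"]
    .str.strip("()")   # drop the surrounding parentheses
    .str.split()       # split on whitespace: one list of codes per row
    .explode()         # one code per row, original index repeated
    .str[:2]           # keep the two-character prefix
    .map(d)            # prefix -> subject name
    .groupby(level=0)  # regroup by the original row index
    .agg(list)         # collect each row's subjects into a list
)
print(subjects)
```

This variant avoids a regex entirely, at the cost of assuming that the codes in a cell are separated by whitespace.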

Here we use:

User contributions licensed under: CC BY-SA