pandas split values in column

I’m new to pandas (version 1.1.5) and have tried str.split() and str.extract() to split column POS of numerical values with no success. My dataframe is about 3000 lines and is structured like this (note _ and - delimiters in subset):

df.head()

      SAMPLE CHROM        POS REF ALT
1  Sample1     7    105121514       C       T
2  Sample2    17    7359940         C       A
3  Sample3     X    76777781        A       G
4  Sample4    16    70531965-70531965       C       G
5  Sample5     6    26093141-26093141       G       A
6  Sample6    12    11905465        C       T
7  Sample7     4    103527484_103527848       G       A

JavaScript
​x
 
df.head()
​
      SAMPLE CHROM        POS REF ALT
1  Sample1     7    105121514       C       T
2  Sample2    17    7359940         C       A
3  Sample3     X    76777781        A       G
4  Sample4    16    70531965-70531965       C       G
5  Sample5     6    26093141-26093141       G       A
6  Sample6    12    11905465        C       T
7  Sample7     4    103527484_103527848       G       A
​
​

I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):

      SAMPLE CHROM        POS REF ALT
1  Sample1     7    105121514       C       T
2  Sample2    17    7359940         C       A
3  Sample3     X    76777781        A       G
4  Sample4    16    70531965        C       G
5  Sample5     6    26093141        G       A
6  Sample6    12    11905465        C       T
7  Sample7     4    103527484       G       A

JavaScript
 
      SAMPLE CHROM        POS REF ALT
Sample1     7    105121514       C       T
Sample2    17    7359940         C       A
Sample3     X    76777781        A       G
Sample4    16    70531965        C       G
Sample5     6    26093141        G       A
Sample6    12    11905465        C       T
Sample7     4    103527484       G       A
​

My attempts have either split the rows only containing a delimiter and dropping all other rows, dropping all rows containing just the delimiters, or dropping all values.

For example, df['POS'] = df['POS'].str.replace(r'[-|_]d+', '') outputs:

      SAMPLE CHROM  POS REF ALT
1  Sample1     7    NaN   C   T
2  Sample2    17    NaN   C   A
3  Sample3     X    NaN   A   G
4  Sample4    16    NaN   C   G
5  Sample5     6    NaN   G   A
6  Sample6    12    NaN   C   T
7  Sample7     4    NaN   G   A

JavaScript
 
      SAMPLE CHROM  POS REF ALT
Sample1     7    NaN   C   T
Sample2    17    NaN   C   A
Sample3     X    NaN   A   G
Sample4    16    NaN   C   G
Sample5     6    NaN   G   A
Sample6    12    NaN   C   T
Sample7     4    NaN   G   A
​

Accepting the solution from @PaulS below as I needed to convert the column datatype from object to string first in order for str.replace() to work!

df.dtypes

SAMPLE    object
CHROM     object
POS       object
REF       object
ALT       object
dtype: object

df['POS'] = df['POS'].astype('str')
df['POS'] = df['POS'].str.replace(r'[-|_]d+', '')

JavaScript
 
df.dtypes
​
SAMPLE    object
CHROM     object
POS       object
REF       object
ALT       object
dtype: object
​
df['POS'] = df['POS'].astype('str')
df['POS'] = df['POS'].str.replace(r'[-|_]d+', '')
​

Answer

A possible solution, based on the idea of replacing all characters after _ or - (inclusive) with the empty string (''):

df['POS'] = df['POS'].str.replace(r'[-_]d+', '')

JavaScript
 
df['POS'] = df['POS'].str.replace(r'[-_]d+', '')
​

Output:

  CHROM        POS REF ALT
0     7  105121514   C   T
1    17    7359940   C   A
2     X   76777781   A   G
3    16   70531965   C   G
4     6   26093141   G   A
5    12   11905465   C   T
6     4  103527484   G   A

JavaScript
 
  CHROM        POS REF ALT
   7  105121514   C   T
  17    7359940   C   A
   X   76777781   A   G
  16   70531965   C   G
   6   26093141   G   A
  12   11905465   C   T
   4  103527484   G   A
​

Advertisement

Answer