pandas split values in column

I’m new to pandas (version 1.1.5) and have tried str.split() and str.extract() to split column POS of numerical values with no success. My dataframe is about 3000 lines and is structured like this (note _ and - delimiters in subset):

df.head()

      SAMPLE CHROM        POS REF ALT
1  Sample1     7    105121514       C       T
2  Sample2    17    7359940         C       A
3  Sample3     X    76777781        A       G
4  Sample4    16    70531965-70531965       C       G
5  Sample5     6    26093141-26093141       G       A
6  Sample6    12    11905465        C       T
7  Sample7     4    103527484_103527848       G       A

I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):

      SAMPLE CHROM        POS REF ALT
1  Sample1     7    105121514       C       T
2  Sample2    17    7359940         C       A
3  Sample3     X    76777781        A       G
4  Sample4    16    70531965        C       G
5  Sample5     6    26093141        G       A
6  Sample6    12    11905465        C       T
7  Sample7     4    103527484       G       A

My attempts have either split the rows only containing a delimiter and dropping all other rows, dropping all rows containing just the delimiters, or dropping all values.

For example, df['POS'] = df['POS'].str.replace(r'[-|_]d+', '') outputs:

      SAMPLE CHROM  POS REF ALT
1  Sample1     7    NaN   C   T
2  Sample2    17    NaN   C   A
3  Sample3     X    NaN   A   G
4  Sample4    16    NaN   C   G
5  Sample5     6    NaN   G   A
6  Sample6    12    NaN   C   T
7  Sample7     4    NaN   G   A

Accepting the solution from @PaulS below as I needed to convert the column datatype from object to string first in order for str.replace() to work!

df.dtypes

SAMPLE    object
CHROM     object
POS       object
REF       object
ALT       object
dtype: object

df['POS'] = df['POS'].astype('str')
df['POS'] = df['POS'].str.replace(r'[-|_]d+', '')

Answer

A possible solution, based on the idea of replacing all characters after _ or - (inclusive) with the empty string (''):

df['POS'] = df['POS'].str.replace(r'[-_]d+', '')

Output:

  CHROM        POS REF ALT
0     7  105121514   C   T
1    17    7359940   C   A
2     X   76777781   A   G
3    16   70531965   C   G
4     6   26093141   G   A
5    12   11905465   C   T
6     4  103527484   G   A

Advertisement

Answer