I’m new to pandas (version 1.1.5) and have tried str.split() and str.extract() to split column POS of numerical values with no success. My dataframe is about 3000 lines and is structured like this (note _ and - delimiters in subset):
df.head()
      SAMPLE CHROM        POS REF ALT
1  Sample1     7    105121514       C       T
2  Sample2    17    7359940         C       A
3  Sample3     X    76777781        A       G
4  Sample4    16    70531965-70531965       C       G
5  Sample5     6    26093141-26093141       G       A
6  Sample6    12    11905465        C       T
7  Sample7     4    103527484_103527848       G       A
I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):
SAMPLE CHROM POS REF ALT 1 Sample1 7 105121514 C T 2 Sample2 17 7359940 C A 3 Sample3 X 76777781 A G 4 Sample4 16 70531965 C G 5 Sample5 6 26093141 G A 6 Sample6 12 11905465 C T 7 Sample7 4 103527484 G A
My attempts have either split the rows only containing a delimiter and dropping all other rows, dropping all rows containing just the delimiters, or dropping all values.
For example, df['POS'] = df['POS'].str.replace(r'[-|_]d+', '') outputs:
SAMPLE CHROM POS REF ALT 1 Sample1 7 NaN C T 2 Sample2 17 NaN C A 3 Sample3 X NaN A G 4 Sample4 16 NaN C G 5 Sample5 6 NaN G A 6 Sample6 12 NaN C T 7 Sample7 4 NaN G A
Accepting the solution from @PaulS below as I needed to convert the column datatype from object to string first in order for str.replace() to work!
df.dtypes
SAMPLE    object
CHROM     object
POS       object
REF       object
ALT       object
dtype: object
df['POS'] = df['POS'].astype('str')
df['POS'] = df['POS'].str.replace(r'[-|_]d+', '')
Advertisement
Answer
A possible solution, based on the idea of replacing all characters after _ or - (inclusive) with the empty string (''):
df['POS'] = df['POS'].str.replace(r'[-_]d+', '')
Output:
CHROM POS REF ALT 0 7 105121514 C T 1 17 7359940 C A 2 X 76777781 A G 3 16 70531965 C G 4 6 26093141 G A 5 12 11905465 C T 6 4 103527484 G A