I’m new to pandas (version 1.1.5) and have tried str.split()
and str.extract()
to split column POS
of numerical values with no success. My dataframe is about 3000 lines and is structured like this (note _
and -
delimiters in subset):
df.head() SAMPLE CHROM POS REF ALT 1 Sample1 7 105121514 C T 2 Sample2 17 7359940 C A 3 Sample3 X 76777781 A G 4 Sample4 16 70531965-70531965 C G 5 Sample5 6 26093141-26093141 G A 6 Sample6 12 11905465 C T 7 Sample7 4 103527484_103527848 G A
I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):
SAMPLE CHROM POS REF ALT 1 Sample1 7 105121514 C T 2 Sample2 17 7359940 C A 3 Sample3 X 76777781 A G 4 Sample4 16 70531965 C G 5 Sample5 6 26093141 G A 6 Sample6 12 11905465 C T 7 Sample7 4 103527484 G A
My attempts have either split the rows only containing a delimiter and dropping all other rows, dropping all rows containing just the delimiters, or dropping all values.
For example, df['POS'] = df['POS'].str.replace(r'[-|_]d+', '')
outputs:
SAMPLE CHROM POS REF ALT 1 Sample1 7 NaN C T 2 Sample2 17 NaN C A 3 Sample3 X NaN A G 4 Sample4 16 NaN C G 5 Sample5 6 NaN G A 6 Sample6 12 NaN C T 7 Sample7 4 NaN G A
Accepting the solution from @PaulS below as I needed to convert the column datatype from object to string first in order for str.replace()
to work!
df.dtypes SAMPLE object CHROM object POS object REF object ALT object dtype: object df['POS'] = df['POS'].astype('str') df['POS'] = df['POS'].str.replace(r'[-|_]d+', '')
Advertisement
Answer
A possible solution, based on the idea of replacing all characters after _
or -
(inclusive) with the empty string (''
):
df['POS'] = df['POS'].str.replace(r'[-_]d+', '')
Output:
CHROM POS REF ALT 0 7 105121514 C T 1 17 7359940 C A 2 X 76777781 A G 3 16 70531965 C G 4 6 26093141 G A 5 12 11905465 C T 6 4 103527484 G A