Concat string in column values where it is missing…

I have a dataframe

import pandas as pd
data_as_dict={'CHROM': {2241743: 21, 2241744: 3, 2241745: 5, 2241746: 6, 2241747: 6, 2241748: 11, 2241749: 16, 2241750: 20, 0: 'chr1', 1: 'chr2', 2: 'chr3', 3: 'chr5', 4: 'chr5', 5: 'chr8', 6: 'chr9', 7: 'chr10', 8: 'chr14', 9: 'chr14', 10: 'chr15', 11: 'chr15'}, 'POS_GRCh38': {2241743: 41668857, 2241744: 189487243, 2241745: 33946466, 2241746: 396321, 2241747: 32641081, 2241748: 89284793, 2241749: 89960104, 2241750: 34077942, 0: '233276815', 1: '217427435', 2: '169800667', 3: '1279675', 4: '112150207', 5: '32575278', 6: '97775520', 7: '103934543', 8: '36063370', 9: '36269155', 10: '67165147', 11: '67163292'}, 'REF': {2241743: 'T', 2241744: 'A', 2241745: 'A', 2241746: 'C', 2241747: 'C', 2241748: 'G', 2241749: 'T', 2241750: 'G', 0: 'G', 1: 'G', 2: 'G', 3: 'C', 4: 'T', 5: 'T', 6: 'C', 7: 'C', 8: 'C', 9: 'C', 10: 'G', 11: 'C'}, 'Effect_allele': {2241743: 'A', 2241744: 'T', 2241745: 'G', 2241746: 'T', 2241747: 'T', 2241748: 'A', 2241749: 'C', 2241750: 'A', 0: 'A', 1: 'C', 2: 'T', 3: 'T', 4: 'A', 5: 'G', 6: 'A', 7: 'T', 8: 'G', 9: 'T', 10: 'C', 11: 'T'}, 'Effect_size': {2241743: 0.094310679, 2241744: 0.122217633, 2241745: 0.527632769, 2241746: 0.482426149, 2241747: 0.157003749, 2241748: 0.148420005, 2241749: 0.285178942, 2241750: 0.2390169, 0: 0.2776317365982795, 1: 0.3576744442718159, 2: 0.2070141693843261, 3: 0.1823215567939546, 4: 0.3148107398400336, 5: 0.2776317365982795, 6: 0.5247285289349821, 7: 0.3435897043900768, 8: 0.3293037471426003, 9: 0.5933268452777344, 10: 0.2070141693843261, 11: 0.2151113796169455}, 'TYPE': {2241743: 'Basal_cell_carcinoma_PRSWeb', 2241744: 'Squamous_cell_carcinoma_PRSWeb', 2241745: 'Squamous_cell_carcinoma_PRSWeb', 2241746: 'Squamous_cell_carcinoma_PRSWeb', 2241747: 'Squamous_cell_carcinoma_PRSWeb', 2241748: 'Squamous_cell_carcinoma_PRSWeb', 2241749: 'Squamous_cell_carcinoma_PRSWeb', 2241750: 'Squamous_cell_carcinoma_PRSWeb', 0: 'THYROID_PGS', 1: 'THYROID_PGS', 2: 'THYROID_PGS', 3: 'THYROID_PGS', 4: 'THYROID_PGS', 5: 'THYROID_PGS', 6: 'THYROID_PGS', 7: 'THYROID_PGS', 8: 'THYROID_PGS', 9: 'THYROID_PGS', 10: 'THYROID_PGS', 11: 'THYROID_PGS'}, 'Cancer': {2241743: 'NMSC', 2241744: 'NMSC', 2241745: 'NMSC', 2241746: 'NMSC', 2241747: 'NMSC', 2241748: 'NMSC', 2241749: 'NMSC', 2241750: 'NMSC', 0: 'THYROID', 1: 'THYROID', 2: 'THYROID', 3: 'THYROID', 4: 'THYROID', 5: 'THYROID', 6: 'THYROID', 7: 'THYROID', 8: 'THYROID', 9: 'THYROID', 10: 'THYROID', 11: 'THYROID'}, 'Significant_YN': {2241743: 'Y', 2241744: 'Y', 2241745: 'Y', 2241746: 'Y', 2241747: 'Y', 2241748: 'Y', 2241749: 'Y', 2241750: 'Y', 0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y', 5: 'Y', 6: 'Y', 7: 'Y', 8: 'Y', 9: 'Y', 10: 'Y', 11: 'Y'}}

all_cancers = pd.DataFrame.from_dict(data_as_dict)

I want to append string chr in column CHROM where it’s missing. I can do it in R with grepl and paste, but wanted to try in Python. I came up with these two commands, but not sure how to index the column because pd.Series is generating NaNs.

pd.Series(all_cancers['CHROM']).str.contains(pat = 'chr', regex = True)
"chr" + all_cancers['CHROM'].map(str)

Answer

String operations in pandas are not optimized, so the best way to do what you want is via a list comprehension.

# check if a value contains 'chr' and prepend it if not
all_cancers['CHROM'] = [x if isinstance(x, str) and 'chr' in x else f"chr{x}" for x in all_cancers['CHROM'].tolist()]

An equivalent pandas operation could be via mask() method.

# flag rows that starts with 'chr' and prepend 'chr' to values in the remaining rows 
all_cancers['CHROM'] = all_cancers['CHROM'].mask(~all_cancers['CHROM'].str.startswith('chr', na=False), 'chr'+all_cancers['CHROM'].astype(str))

A simple timeit test would show that the list comp is much faster.

Concat string in column values where it is missing in Python

Advertisement

Answer