I have an xlsx file containing a lot of data. The data contains duplicate values in a column named UniversalIDS, which I want to replace with randomly generated IDs using Pandas.
So far I've tried several approaches I found by googling, but none of them worked. For example, I tried this:
import pandas as pd
import uuid

df = pd.read_excel('C:/Users/Nervous/Downloads/ss.xlsx')
df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = uuid.uuid4()
df.to_excel("C:/Users/Nervous/Downloads/tt.xlsx")
print("done!")
I also tried other alternatives I found on this site, for example:
df2 = df.assign(UniversalIDS=df['UniversalIDS'].where(
    ~df.duplicated(['UniversalIDS']), uuid.uuid4()))
This didn't work either:
df.loc[df["UniversalIDS"].duplicated(), "test"]
Here is a snippet of the xlsx data:
UniversalIDS
f6112cd7-0868-4cc9-b5ab-d7381cc23bdf
f3e75641-f328-429f-ae32-41399d8a0bf0
08dccc5c-5774-4614-925c-ad9373821075
79a8ebed-154c-47c7-b16d-cbba5d8469eb
396f8e63-1950-4c36-9524-c1dec8728ffd
62cba3bd-a855-4141-8460-7ff838ecea62
b7f4f753-b479-413a-abcd-34d08436eb85
c0fd6e61-edb1-4dce-94ac-7d0529860e1f
c42f8c98-c285-4552-af9f-2f4d8e89b9e8
8cb77021-eb4f-4cfa-a4a3-e649f6a53c03
cb7f4b8d-976a-4481-919e-c8ff7d384cc6
e15fd2bb-5409-4a8b-9fdc-bf1e5862db58
27b97893-aae7-4a9a-aae1-0fc21a099209
1abc2c2f-94f2-4546-b912-b85bc4ed0cb8
6bf264fb-1b82-48e3-966a-14e48a61a63e
9653faeb-7b3d-408e-93e3-bc729f259c75
a09f3eb6-0059-4a77-bf2f-4f7436508ba8
65e06948-2e6c-413f-a768-c3faf8108a6c
291ff491-4ff0-4fb2-b095-b3e66f2d7ca0
653535c7-0389-4077-8e72-3835fbd72d4d
61408fc8-4f45-48e0-b83a-40b6bfd76ad5
3ae8d547-bf4b-42ac-b083-a1662f1a5c82
4955c673-c5da-464c-8e14-a897df0774eb
a39bad90-5235-4679-945e-534bb47b8347
264a1f6e-adf4-45a7-b1b1-e6f3fc073447
a855025b-ee84-46d5-aedb-cbac9a5b1920
71b16a5b-3f6d-4d30-8a65-203959fe87a2
4f3f86f2-4e61-475a-bc1d-eb2112f23953
59da45de-c192-4885-8a55-9138ca49b33a
8f41df73-d9dc-4663-9f64-d090d7c5ca77
84f7103f-e9de-444f-b046-c02d75af0ed1
2738f733-7438-494c-9368-5fb700df93d1
777a3cd7-19ae-4181-b91d-9be8eaf30523
b6083731-a43e-4b5a-ac9a-94a3202103e7
f22873c1-6811-4025-8f0d-47d72d49e499
f262c369-f44a-4b90-8219-d29b33bc14e8
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
(the value ea0d26f5-d8c4-4082-983a-8eeea29c6c54 repeats for every remaining row of the snippet)
As can be seen above, there are duplicate values in the UniversalIDS column. It is also worth mentioning that the data contains other columns, but I cut them out and kept only the problem-causing column for simplicity.
So my question is: how can I replace the duplicate values in the UniversalIDS column with new unique IDs?
Answer
Your expression:

df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = uuid.uuid4()

is valid Python, but uuid.uuid4() is evaluated once, before the assignment, so the same single uuid is broadcast to every selected row, which means those rows will still be duplicates of one another after it's been executed. You should instead create a Series with a distinct uuid for each duplicated row, aligned to those rows by index:

mask = df.duplicated(subset=["UniversalIDS"])
df.loc[mask, "UniversalIDS"] = pd.Series(
    [str(uuid.uuid4()) for _ in range(mask.sum())],  # one fresh uuid per duplicated row
    index=df.index[mask],  # align the new values with the duplicated rows
)

Converting each uuid with str() keeps the new values consistent with the existing string IDs in the column.
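For completeness, here is a minimal end-to-end sketch of the whole workflow. It assumes the same file paths as the question and that pandas plus an Excel engine such as openpyxl are installed; a plain list works on the right-hand side as long as its length matches the number of masked rows:

import uuid
import pandas as pd

# Read the workbook (path taken from the question).
df = pd.read_excel("C:/Users/Nervous/Downloads/ss.xlsx")

# Mark every occurrence of an ID after the first one as a duplicate.
mask = df.duplicated(subset=["UniversalIDS"])

# Generate one fresh uuid string per duplicated row.
df.loc[mask, "UniversalIDS"] = [str(uuid.uuid4()) for _ in range(mask.sum())]

# Sanity check: every ID should now be unique.
assert df["UniversalIDS"].is_unique

# index=False avoids writing the row index as an extra column.
df.to_excel("C:/Users/Nervous/Downloads/tt.xlsx", index=False)
print("done!")

If the first occurrence of each duplicated ID should be replaced as well, use df.duplicated(subset=["UniversalIDS"], keep=False), which marks every occurrence instead of all but the first.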