I have a dataset with around 40,000 rows each representing a record in dataset. One of the features named ‘region_code’ is categorical in nature but is represented using integer. It is similar to pincode/zipcode. There are around 5316 unique ‘region_code’ values and these Region_Codes start from 1 and go upto 5690. That means, range is [1,5690]. I want to reassign those values such that first 20 region codes that is all the rows where region code lies in the range [1,20] will be assigned region code as ‘1’, next batch of region codes ie. [21,40] will be assigned region code of ‘2’, next batch of region codes i.e. [41,60] will be assigned region code of ‘3’ and so on. Last batch of 20 region codes i.e. 5681 to 5700 will have value ‘285’ (5700//20).
I can do this using if-else, but then I will have to write 285 if-else conditions each representing one condition for one batch of 20 region codes, but it is not the right approach as it will be too much manual work. I need a short and succinct code for this.
To simulate the problem so that I you can write code for it, I have created a small dataframe with region codes from 1 to 50. Here, let us group it into batches of 5. So, first 5 region codes will get value ‘1’, next 5 region codes will get value ‘2’ and so on till last batch of region codes which will get value of ’10’.
Region_Code = np.arange(1,51) pd.DataFrame(Region_Code, columns =['Region_Code'])
Exprected output will look like the one created by code below :
transformed = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7,7,8,8,8,8,8,9,9,9,9,9,10,10,10,10,10] pd.DataFrame(transformed, columns=['Region_Code_new'])
I have manually created that list to give you a glimpse of how output would look like.
In our original question we have to do batches of 20 region codes each and therefore there will be 285 such batches. My question is how to do this using for loop or some similar logic ?
Advertisement
Answer
You can just floor divide
the column with 5 (20 in your original dataset):
>>> Region_Code = np.arange(1,51) >>> pd.DataFrame(Region_Code, columns =['Region_Code']) >>> df.assign(Region_code_new=(df.Region_Code.sub(1) // 5) + 1) Region_Code Region_code_new 0 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 6 2 6 7 2 7 8 2 8 9 2 9 10 2 10 11 3 11 12 3 12 13 3 13 14 3 14 15 3 15 16 4 16 17 4 17 18 4 18 19 4 19 20 4 20 21 5 21 22 5 22 23 5 23 24 5 24 25 5 25 26 6 26 27 6 27 28 6 28 29 6 29 30 6 30 31 7 31 32 7 32 33 7 33 34 7 34 35 7 35 36 8 36 37 8 37 38 8 38 39 8 39 40 8 40 41 9 41 42 9 42 43 9 43 44 9 44 45 9 45 46 10 46 47 10 47 48 10 48 49 10 49 50 10