Skip to content
Advertisement

CSV data preprocess

I have a .csv file like this format

enter image description here

Then I want to convert it to

enter image description here

How can I do it with python pandas

Thank you

Advertisement

Answer

If you load it then you will have DataFrame like

      Y  M  1  2  3
0  2019  1  A  E  H
1  2020  2  B  F  I
2  2021  3  C  G  J

Set multi-index usinig year and month

df = df.set_index(['Y','M'])
        1  2  3
Y    M         
2019 1  A  E  H
2020 2  B  F  I
2021 3  C  G  J

Reshape it using stack()

df = df.stack()
Y     M   
2019  1  1    A
         2    E
         3    H
2020  2  1    B
         2    F
         3    I
2021  3  1    C
         2    G
         3    J
dtype: object

You can add name to column with day

df.index.set_names(['Y', 'M', 'D'], inplace=True)
Y     M  D
2019  1  1    A
         2    E
         3    H
2020  2  1    B
         2    F
         3    I
2021  3  1    C
         2    G
         3    J
dtype: object

Reset index to get normal columns

df = df.reset_index()
      Y  M  D  0
0  2019  1  1  A
1  2019  1  2  E
2  2019  1  3  H
3  2020  2  1  B
4  2020  2  2  F
5  2020  2  3  I
6  2021  3  1  C
7  2021  3  2  G
8  2021  3  3  J

Create column with dates

df['date'] = df.apply(lambda row: "{}/{}/{}".format(row['D'], row['M'], row['Y']), axis=1)
      Y  M  D  0      date
0  2019  1  1  A  1/1/2019
1  2019  1  2  E  2/1/2019
2  2019  1  3  H  3/1/2019
3  2020  2  1  B  1/2/2020
4  2020  2  2  F  2/2/2020
5  2020  2  3  I  3/2/2020
6  2021  3  1  C  1/3/2021
7  2021  3  2  G  2/3/2021
8  2021  3  3  J  3/3/2021

You can remove year, month, day

df.drop(['Y', 'M','D'], axis=1, inplace=True)
   0      date
0  A  1/1/2019
1  E  2/1/2019
2  H  3/1/2019
3  B  1/2/2020
4  F  2/2/2020
5  I  3/2/2020
6  C  1/3/2021
7  G  2/3/2021
8  J  3/3/2021

You can also rename column

df.rename(columns={0:'value'}, inplace=True)
  value      date
0     A  1/1/2019
1     E  2/1/2019
2     H  3/1/2019
3     B  1/2/2020
4     F  2/2/2020
5     I  3/2/2020
6     C  1/3/2021
7     G  2/3/2021
8     J  3/3/2021

And you can change order of columns

df = df[['date', 'value']]
       date value
0  1/1/2019     A
1  2/1/2019     E
2  3/1/2019     H
3  1/2/2020     B
4  2/2/2020     F
5  3/2/2020     I
6  1/3/2021     C
7  2/3/2021     G
8  3/3/2021     J

Minimal working code

import pandas as pd

data = {
    'Y': [2019, 2020, 2021], 
    'M': [1,2,3], 
    '1': ['A','B','C'],
    '2': ['E','F','G'],
    '3': ['H','I','J'],
}

df = pd.DataFrame(data)
print(df)

df = df.set_index(['Y','M'])
print(df)

df = df.stack()
print(df)

df.index.set_names(['Y', 'M', 'D'], inplace=True)
print(df)

df = df.reset_index()
print(df)

df['date'] = df.apply(lambda row: "{}/{}/{}".format(row['D'], row['M'], row['Y']), axis=1)
print(df)

df.drop(['Y', 'M','D'], axis=1, inplace=True)
print(df)

df.rename(columns={0:'value'}, inplace=True)
print(df)

df = df[['date', 'value']]
print(df)
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement