How to extract a string from one column and save it in a new column in pandas dataframe?

This is my dataframe:

repository,sha1,url,refactorings
repo1,1,url1,"[{'type': 'Add Parameter', 'description': 'Add Parameter id : String in method public IssueFilter(repository Repository, id String) from class com.github.pockethub.android.core.issue.IssueFilter', 'leftSideLocations': [{'filePath': 'path2'}]]
repo2,2,url2,"[{'type': 'Add Parameter', 'description': 'Add Parameter id : String in method public IssueFilter(repository Repository, id String) from class com.github.pockethub.android.core.issue.IssueFilter', 'leftSideLocations': [{'filePath': 'path2'}]]


From the refactorings column I want to extract two values: Add Parameter, which is the type, and com.github.pockethub.android.core.issue.IssueFilter, which is the text that follows "from class". Each value should go into its own new column, and the refactorings column should then be deleted.

The desired dataframe is:

repository,sha1,url,refac, class
repo1,1,url1,Add Parameter, com.github.pockethub.android.core.issue.IssueFilter
repo2,2,url2,Add Parameter, com.github.pockethub.android.core.issue.IssueFilter


This is my code:

df= pd.read_csv('data.csv', sep=',')

df1 = df[['sha1','url','refactorings']]
df1['refac']=df.refactorings.str.extract(r'[C|c]lass\s*([^ ]*)')
df1['class']=df.refactorings.str.extract(r"type':'s*([^ ]*)")
del df1['refactorings']
a=df1.loc[~df1.sha1.duplicated(keep='last')]

list=[]
for elm in a['sha1']:
    list.append(elm)
dicts = {key: d for key, d in df.groupby('sha1')}
lenght=len(list)
for i in range(lenght):
    output1="output"+str(i)+".csv"
    a=dicts[list[i]]
    m=pd.DataFrame.from_dict(a) 
    m.to_csv(output1, index=False, na_rep='NaN')


It did not extract refac and class correctly: for refac it returned 'Add, and for class it returned com.github.pockethub.android.core.issue.IssueFilter', (with the trailing quote and comma). The output also had no new columns, and the refactorings column was not deleted.

Answer

Use a regular expression with str.extract():

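# Treat the column as plain text so the .str accessor works on every row
obj = df['refactorings'].astype(str)

# Capture the text between the quotes that follow 'type':
df['refac'] = obj.str.extract(r"'type': '(.*?)'", expand=False)
# Capture the class name between "from class " and the next quote
df['class'] = obj.str.extract(r"from class (.*?)'", expand=False)

# Keep only the wanted columns; assigning the result drops refactorings
df = df[['repository', 'sha1', 'url', 'refac', 'class']]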
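
As a rough, self-contained sketch of the same approach, the snippet below builds a two-row frame shaped like the sample in the question and applies the two patterns. The frame is built by hand instead of being read from data.csv, the sample string is assumed to be a well-formed version of the row shown in the question, and the names refactoring, text and result are made up for illustration.

```python
import pandas as pd

# Hypothetical sample value, modelled on the row shown in the question
refactoring = (
    "[{'type': 'Add Parameter', "
    "'description': 'Add Parameter id : String in method public "
    "IssueFilter(repository Repository, id String) from class "
    "com.github.pockethub.android.core.issue.IssueFilter', "
    "'leftSideLocations': [{'filePath': 'path2'}]}]"
)

df = pd.DataFrame({
    "repository": ["repo1", "repo2"],
    "sha1": [1, 2],
    "url": ["url1", "url2"],
    "refactorings": [refactoring, refactoring],
})

text = df["refactorings"].astype(str)

# The lazy group (.*?) stops at the first closing quote, so spaces inside
# the value are kept; expand=False returns a Series for a single group
df["refac"] = text.str.extract(r"'type': '(.*?)'", expand=False)
df["class"] = text.str.extract(r"from class (.*?)'", expand=False)

# drop() returns a new frame, so the result has to be assigned
result = df.drop(columns="refactorings")
print(result.to_csv(index=False))
```

The pattern [^ ]* from the question stops at the first space, which is consistent with getting 'Add instead of Add Parameter. One more thing that may explain the missing columns: the per-sha1 files in the question are written from df.groupby('sha1'), while the new columns are added to df1, so grouping the processed frame instead should carry them into the output files.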



Answer