I am able to traverse the files in a directory with my script, but I am unable to apply the same logic when all the transcriptions are in a single table/dataframe. My earlier script –
import os
from glob import glob
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

files = glob('C:/Users/jj/Desktop/Bulk_Wav_Completed_CancelsvsSaves/*.csv')
sid = SentimentIntensityAnalyzer()

# use a dict comprehension to apply the analysis to each file
data = {
    os.path.basename(file): sid.polarity_scores(
        ' '.join(pd.read_csv(file, encoding="utf-8")['transcript'])
    )
    for file in files
}

# create a data frame from the dictionary above
df = pd.DataFrame.from_dict(data, orient='index')
df.to_csv("sentimentcancelvssaves.csv")
How do I apply the above to the table below?
dfo
Out[52]:
            InteractionId           Agent         Transcript
0      100392327420210105    David Michel     hi how are you
1      100392327420210105    David Michel  yes i am not fine
2      100390719220210104  Mindy Campbell            .,xyz..
3      100390719220210104  Mindy Campbell                 no
4      100390719220210104  Mindy Campbell              maybe
...                   ...             ...                ...
93407  300390890320200915  Sandra Yacklin                ...
93408  300390890320200915  Sandra Yacklin                ...
93409  300390890320200915  Sandra Yacklin                ...
As you can see, I have an InteractionId column that identifies each interaction. I want my final data set to give me one row per id, together with the polarity scores of the sentiment attached to that id.
Desired output for 100390719220210104 –
            InteractionId           Agent    Transcript  Positive  Compound
2      100390719220210104  Mindy Campbell  xyz no maybe     0.190    0.5457
How can I do this for every interaction id? I was able to do it when I had to apply my script to all the transcript csvs in a directory and iterate through them. However, how can I apply that to a dataframe where all the data is in one place rather than in separate csvs?
Answer
So rather than looping through the files, you are looping through the unique InteractionIds. You can get that using: for interaction_id in dfo['InteractionId'].unique()
And then you join the Transcript values for that ID, which you can get with:
' '.join(dfo[dfo['InteractionId'] == interaction_id]['Transcript'])
Putting it together you have:
import os
from glob import glob
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

dfo = pd.DataFrame(
    data={
        'InteractionId': [
            100392327420210105,
            100390719220210104,
            100390719220210104,
            100390719220210104,
        ],
        'Transcript': ['hi how are you', '.,xyz..', 'no', 'maybe'],
    }
)

sid = SentimentIntensityAnalyzer()

# use a dict comprehension to apply the analysis to each interaction
data = {
    interaction_id: sid.polarity_scores(
        ' '.join(dfo[dfo['InteractionId'] == interaction_id]['Transcript'])
    )
    for interaction_id in dfo['InteractionId'].unique()
}

# create a data frame from the dictionary above
df = pd.DataFrame.from_dict(data, orient='index')
df.to_csv("sentimentcancelvssaves.csv")
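As a follow-up, if you also want one row per InteractionId that keeps the Agent and matches the column names in your desired output, a groupby-based variant might look like the sketch below. It assumes dfo is the full frame from the question (including the Agent column, with one agent per interaction) and simply renames VADER's pos/compound keys to Positive/Compound; adapt as needed.

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# one row per InteractionId: keep the first Agent and join the transcripts with spaces
grouped = dfo.groupby('InteractionId', as_index=False).agg(
    Agent=('Agent', 'first'),
    Transcript=('Transcript', ' '.join),
)

# score each joined transcript and expand the returned dict into columns (neg/neu/pos/compound)
scores = grouped['Transcript'].apply(sid.polarity_scores).apply(pd.Series)

# attach the scores, renaming the keys to match the desired output
result = grouped.join(scores.rename(columns={'pos': 'Positive', 'compound': 'Compound'}))
result.to_csv("sentimentcancelvssaves.csv", index=False)

This also avoids re-filtering the whole frame once per id inside the loop, which should help on a frame with ~93k rows like yours.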