I am working on the following code, which takes as input a list, scrap, containing a few phrases:
scrap= ['Mutagenesis screens define conserved functions of metabolism and longevity', 'EK Bharath Shrestha Bharat(EBSB) - 100 commonly used sentences and their translations in 22 languages - P & D', 'OEB Special Seminar: “Phylogenetics and phylogenomics of Lentinula and the origin of cultivated shiitake mushrooms”', 'Student Exchange programme (Autumn Semester 2022) in University of Skovde, Sweden - CIR - Last Date: 04.03.2022', 'Ontario Institute for Studies in Education', 'Q Quest 2022 - AU TVS CQM - Last Date: 01.03.2022', 'National Conference on "Present Innovation Approaches and Paradigm in Physical Education"', 'Mahatma Gandhi University Newsletter ‘Insider’-Published.', 'STAGE Seminar', 'BOSM', 'Faculty of Law', 'UNIVERSITY UNION ELECTION 2019-20', 'Keynote Lecture: Sustainability for Africa: ...', 'Hillary Chute, "Maus Now: Spiegelman’s...', 'Conference on ‘Sustainable agriculture and farmers empowerment’ during 16th and 17th March 2021.', 'MIT Probability Seminar', 'Name of Programme', '49th All India Conference of Dravidian Linguists', 'Grad College Social Hour (GC common lounge)', 'MIT Symphony Orchestra: Márquez, Sarasate, and...', 'SCSB Colloquium Series: Etiology and impact of...', 'Celebration of National Science Day on 28th February 2022 - Dept. of Physics', 'PICASSO Tie-dye Event', 'Lunch & Learn with Muslim Life Program', '2022 Koch Institute Image Awards', 'Ideas & Images: The Power of Visual...', '30 Minutes Towards Better Bibliographies and Footnotes! (online)', 'Virtual Workshop on "Flight to a Bright Career-Enhance your Personality"', '4th Disaster Risk and Vulnerability Conference organised by SES scheduled on Oct 9-10 & 16-17.', 'French Education Fair 2022 organized by Campus France - CIR']
Now I want every phrase in scrap that contains a word from prog_list to be appended to TRUE_PROG:
prog_list=['writing', 'cryptography', 'recoding', 'decoding', 'program', 'code', 'planning', 'programming', 'encoding', 'gull', 'scheduling', 'tease', 'program', 'code']
TRUE_PROG =[]
I wrote some simple code using loops, but it produces output that I didn’t expect:
PROGRAM CODE:
import string

TRUE_PROG=[]
MIS_PROG=[]
c_list = []
p = string.punctuation
punc = list(p)
for i in scrap:
    # print(i)
    words_in_scrap = i.split()
    for j in words_in_scrap:
        words = j.lower()
        for k in words:
            # print(k)
            if k in punc:
                words = words.replace(k ," ")
        #CLEANSED DATA
        clean = words
        # print("clean=",clean)
        c_list.append(clean)
        # print("c_list=:",c_list)
        for c in c_list:
            if c ==" ":
                c_list.remove(c)
        # print("c_list cleaned of spaces=",c_list)
    for t in c_list:
        if t in prog_list:
            TRUE_PROG.append(i)
            #print("\ni=",i,"due to t=",t)
        else:
            MIS_PROG.append(i)
# print("\n\nPROG=",set(TRUE_PROG),"\n\n\n MIS_PROG=", set(MIS_PROG),"\n")

If you uncomment #print("\ni=",i,"due to t=",t), you’ll find that some phrases that don’t contain any of those words were also appended. It gives me this:
i= Lunch & Learn with Muslim Life Program due to t= program
i= 2022 Koch Institute Image Awards due to t= program
i= Ideas & Images: The Power of Visual... due to t= program
i= 30 Minutes Towards Better Bibliographies and Footnotes! (online) due to t= program
i= Virtual Workshop on "Flight to a Bright Career-Enhance your Personality" due to t= program
and so on. Apart from the first one, none of these phrases contains the word “program”, yet they were all added. Any correction would be highly appreciated. Thanks!
Answer
ps = list(set(prog_list))
for p in ps:
    for s in scrap:
        words = s.split()
        for w in words:
            if p == w.lower():
                r = s+f" - due to the word {p}"
                TRUE_PROG.append(r)
print(TRUE_PROG)
Output:
['Lunch & Learn with Muslim Life Program - due to the word program']
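
As for why the code in the question misfires: c_list is created once, before the loop over scrap, and is never emptied. Once the "Lunch & Learn" phrase has put "program" into it, that word is still there when every later phrase is checked, so they all match. If you also want to keep the punctuation cleanup and the MIS_PROG list, something along these lines might work. It is only a sketch: the function name split_by_keywords and the short sample list are made up for illustration, and it assumes a phrase should be reported once no matter how many keywords it contains.

```python
import string


def split_by_keywords(phrases, keywords):
    """Return (matched, unmatched) lists of phrases (hypothetical helper)."""
    keyword_set = set(keywords)  # drops duplicates and gives fast lookups
    # Map each ASCII punctuation character to a space,
    # so "Career-Enhance" becomes "career enhance".
    to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    matched, unmatched = [], []
    for phrase in phrases:
        # Built fresh for every phrase, so nothing leaks in from earlier ones.
        words = set(phrase.lower().translate(to_space).split())
        if words & keyword_set:
            matched.append(phrase)
        else:
            unmatched.append(phrase)
    return matched, unmatched


# Small illustrative sample taken from the question's scrap list.
sample = [
    "Lunch & Learn with Muslim Life Program",
    "2022 Koch Institute Image Awards",
    "MIT Probability Seminar",
]
true_prog, mis_prog = split_by_keywords(sample, ["program", "code", "program"])
print(true_prog)  # ['Lunch & Learn with Muslim Life Program']
print(mis_prog)   # ['2022 Koch Institute Image Awards', 'MIT Probability Seminar']
```

Note that string.punctuation only covers ASCII punctuation, so curly quotes such as ‘ ’ and “ ” in the scraped phrases would stay attached to their words unless you handle them separately.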