I’m trying to scrape jpg images for each product; the product urls are saved in a csv file. The image links are available in json data, so I try to access the json key values. When I run the code, it returns all the key values instead of just the image url links, and second, my code only scrapes the last product url from the csv instead of all of them.
{'name': {'b': {'src': {'xs': 'https://ctl.s6img.com/society6/img/xVx1vleu7iLcR79ZkRZKqQiSzZE/w_125/artwork/~artwork/s6-0041/a/18613683_5971445', 'lg': 'https://ctl.s6img.com/society6/img/W-ESMqUtC_oOEUjx-1E_SyIdueI/w_550/artwork/~artwork/s6-0041/a/18613683_5971445', 'xl': 'https://ctl.s6img.com/society6/img/z90VlaYwd8cxCqbrZ1ttAxINpaY/w_700/artwork/~artwork/s6-0041/a/18613683_5971445', 'xxl': None}, 'type': 'image', 'alt': "I'M NOT ALWAYS A BITCH (Red) Cutting Board", 'meta': None}, 'c': {'src': {'xs': 'https://ctl.s6img.com/society6/img/KQJbb4jG0gBHcqQiOCivLUbKMxI/w_125/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'lg': 'https://ctl.s6img.com/society6/img/ztGrxSpA7FC1LfzM3UldiQkEi7g/w_550/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xl': 'https://ctl.s6img.com/society6/img/PHjp9jDic2NGUrpq8k0aaxsYZr4/w_700/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xxl': 'https://ctl.s6img.com/society6/img/m-1HhSM5CIGl6DY9ukCVxSmVDIw/w_1500/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg'}, 'type': 'image', 'alt': "I'M NOT ALWAYS A BITCH (Red) Cutting Board", 'meta': None}, 'd': {'src': {'xs': 'https://ctl.s6img.com/society6/img/G9TikRnVvy1w0kwKCAmgWsWy42Q/w_125/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'lg': 
'https://ctl.s6img.com/society6/img/uVOYOxbHmhrNhmGQAi6QeydrFdY/w_550/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xl': 'https://ctl.s6img.com/society6/img/-WIIUx9oB6jQKJdkSkq2ofhjLzc/w_700/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xxl': 'https://ctl.s6img.com/society6/img/HlSFppIm7Wk6aVxO17fI4b5s0ts/w_1500/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg'}, 'type': 'image', 'alt': "I'M NOT ALWAYS A BITCH (Red) Cutting Board", 'meta': None}}}
This is the json data. I only want to scrape the jpg image links. Below is my code:
import json
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

contents = []
with open('test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

newlist = []
for url in contents:
    try:
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, 'html.parser')
        scripts = soup.find_all('script')[7].text.strip()[24:]
        data = json.loads(scripts)
        link = data['product']['response']['product']['data']['attributes']['media_map']
    except:
        link = 'no data'
    detail = {
        'name': link
    }
    print(detail)
    newlist.append(detail)

df = pd.DataFrame(detail)
df.to_csv('s1.csv')
I’m trying to scrape all the jpg image links. I have a csv file with each product url saved in it, so I want to open the csv file and loop over each url.
Answer
A few things:
- df = pd.DataFrame(detail) should be df = pd.DataFrame(newlist)
- Your loop indentation is off. In fact, why are you looping over the urls twice? You read each url from test.csv (you could just use pandas for that anyway), put it into the contents list, and then loop through that list again. You can do the scraping inside the first loop instead.
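As an aside, the pandas suggestion looks like this — a minimal sketch that assumes test.csv has one url per row and no header (the urls below are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for open('test.csv'): a csv with one url per row, no header.
csv_text = "https://example.com/product/1\nhttps://example.com/product/2\n"

# Read column 0 straight into a list of urls, no csv.reader loop needed.
urls = pd.read_csv(io.StringIO(csv_text), header=None)[0].tolist()
print(urls)  # ['https://example.com/product/1', 'https://example.com/product/2']
```

With a real file you would pass the filename instead of the StringIO buffer.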
Try this:
import json
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

contents = []
with open('test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        try:
            page = urlopen(url[0]).read()
            soup = BeautifulSoup(page, 'html.parser')
            scripts = soup.find_all('script')[7].text.strip()[24:]
            data = json.loads(scripts)
            link = data['product']['response']['product']['data']['attributes']['media_map']
        except:
            link = 'no data'
        detail = {
            'name': link
        }
        print(detail)
        contents.append(detail)

df = pd.DataFrame(contents)
df.to_csv('s1.csv')
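That fixes the looping, but media_map still holds the whole nested dict rather than just the jpg links. One way to keep only the .jpg urls is to walk the dict recursively — a minimal sketch, using made-up example urls in the same shape as the site's json:

```python
def extract_jpg_links(node):
    """Recursively collect every string value ending in .jpg from nested dicts."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(extract_jpg_links(value))
    elif isinstance(node, str) and node.endswith('.jpg'):
        links.append(node)
    return links

# Hypothetical media_map with the same nesting as the real data:
# sizes without a .jpg extension (or set to None) are skipped.
media_map = {
    'b': {'src': {'xs': 'https://example.com/artwork/18613683_5971445',
                  'xxl': None},
          'type': 'image'},
    'c': {'src': {'xs': 'https://example.com/w_125/cutting-board.jpg',
                  'xl': 'https://example.com/w_700/cutting-board.jpg'},
          'type': 'image'},
}
print(extract_jpg_links(media_map))
# ['https://example.com/w_125/cutting-board.jpg', 'https://example.com/w_700/cutting-board.jpg']
```

In the scraping loop you would call extract_jpg_links(link) after the json.loads step and store that list in detail instead of the raw dict.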