Web scraping in Python (Beautiful Soup): multiple pages and subpages

Question

I create my soup with :

I'm trying to create a dataframe by scraping the site "https://myanimelist.net". As a first step, I would like to get each anime's title, eps and type. As a second step, from the detail page of each anime (a page like https://myanimelist.net/anime/2928/hack__GU_Returner), I would like to gather the scores that users assigned, which are contained in (for example :

Accepted Answer

This can be done directly with pandas using the read_html() function:

    import string

    import pandas as pd

    df = pd.DataFrame()

    # Only the letter "A" here; loop over string.ascii_uppercase for every letter
    for i in string.ascii_uppercase[:1]:
        url = "https://myanimelist.net/anime.php?letter={}".format(i)
        print(url)
        tables = pd.read_html(url, header=0)

        if df.empty:
            df = tables[2]
        else:
            df = pd.concat([df, tables[2]])

    print(df)

read_html() returns a list of ALL tables found at a given URL. In your case, you only need the table at index 2 (the third one). This would give you a dataframe starting:

       Unnamed: 0                                              Title     Type  Eps.  Score
    0         NaN  A Kite add  Sawa is a school girl, an orphan, ...      OVA     2   6.67
    1         NaN  A Piece of Phantasmagoria add  A collection of...      OVA    15   6.25
    2         NaN  A Play add  Music Video for the group ALT, mad...    Music     1   4.62
    3         NaN  A Smart Experiment add  Bonus short included o...  Special     1   4.95
    4         NaN  A-Channel add  Tooru and Run have been best fr...       TV    12   7.04

To do this using BeautifulSoup, you could use the following approach:

    import string

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    columns = ['Title', 'Type', 'Eps.', 'Score']
    df = pd.DataFrame()

    for i in string.ascii_uppercase:
        url = "https://myanimelist.net/anime.php?letter={}".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        table = soup.find_all('table')[2]

        # Skip the header row, then read one anime per table row
        for tr in table.find_all('tr')[1:]:
            row = [td.get_text(strip=True) for td in tr.find_all('td')[1:5]]

            # The first link in the row points to the anime's detail page
            url_sub = tr.find('a')['href']
            print(url_sub)
            r_sub = requests.get(url_sub)
            soup_sub = BeautifulSoup(r_sub.text, 'html.parser')
            all_scores = []     # each title has multiple lists of scores

            # Select all of the user assigned score tables
            for div in soup_sub.select('div.spaceit.textReadability.word-break.pt8.mt8'):
                scores = []     # scores for one block

                for tr_sub in div.div.table.find_all('tr'):
                    scores.append([td_sub.text for td_sub in tr_sub.find_all('td')])

                all_scores.append(scores)

            print(all_scores)   # These probably need adding to the row. Not all have scores.

            df_row = pd.DataFrame([row], columns=columns)

            if df.empty:
                df = df_row
            else:
                df = pd.concat([df, df_row])

    print(df)

For each title, a list of all the scores found is created and appended to all_scores, although it is not clear how you would want this added to your main dataframe.

For example, scores could look like this (the u prefixes appear because this output was produced under Python 2):

    https://myanimelist.net/anime/320/A_Kite
    [[[u'Overall', u'8'], [u'Story', u'8'], [u'Animation', u'7'], [u'Sound', u'7'], [u'Character', u'7'], [u'Enjoyment', u'8']],
     [[u'Overall', u'8'], [u'Story', u'8'], [u'Animation', u'10'], [u'Sound', u'0'], [u'Character', u'7'], [u'Enjoyment', u'10']],
     [[u'Overall', u'7'], [u'Story', u'7'], [u'Animation', u'8'], [u'Sound', u'6'], [u'Character', u'7'], [u'Enjoyment', u'8']],
     [[u'Overall', u'2'], [u'Story', u'2'], [u'Animation', u'2'], [u'Sound', u'2'], [u'Character', u'2'], [u'Enjoyment', u'2']]]

Answer