How to scrape the text from this data?

 <script type="text/javascript">
/**
 * Define SVG path for target icon
 */
var targetSVG = "M9,0C4.029,0,0,4.029,0,9s4.029,9,9,9s9-4.029,9-9S13.971,0,9,0z M9,15.93 c-3.83,0-6.93-3.1-6.93-6.93S5.17,2.07,9,2.07s6.93,3.1,6.93,6.93S12.83,15.93,9,15.93 M12.5,9c0,1.933-1.567,3.5-3.5,3.5S5.5,10.933,5.5,9S7.067,5.5,9,5.5 S12.5,7.067,12.5,9z";

/**
 * Create the map
 */
var i=1;


var countrydataprovider = {
 "map": "indiaLow",
"getAreasFromMap": true,
  "theme": "none",
 
 "imagesSettings": {
    "rollOverColor": "#089282",
    "rollOverScale": 3,
"labelPosition": "middle",
    "labelFontSize": 8,
 "labelColor": "#fff",
    "selectedScale": 3,
    "selectedColor": "#089282",
    "color": "#13564e"
  },
"images": [
    {
        "imageURL": "nowcast_marker/map-marker-icon-png-green.png",
        "width": 20,
        "height": 20,
        "description": "<p>No Warning </br></br> Time of issue: 2022-10-07</br>1005 Hrs</br> Valid upto: 1305 Hrs </p>",
        "zoomLevel": 5,
        "scale": 0.5,
        "title": "Bapatla",
        "latitude": "15.905897",
        "longitude": "80.471587"
    },

I want to extract the data in the "images" subsection of this script. This is the code I have written so far, but I could not get any further. Could anybody please help?

import requests  # HTTP client used to fetch the page
from bs4 import BeautifulSoup  # HTML parser

url = "https://mausam.imd.gov.in/imd_latest/contents/stationwise-nowcast-warning.php"
html = requests.get(url).content  # raw HTML bytes of the page
soup = BeautifulSoup(html, 'html.parser')  # parsed document tree
a = soup.find('script', attrs={'type': 'text/javascript'})  # first matching <script> tag

Answer

You are on the right track; you just need to dissect the contents of that script tag further to get what you need. One way to obtain the data is to cut the "images" array out of the script text, parse it as JSON, and load it into a pandas DataFrame:

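import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://mausam.imd.gov.in/imd_latest/contents/stationwise-nowcast-warning.php'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# select_one returns the first <script type="text/javascript"> tag on the page
script_text = soup.select_one('script[type="text/javascript"]').text
# Keep what sits between '"images": [' and the next ']' (assumes no ']' inside the data)
images_json = script_text.split('"images": [')[1].split(']')[0]
obj = json.loads('[' + images_json + ']')  # restore the brackets removed by split()
df = pd.json_normalize(obj)  # one row per station marker
print(df)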

Result in terminal:

    imageURL    width   height  description zoomLevel   scale   title   latitude    longitude
0   nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Bapatla 15.905897   80.471587
1   nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Eluru   16.71066    81.09524
2   nowcast_marker/map-marker-icon-png-yellow.png   20  20  <p>Light rain: < 5 mm/hr</br> Light Thundersto...   5   0.5 Gannavaram  16.540171   80.801249
3   nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Guntur  16.306652   80.43654
4   nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Kakinada    16.945181   82.238647
... ... ... ... ... ... ... ... ... ...
1115    nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Namrup  27.12   95.18
1116    nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Nazira  26.54   94.44
1117    nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Moreh   24.2475 94.3045
1118    nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Moirang 24.5028 93.7768
1119    nowcast_marker/map-marker-icon-png-green.png    20  20  <p>No Warning </br></br> Time of issue: 2022-1...   5   0.5 Jhandutta   31.3702 76.6369
1120 rows × 9 columns
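
The split on ']' above works as long as no field inside the array contains a closing bracket. A slightly more defensive variant, sketched here under the assumption that the "images" array is valid JSON (which the result above suggests), lets the standard library's json.JSONDecoder.raw_decode find the end of the array itself. The helper name extract_images and the sample script text are made up for illustration; on the real page you would pass in the text of the script tag instead.

```python
import json


def extract_images(script_text, key='"images"'):
    """Return the JSON array that follows `key` in a JavaScript source string."""
    # '"images"' with its closing quote does not match '"imagesSettings"'
    key_pos = script_text.index(key)
    start = script_text.index('[', key_pos)
    # raw_decode parses exactly one JSON value beginning at `start`
    # and ignores whatever JavaScript follows it
    images, _end = json.JSONDecoder().raw_decode(script_text, start)
    return images


# Hypothetical, shortened stand-in for the script text on the page
sample_script = '''
var countrydataprovider = {
  "map": "indiaLow",
  "imagesSettings": {"color": "#13564e"},
  "images": [
    {"title": "Bapatla", "description": "<p>No Warning [hypothetical]</p>"},
    {"title": "Eluru", "description": "<p>No Warning </p>"}
  ]
};
'''

images = extract_images(sample_script)
print(len(images), images[0]["title"])  # prints: 2 Bapatla
```

The returned list can be passed to pd.json_normalize exactly like obj in the answer's code.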

See pandas documentation at https://pandas.pydata.org/docs/

Also BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/
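
Since the question asks for the text, the description column still needs cleaning: it holds HTML fragments such as the one shown in the question. Below is a minimal sketch that strips the tags and splits the remainder into three fields. It assumes every description follows the "Time of issue: ... Valid upto: ..." pattern seen in the sample; the function name parse_description and the output keys are hypothetical, and descriptions that do not match are returned as plain text.

```python
import re

# A tag must start with a letter (after an optional "/"), so a literal
# "< 5 mm/hr" inside the warning text is left alone
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
FIELDS_RE = re.compile(
    r"^(?P<warning>.*?)\s*Time of issue:\s*(?P<issued>.*?)\s*Valid upto:\s*(?P<valid_upto>.*)$"
)


def parse_description(description):
    """Strip tags from a marker description and split it into fields."""
    # Replace tags with spaces, then collapse runs of whitespace
    text = " ".join(TAG_RE.sub(" ", description).split())
    match = FIELDS_RE.match(text)
    if match is None:
        # Unknown layout: keep the cleaned text, leave the other fields empty
        return {"warning": text, "issued": None, "valid_upto": None}
    return match.groupdict()


sample = ("<p>No Warning </br></br> Time of issue: 2022-10-07</br>1005 Hrs"
          "</br> Valid upto: 1305 Hrs </p>")
print(parse_description(sample))
```

Applied to the DataFrame from the answer, something like df.join(pd.DataFrame(df["description"].map(parse_description).tolist())) would add the three fields as new columns.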
