I am trying to fetch some articles from ACL website based on the keywords as input. The website is using google custom search API and the output of the API is a javascript object.
How I can parse the returned object in python and fetch the article name, URL, and abstract of the research paper from the object.
The script I am using to fetch articles :
import requests params = ( ('rsz', 'filtered_cse'), ('num', '10'), ('hl', 'en'), ('source', 'gcsc'), ('gss', '.com'), ('cselibv', 'cc267ab8871224bd'), ('cx', '000299513257099441687:fkkgoogvtaw'), ('q', 'multi-label text classification'), ('safe', 'off'), ('cse_tok', 'AJvRUv1dd6NHqw5GKAoRSg3lLILE:1636278007905'), ('sort', ''), ('exp', 'csqr,cc,4618906'), ('callback', 'google.search.cse.api12760'), ) response = requests.get('https://cse.google.com/cse/element/v1', params=params) print(response.headers['Content-Type']) # 'application/javascript; charset=utf-8'
output looks like this:
'/*O_o*/ngoogle.search.cse.api12760({n "cursor": {n "currentPageIndex": 0,n "estimatedResultCount": "21600",n "moreResultsUrl": "http://www.google.com/cse?oe=utf8&ie=utf8&source=uds&q=multi-label+text+classification&safe=off&sort=&cx=000299513257099441687:fkkgoogvtaw&start=0",n "resultCount": "21,600",n "searchResultTime": "0.16",n "pages": [n {n "label": 1,n "start": "0"n },n {n "label": 2,n "start": "10"n },n {n "label":
Although the output in the network tab of chrome is JSON while initiating the search command:
How can I get articles along with their link from the js object in python?
Advertisement
Answer
response.text
gives you string and if you remove /*O_o*/ngoogle.search.cse.api12760(
at the beginning, and );
at the end then you will have normal JSON
which you can convert to Python dictionary using json.loads()
– and then you can use [key]
to get data from dictionary.
Minimal working example
import requests import json params = ( ('rsz', 'filtered_cse'), ('num', '10'), ('hl', 'en'), ('source', 'gcsc'), ('gss', '.com'), ('cselibv', 'cc267ab8871224bd'), ('cx', '000299513257099441687:fkkgoogvtaw'), ('q', 'multi-label text classification'), ('safe', 'off'), ('cse_tok', 'AJvRUv1dd6NHqw5GKAoRSg3lLILE:1636278007905'), ('sort', ''), ('exp', 'csqr,cc,4618906'), ('callback', 'google.search.cse.api12760'), ) response = requests.get('https://cse.google.com/cse/element/v1', params=params) start = len('''/*O_o*/ google.search.cse.api12760(''') end = len(');') text = response.text[start:-end] data = json.loads(text) #print(data) for item in data['results']: #print('keys:', item.keys()) print('title:', item['title']) print('url:', item['url']) #print('content:', item['content']) #print('title:', item['titleNoFormatting']) #meta = item['richSnippet']['metatags'] #if 'author' in meta: # print('author:', meta['author']) print('---')
Result:
title: Large-Scale <b>Multi</b>-<b>Label Text Classification</b> on EU Legislation - ACL ... url: https://www.aclweb.org/anthology/P19-1636/ --- title: <b>Label</b>-Specific Document Representation for <b>Multi</b>-<b>Label Text</b> ... url: https://www.aclweb.org/anthology/D19-1044/ --- title: Initializing neural networks for hierarchical <b>multi</b>-<b>label text</b> ... url: https://www.aclweb.org/anthology/W17-2339 --- title: TaxoClass: Hierarchical <b>Multi</b>-<b>Label Text Classification</b> Using Only ... url: https://www.aclweb.org/anthology/2021.naacl-main.335/ --- title: NeuralClassifier: An Open-source Neural Hierarchical <b>Multi</b>-<b>label</b> ... url: https://www.aclweb.org/anthology/P19-3015/ --- title: Extreme <b>Multi</b>-<b>Label</b> Legal <b>Text Classification</b>: A Case Study in EU ... url: https://www.aclweb.org/anthology/W19-2209 --- title: Hierarchical Transfer Learning for <b>Multi</b>-<b>label Text Classification</b> ... url: https://www.aclweb.org/anthology/P19-1633/ --- title: Global Model for Hierarchical <b>Multi</b>-<b>Label Text Classification</b> - ACL ... url: https://www.aclweb.org/anthology/I13-1006 --- title: Hierarchical <b>Multi</b>-<b>label Classification</b> of <b>Text</b> with Capsule Networks ... url: https://www.aclweb.org/anthology/P19-2045 --- title: Improving Pretrained Models for Zero-shot <b>Multi</b>-<b>label Text</b> ... url: https://www.aclweb.org/anthology/2021.naacl-main.83.pdf ---
BTW:
If you display item.keys()
then you should see what else you can get:
'cacheUrl', 'clicktrackUrl', 'content', 'contentNoFormatting', 'title', 'titleNoFormatting', 'formattedUrl', 'unescapedUrl', 'url', 'visibleUrl', 'richSnippet', 'breadcrumbUrl'
Or you can use for-loop to display all keys and values
for item in data['results']: for key, value in item.items(): print(f'{key}: {value}') print('---') print('===================================')
Some of them may have sub dictionaries – like item['richSnippet']['metatags']['author']