How to parse Google custom search javascript output in python?

Question

I am trying to fetch some articles from the ACL website based on keywords given as input. The website uses the Google Custom Search API, and the output of the API is a JavaScript object. How can I parse the returned object in Python and fetch the article name, URL, and abstract of each research paper from it? The script

Accepted Answer

response.text gives you a string. If you remove /*O_o*/, the newline after it, and google.search.cse.api12760( from the beginning, and ); from the end, you are left with normal JSON, which you can convert to a Python dictionary using json.loads(). Then you can use [key] to get data from the dictionary.

Minimal working example

    import requests
    import json

    params = (
        ('rsz', 'filtered_cse'),
        ('num', '10'),
        ('hl', 'en'),
        ('source', 'gcsc'),
        ('gss', '.com'),
        ('cselibv', 'cc267ab8871224bd'),
        ('cx', '000299513257099441687:fkkgoogvtaw'),
        ('q', 'multi-label text classification'),
        ('safe', 'off'),
        ('cse_tok', 'AJvRUv1dd6NHqw5GKAoRSg3lLILE:1636278007905'),
        ('sort', ''),
        ('exp', 'csqr,cc,4618906'),
        ('callback', 'google.search.cse.api12760'),
    )

    response = requests.get('https://cse.google.com/cse/element/v1', params=params)

    # The JSON is wrapped in a callback call; the prefix contains a newline after the comment
    start = len('/*O_o*/\ngoogle.search.cse.api12760(')
    end = len(');')
    text = response.text[start:-end]

    data = json.loads(text)
    #print(data)

    for item in data['results']:
        #print('keys:', item.keys())
        print('title:', item['title'])
        print('url:', item['url'])
        #print('content:', item['content'])
        #print('title:', item['titleNoFormatting'])
        #meta = item['richSnippet']['metatags']
        #if 'author' in meta:
        #    print('author:', meta['author'])
        print('---')

Result:

    title: Large-Scale Multi-Label Text Classification on EU Legislation - ACL ...
    url: https://www.aclweb.org/anthology/P19-1636/
    ---
    title: Label-Specific Document Representation for Multi-Label Text ...
    url: https://www.aclweb.org/anthology/D19-1044/
    ---
    title: Initializing neural networks for hierarchical multi-label text ...
    url: https://www.aclweb.org/anthology/W17-2339
    ---
    title: TaxoClass: Hierarchical Multi-Label Text Classification Using Only ...
    url: https://www.aclweb.org/anthology/2021.naacl-main.335/
    ---
    title: NeuralClassifier: An Open-source Neural Hierarchical Multi-label ...
    url: https://www.aclweb.org/anthology/P19-3015/
    ---
    title: Extreme Multi-Label Legal Text Classification: A Case Study in EU ...
    url: https://www.aclweb.org/anthology/W19-2209
    ---
    title: Hierarchical Transfer Learning for Multi-label Text Classification ...
    url: https://www.aclweb.org/anthology/P19-1633/
    ---
    title: Global Model for Hierarchical Multi-Label Text Classification - ACL ...
    url: https://www.aclweb.org/anthology/I13-1006
    ---
    title: Hierarchical Multi-label Classification of Text with Capsule Networks ...
    url: https://www.aclweb.org/anthology/P19-2045
    ---
    title: Improving Pretrained Models for Zero-shot Multi-label Text ...
    url: https://www.aclweb.org/anthology/2021.naacl-main.83.pdf
    ---

BTW:

If you display item.keys() you can see what else is available:

    'cacheUrl', 'clicktrackUrl', 'content', 'contentNoFormatting', 'title', 'titleNoFormatting', 'formattedUrl', 'unescapedUrl', 'url', 'visibleUrl', 'richSnippet', 'breadcrumbUrl'

Or you can use a for-loop to display all keys and values:

    for item in data['results']:
        for key, value in item.items():
            print(f'{key}: {value}')
            print('---')
        print('===================================')

Some of the values are nested dictionaries, for example item['richSnippet']['metatags']['author'].
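Cutting a fixed number of characters from the start of response.text only works while the callback name stays exactly the same. Below is a minimal sketch of an alternative, assuming the response always has the shape prefix(JSON); so the payload sits between the first ( and the last ). The helper names parse_jsonp and extract_records are hypothetical (not part of requests or json), and the sample string is made up to mimic the response, so no network access is needed. Note that 'content' is the search snippet shown for each hit, which may be shorter than the full abstract.

```python
import json


def parse_jsonp(text):
    """Return the JSON payload wrapped in a call like `callback({...});`."""
    start = text.find('(')   # first '(' opens the callback call
    end = text.rfind(')')    # last ')' closes it
    if start == -1 or end <= start:
        raise ValueError('text does not look like a JSONP response')
    return json.loads(text[start + 1:end])


def extract_records(data):
    """Collect title, url, snippet and (if present) author for each result."""
    records = []
    for item in data.get('results', []):
        # richSnippet and metatags are optional, so fall back to empty dicts
        meta = item.get('richSnippet', {}).get('metatags', {})
        records.append({
            'title': item.get('titleNoFormatting', item.get('title')),
            'url': item.get('url'),
            'content': item.get('contentNoFormatting', item.get('content')),
            'author': meta.get('author'),
        })
    return records


# Made-up sample in the same shape as the real response (assumption)
sample = (
    '/*O_o*/\n'
    'google.search.cse.api12760({"results": [{"title": "Example", '
    '"url": "https://example.org/", "content": "snippet"}]});'
)

for record in extract_records(parse_jsonp(sample)):
    print(record)
```

With a real request you would call parse_jsonp(response.text) instead of passing the sample string. Using .get() means a result without richSnippet or author yields None for that field instead of raising KeyError.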

Answer