I have a dataframe like this:
df1 = pd.DataFrame({ "index": ["EXEC sp_delete_job", "exec sp_add_job", "something else","exec sp_add_jobserver"], "index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"], "index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"], "index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"] }) df1.head() index index1 index2 index3 0 EXEC sp_delete_job NaN EXEC sp_droplogin EXEC sp_droplogin 1 exec sp_add_job NaN EXEC sp_delete_job EXEC sp_delete_job 2 something else NaN NaN exec sp_add_job 3 exec sp_add_jobserver exec sp_delete_job something else exec sp_delete_job
What I want is to map the columns values with their description from this site https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-server-ver15
So for example this value EXEC sp_droplogin
can be mapped with the description from here
https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-server-ver15
so the output will look like this:
index 0 Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name. 1 EXEC sp_delete_job 2 exec sp_add_job 3 exec sp_delete_job 4 exec sp_add_jobserver
And the same must be done with the other column values.
What is the best way to perform this? With BeautifulSoup?
Can you provide some ideas/direction/code etc?
Advertisement
Answer
You could call a function for each index
entry and replace it with the results of a requests
beautifulsoup
lookup:
import pandas as pd import requests from bs4 import BeautifulSoup def description(value): name = value.split(' ')[1].replace('_', '-') url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15" req = requests.get(url) soup = BeautifulSoup(req.content, "html.parser") div = soup.find('div', class_="content") return [p.text for p in div.find_all('p')][3] df = pd.DataFrame({ "index": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"], }) df['index'] = df['index'].map(description) print(df)
This would change your dataframe as follows:
index 0 Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name. 1 Deletes a job. 2 Adds a new job executed by the SQL Agent service. 3 Deletes a job. 4 Targets the specified job at the specified server.
First take
value
e.g.EXEC sp_droplogin
and split it on the space. Then take the second partsp_droplogin
and replace any_
with-
which is needed for the URL.Create a suitable URL based on
name
.Use
requests.get()
to obtain the corresponding HTML from the Microsoft site.Locate a
<div class='content'>
which holds the description.Inside that div, locate all the
<p>
elements and extract the text for each. The fourth entry holds the required text. Return that.
If there are None
values, you would need to test for this and return a suitable value:
def description(value): if value: .........existing code...... else: return "Not found"
For your updated example, I suggest you use a dictionary to keep the results of each request to avoid looking up the same value multiple times.
You can use .applymap()
to run the function for all items in the dataframe.
Lastly, if value
does not start with exec
then just return the value unchanged (or whatever you prefer)
import pandas as pd import numpy as np import requests from bs4 import BeautifulSoup procedures = {} # cache of results def description(value): if value.lower().startswith("exec "): name = value.lower().split(' ')[1].replace('_', '-') if name in procedures: # already seen? return procedures[name] else: url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15" req = requests.get(url) soup = BeautifulSoup(req.content, "html.parser") div = soup.find('div', class_="content") text = [p.text for p in div.find_all('p')][3] procedures[name] = text return text else: return value df = pd.DataFrame({ "index": ["EXEC sp_delete_job", "exec sp_add_job", "something else", "exec sp_add_jobserver"], "index1": ["NaN", "NaN", "NaN", "exec sp_delete_job"], "index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"], "index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"] }) df = df.applymap(description) print(df)