Skip to content
Advertisement

Map different column values with website context

I have a dataframe like this:

df1 = pd.DataFrame({   
    "index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else","exec sp_add_jobserver"],
    "index1": ["NaN",  "NaN",  "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "NaN", "something else"],
    "index3": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job"]
})
df1.head()

      index                 index1                  index2          index3
0   EXEC sp_delete_job       NaN                EXEC sp_droplogin   EXEC sp_droplogin
1   exec sp_add_job          NaN                EXEC sp_delete_job  EXEC sp_delete_job
2   something else           NaN                  NaN                   exec sp_add_job
3   exec sp_add_jobserver    exec sp_delete_job   something else    exec sp_delete_job

What I want is to map the columns values with their description from this site https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/system-stored-procedures-transact-sql?view=sql-server-ver15

So for example this value EXEC sp_droplogin can be mapped with the description from here https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-droplogin-transact-sql?view=sql-server-ver15

so the output will look like this:

    index
    0   Removes a SQL Server login. This prevents access to an instance of SQL Server under 
 that login name.
    1   EXEC sp_delete_job
    2   exec sp_add_job
    3   exec sp_delete_job
    4   exec sp_add_jobserver

And the same must be done with the other column values.

What is the best way to perform this? With BeautifulSoup?

Can you provide some ideas/direction/code etc?

Advertisement

Answer

You could call a function for each index entry and replace it with the results of a requests beautifulsoup lookup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

def description(value):   
    name = value.split(' ')[1].replace('_', '-')
    url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "html.parser")
    div = soup.find('div', class_="content")
    return [p.text for p in div.find_all('p')][3]


df = pd.DataFrame({   
    "index": ["EXEC sp_droplogin",  "EXEC sp_delete_job",  "exec sp_add_job", "exec sp_delete_job","exec sp_add_jobserver"],
})


df['index'] = df['index'].map(description)
print(df)

This would change your dataframe as follows:

                                                                                                  index
0  Removes a SQL Server login. This prevents access to an instance of SQL Server under that login name.
1                                                                                        Deletes a job.
2                                                     Adds a new job executed by the SQL Agent service.
3                                                                                        Deletes a job.
4                                                    Targets the specified job at the specified server.
  1. First take value e.g. EXEC sp_droplogin and split it on the space. Then take the second part sp_droplogin and replace any _ with - which is needed for the URL.

  2. Create a suitable URL based on name.

  3. Use requests.get() to obtain the corresponding HTML from the Microsoft site.

  4. Locate a <div class='content'> which holds the description.

  5. Inside that div, locate all the <p> elements and extract the text for each. The fourth entry holds the required text. Return that.

If there are None values, you would need to test for this and return a suitable value:

def description(value):
    if value:
        .........existing code......
    else:
        return "Not found"

For your updated example, I suggest you use a dictionary to keep the results of each request to avoid looking up the same value multiple times.

You can use .applymap() to run the function for all items in the dataframe.

Lastly, if value does not start with exec then just return the value unchanged (or whatever you prefer)

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

procedures = {}     # cache of results

def description(value):   
    if value.lower().startswith("exec "):
        name = value.lower().split(' ')[1].replace('_', '-')
        
        if name in procedures:  # already seen?
            return procedures[name]
        else:
            url = f"https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/{name}-transact-sql?view=sql-server-ver15"
            req = requests.get(url)
            soup = BeautifulSoup(req.content, "html.parser")
            div = soup.find('div', class_="content")
            text = [p.text for p in div.find_all('p')][3]
            procedures[name] = text
            return text
    else:
        return value


df = pd.DataFrame({   
    "index": ["EXEC sp_delete_job",  "exec sp_add_job", "something else", "exec sp_add_jobserver"],
    "index1": ["NaN", "NaN",  "NaN", "exec sp_delete_job"],
    "index2": ["EXEC sp_droplogin", "EXEC sp_delete_job", "NaN", "something else"],
    "index3": ["EXEC sp_droplogin", "EXEC sp_delete_job", "exec sp_add_job", "exec sp_delete_job"]
})

df = df.applymap(description)
print(df)
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement