I have an array with subarrays containing [page_name, url, id] in dirty_pages. This array contains duplicate subarrays.
I need to parse each subarray in dirty_pages into clean_pages such that:

there are no duplicates (repeating subarrays), and
the 1st index in the subarray, i.e. the url, must be unique. For example, the following should be counted as one url (#review appended to the url is still the same url):

file:///home/joe/Desktop/my-projects/FashionShop/product.html#review
and
file:///home/joe/Desktop/my-projects/FashionShop/product.html
My current attempt returns clean_pages with 6 subarrays (duplicates!), while the correct answer should be 4:
# clean pages
clean_pages = []

# dirty pages
dirty_pages = [
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/?123', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/', '1608093980462.042'],
]

# clean data - get unique pages for each session
for j in range(len(dirty_pages)):
    page_name = dirty_pages[j][0]
    page_url = dirty_pages[j][1]
    page_sessionId = dirty_pages[j][2]

    not_seen = False

    if len(clean_pages) == 0:
        clean_pages.append([page_name, page_url, page_sessionId])
    else:
        for i in range(len(clean_pages)):
            next_page_name = clean_pages[i][0]
            next_page_url = clean_pages[i][1]
            next_page_sessionId = clean_pages[i][2]

            if page_url != next_page_url and page_name != next_page_name and page_sessionId == next_page_sessionId:
                not_seen = True
            else:
                not_seen = False

        if not_seen is True:
            clean_pages.append([page_name, page_url, page_sessionId])

print("$$$ clean...", len(clean_pages))
# correct answer should be 4 - as anything after the url e.g. #review is still a duplicate!
UPDATE EXAMPLE – Apologies if the example wasn't clear: just like # after the url, the following should be considered one url:
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/?123'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html'
Answer
You can use furl to normalize the urls:
from furl import furl

# Iterate over each page - subarray
for page in dirty_pages:
    # normalize url: drop the query args and fragment, strip trailing slashes
    page[1] = furl(page[1]).remove(args=True, fragment=True).url.strip("/")
    # check if subarray already in clean_pages
    if page not in clean_pages:
        clean_pages.append(page)
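If you would rather not add a dependency, a plain-string normalization is enough for urls shaped like these. This is a minimal sketch, not part of the original answer: normalize is a hypothetical helper that assumes cutting everything after ? or # and dropping a trailing slash is all the cleanup needed, and it dedupes on the normalized url alone, since that is the field that must be unique.

def normalize(url):
    # hypothetical helper: drop everything after '?' or '#', then any trailing slash
    return url.split('#', 1)[0].split('?', 1)[0].rstrip('/')

clean_pages = []
seen_urls = set()
for page_name, page_url, page_sessionId in dirty_pages:
    key = normalize(page_url)
    if key not in seen_urls:  # the url (1st index) is what must be unique
        seen_urls.add(key)
        clean_pages.append([page_name, key, page_sessionId])

print(len(clean_pages))

Using a set of normalized urls makes the membership check O(1) per page and makes the url, rather than the whole subarray, the uniqueness criterion.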