I have an array, dirty_pages, whose subarrays each contain [page_name, url, id]. This array contains duplicate subarrays. I need to parse each subarray in dirty_pages into clean_pages such that:

there are no duplicates (repeating subarrays)
the url at index 1 of each subarray must be unique! For example, these two should be counted as one (url/#review is still the same url):

```
file:///home/joe/Desktop/my-projects/FashionShop/product.html#review
```

and

```
file:///home/joe/Desktop/my-projects/FashionShop/product.html
```
My current attempt returns clean_pages with 6 subarrays (duplicates!) while the correct answer should be 4:
```python
# clean pages
clean_pages = []

# dirty pages
dirty_pages = [
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/?123', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/', '1608093980462.042'],
]

# clean data - get unique pages for each session
for j in range(len(dirty_pages)):
    page_name = dirty_pages[j][0]
    page_url = dirty_pages[j][1]
    page_sessionId = dirty_pages[j][2]

    not_seen = False

    if len(clean_pages) == 0:
        clean_pages.append([page_name, page_url, page_sessionId])
    else:
        for i in range(len(clean_pages)):
            next_page_name = clean_pages[i][0]
            next_page_url = clean_pages[i][1]
            next_page_sessionId = clean_pages[i][2]

            if page_url != next_page_url and page_name != next_page_name \
                    and page_sessionId == next_page_sessionId:
                not_seen = True
            else:
                not_seen = False

        if not_seen is True:
            clean_pages.append([page_name, page_url, page_sessionId])

print("$$$ clean...", len(clean_pages))

# correct answer should be 4 - as anything after the url e.g. #review is still a duplicate!
```
UPDATE EXAMPLE – Apologies if the example wasn't clear (just like # after the url, these should all be considered one url):
```
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/?123'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html'
```
Answer
You can use furl to normalize the URLs:
```python
from furl import furl

# Iterate over each page - subarray
for page in dirty_pages:
    # normalize url: drop query args and fragment, strip trailing slash
    page[1] = furl(page[1]).remove(args=True, fragment=True).url.strip("/")

    # check if subarray already in clean_pages
    if page not in clean_pages:
        clean_pages.append(page)
```
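If you'd rather not pull in a third-party dependency, the same normalization can be sketched with plain string handling, assuming "same URL" means ignoring the query string, the fragment, and any trailing slash. The `normalize` helper name and the small sample `dirty_pages` below are illustrative, not from the original code:

```python
def normalize(url):
    # Drop the fragment (#...), then the query string (?...),
    # then any trailing slash
    url = url.split('#', 1)[0]
    url = url.split('?', 1)[0]
    return url.rstrip('/')

# illustrative sample data
dirty_pages = [
    ['Product', 'file:///home/joe/FashionShop/product.html#review', 's1'],
    ['Product', 'file:///home/joe/FashionShop/product.html', 's1'],
    ['Index', 'file:///home/joe/FashionShop/index.html/?123', 's1'],
    ['Index', 'file:///home/joe/FashionShop/index.html/', 's1'],
]

clean_pages = []
for name, url, session_id in dirty_pages:
    page = [name, normalize(url), session_id]
    # keep only the first occurrence of each normalized subarray
    if page not in clean_pages:
        clean_pages.append(page)

print(len(clean_pages))  # the four sample rows collapse to 2 unique pages
```

Note that splitting on `#` before `?` handles both orders safely for these file URLs, since a fragment always follows the query in a well-formed URL.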