I have an array, dirty_pages, whose subarrays each contain [page_name, url, id]. This array contains duplicate subarrays. I need to parse each subarray in dirty_pages into clean_pages such that:

there are no duplicates (repeating subarrays)
the url at index 1 of each subarray must be unique! For example, these two should be counted as one (url/#review is still the same url):

```
file:///home/joe/Desktop/my-projects/FashionShop/product.html#review
```

and

```
file:///home/joe/Desktop/my-projects/FashionShop/product.html
```
My current attempt returns clean_pages with 6 subarrays (duplicates!) while the correct answer should be 4:
```python
# clean pages
clean_pages = []

# dirty pages
dirty_pages = [
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/?123', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/', '1608093980462.042'],
]

# clean data - get unique pages for each session
for j in range(len(dirty_pages)):
    page_name = dirty_pages[j][0]
    page_url = dirty_pages[j][1]
    page_sessionId = dirty_pages[j][2]

    not_seen = False

    if len(clean_pages) == 0:
        clean_pages.append([page_name, page_url, page_sessionId])
    else:
        for i in range(len(clean_pages)):
            next_page_name = clean_pages[i][0]
            next_page_url = clean_pages[i][1]
            next_page_sessionId = clean_pages[i][2]

            if page_url != next_page_url and page_name != next_page_name \
                    and page_sessionId == next_page_sessionId:
                not_seen = True
            else:
                not_seen = False

        if not_seen is True:
            clean_pages.append([page_name, page_url, page_sessionId])

print("$$$ clean...", len(clean_pages))

# correct answer should be 4 - as anything after the url e.g. #review is still a duplicate!
```
UPDATE EXAMPLE – Apologies if the example wasn't clear (just like # after the url, these should all be considered one url):
```
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/?123'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html'
```
Answer
You can use furl to normalize the URLs:
```python
from furl import furl

# Iterate over each page - subarray
for page in dirty_pages:
    # normalize url: drop query args and fragment, strip trailing slash
    page[1] = furl(page[1]).remove(args=True, fragment=True).url.strip("/")

    # check if subarray already in clean_pages
    if page not in clean_pages:
        clean_pages.append(page)
```
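If you'd rather not pull in a third-party dependency, the same normalization can be sketched with plain string handling, assuming "same URL" means ignoring the query string, the fragment, and any trailing slash. The `normalize` helper name and the small sample `dirty_pages` below are illustrative, not from the original code:

```python
def normalize(url):
    # Drop the fragment (#...), then the query string (?...),
    # then any trailing slash
    url = url.split('#', 1)[0]
    url = url.split('?', 1)[0]
    return url.rstrip('/')

# illustrative sample data
dirty_pages = [
    ['Product', 'file:///home/joe/FashionShop/product.html#review', 's1'],
    ['Product', 'file:///home/joe/FashionShop/product.html', 's1'],
    ['Index', 'file:///home/joe/FashionShop/index.html/?123', 's1'],
    ['Index', 'file:///home/joe/FashionShop/index.html/', 's1'],
]

clean_pages = []
for name, url, session_id in dirty_pages:
    page = [name, normalize(url), session_id]
    # keep only the first occurrence of each normalized subarray
    if page not in clean_pages:
        clean_pages.append(page)

print(len(clean_pages))  # the four sample rows collapse to 2 unique pages
```

Note that splitting on `#` before `?` handles both orders safely for these file URLs, since a fragment always follows the query in a well-formed URL.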