I wrote some BeautifulSoup scripts, and one part seems really redundant. I am wondering whether it can be simplified with a regex.
All posts on this forum are marked with different colors, and I search for each color with its own line. For six colors I wrote six lines that differ by only one word.
red = soup.find_all('a', style="font-weight: bold;color: red")
blue = soup.find_all('a', style="font-weight: bold;color: blue")
green = soup.find_all('a', style="font-weight: bold;color: green")
purple = soup.find_all('a', style="font-weight: bold;color: purple")
orange = soup.find_all('a', style="font-weight: bold;color: orange")
lime = soup.find_all('a', style="color: green")
I am not sure whether this can be simplified. Maybe something like:
re.compile("(color: red|blue|green|purple|orange)", re.(whatever the letter is))
If regex is not the right tool, could something else do this?
This is part of the DOM:
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
</label>
<em>[<a href="forumdisplay.php?fid=230&filter=type&typeid=140">美臀</a>]</em> <span id="thread_10431427"><a href="thread-10431427-1-1.html" style="font-weight: bold;color: blue">(本中)(HND-???) 二宮ひかり</a></span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
<a href="space.php?action=viewpro&uid=12737809">第一會所新片</a><img align="absmiddle" border="0" src="images/thankyou.gif"/>6 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>2</strong> / <em>12234</em></td>
<td class="nums">5.02G / MP4
</td>
<td class="lastpost">
<em><a href="redirect.php?tid=10431427&goto=lastpost#lastpost">2019-4-23 20:22</a></em>
<cite>by <a href="space.php?action=viewpro&username=zj376104288">zj376104288</a></cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431424">
<tr>
<td class="folder"><a href="thread-10431424-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_common.gif"/></a></td>
<td class="icon">
</td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
</label>
<em>[<a href="forumdisplay.php?fid=230&filter=type&typeid=1303">VR</a>]</em> <span id="thread_10431424"><a href="thread-10431424-1-1.html" style="font-weight: bold;color: red">(WAAP)(WPVR-???)葵百合香</a></span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
<a href="space.php?action=viewpro&uid=12737809">第一會所新片</a><img align="absmiddle" border="0" src="images/thankyou.gif"/>5 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>7265</em></td>
<td class="nums">3.85G / MP4
</td>
<td class="lastpost">
<em><a href="redirect.php?tid=10431424&goto=lastpost#lastpost">2019-4-22 20:57</a></em>
<cite>by <a href="space.php?action=viewpro&username=%B5%DA%D2%BB%95%FE%CB%F9%D0%C2%C6%AC">第一會所新片</a></cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431423">
<tr>
<td class="folder"><a href="thread-10431423-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_common.gif"/></a></td>
<td class="icon">
</td>
<th class="common">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
</label>
<em>[<a href="forumdisplay.php?fid=230&filter=type&typeid=1303">VR</a>]</em> <span id="thread_10431423"><a href="thread-10431423-1-1.html" style="font-weight: bold;color: red">(KMP)(SAVR-???)舞島あかり</a></span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
</th>
<td class="author">
<cite>
<a href="space.php?action=viewpro&uid=12737809">第一會所新片</a><img align="absmiddle" border="0" src="images/thankyou.gif"/>4 </cite>
<em>2019-4-22</em>
</td>
<td class="nums"><strong>0</strong> / <em>6226</em></td>
<td class="nums">23.39G / MP4
</td>
<td class="lastpost">
<em><a href="redirect.php?tid=10431423&goto=lastpost#lastpost">2019-4-22 20:57</a></em>
<cite>by <a href="space.php?action=viewpro&username=%B5%DA%D2%BB%95%FE%CB%F9%D0%C2%C6%AC">第一會所新片</a></cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
<!-- 三級置頂分開 -->
<tbody id="stickthread_10431422">
<tr>
<td class="folder"><a href="thread-10431422-1-1.html" target="_blank" title="新窗口打开"><img src="images/green001/folder_common.gif"/></a></td>
<td class="icon">
</td>
Answer
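You can pass a comma-separated list of attribute selectors to a CSS select, using the ends-with operator ($=) so that each selector matches any style value that ends with the given color: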
[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']
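The selector list above does not have to be typed out by hand. A minimal sketch, assuming the colors are kept in a Python list: the helper name color_selector and the sample HTML are made up for illustration, and the selectors are narrowed to <a> tags to mirror the find_all('a', ...) calls in the question.

```python
from bs4 import BeautifulSoup

# Colors taken from the question; adjust as needed.
COLORS = ["red", "green", "blue", "purple", "orange"]


def color_selector(colors):
    """Build a CSS selector list matching <a> tags whose style ends with 'color: <name>'."""
    return ",".join(f"a[style$='color: {c}']" for c in colors)


# Hypothetical sample markup, shaped like the links in the question.
html = """
<span><a href="t1.html" style="font-weight: bold;color: blue">one</a></span>
<span><a href="t2.html" style="font-weight: bold;color: red">two</a></span>
<span><a href="t3.html" style="color: green">three</a></span>
<span><a href="t4.html">plain</a></span>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.select(color_selector(COLORS))   # matches come back in document order
titles = [a.get_text() for a in items]
print(titles)
```

Because $= only looks at the end of the attribute value, the 'color: green' selector matches both the bold and the non-bold green links, so the separate sixth line from the question is covered too.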
So,
items = soup.select("[style$='color: red'],[style$='color: green'],[style$='color: blue'],[style$='color: purple'],[style$='color: orange']")
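If you would rather stay with regex, as the question suggests, find_all also accepts a compiled pattern as an attribute value. The following is only a sketch under a few assumptions: the pattern, the COLOR_RE name, the grouping dictionary, and the sample HTML are illustrative and not from the original post. The alternation has to sit inside the group after "color:", otherwise "blue" on its own would match anywhere in the style.

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup

# 'color:' must start the style or follow a ';', so 'background-color' is not matched.
COLOR_RE = re.compile(r"(?:^|;)\s*color:\s*(red|blue|green|purple|orange)\s*;?\s*$")

# Hypothetical sample markup, shaped like the links in the question.
html = """
<a href="t1.html" style="font-weight: bold;color: blue">one</a>
<a href="t2.html" style="font-weight: bold;color: red">two</a>
<a href="t3.html" style="color: green">three</a>
<a href="t4.html" style="font-weight: bold;background-color: red">four</a>
"""

soup = BeautifulSoup(html, "html.parser")

# One pass over the document; group link texts by the captured color name.
by_color = defaultdict(list)
for a in soup.find_all("a", style=COLOR_RE):
    color = COLOR_RE.search(a["style"]).group(1)
    by_color[color].append(a.get_text())

print(dict(by_color))
```

This puts bold and non-bold green links in the same group. If they need to stay separate, as with the green and lime variables in the question, one option is to also check whether "font-weight: bold" appears in a["style"] when choosing the key.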