I am very new to this concept, but I am trying to learn how to use Python to manipulate HTML data. I wrote a Python (ver. 3.4.1) script that fetches a URL and returns some information, which I parse using BeautifulSoup (ver. 4). In this example, I am attempting to obtain the price of the Xbox One. I chose this
Tag: beautifulsoup
Extracting url from style: background-url: with beautifulsoup and without regex?
I have: I want to get the URL, but I don’t know how to do that without using regex. Is it even possible? So far my solution with regex is: Answer You could try using the cssutils package. Something like this should work: Although you are ultimately going to need to parse out the actual URL, this method
Beautiful Soup 4: Remove comment tag and its content
The page that I’m scraping contains this HTML. How do I remove the comment tag <!-- --> along with its content with bs4? Answer You can use extract() (solution is based on this answer): PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted. As a result you get your div
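The answer's own code is not reproduced in this excerpt. A minimal sketch of the extract() approach, assuming bs4 is installed and using made-up markup in place of the page from the question, might look like this:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup standing in for the page in the question.
html = '<div class="foo">cat dog sheep goat<!-- <p>hidden</p> --></div>'

soup = BeautifulSoup(html, "html.parser")

# Comment is a NavigableString subclass, so a string filter can select
# every comment node in the tree.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()  # detach the comment (and its content) from the tree

cleaned = str(soup)
print(cleaned)
```

Because extract() returns the removed node, the comments could also be collected into a list instead of being discarded.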
‘Show more results’ while scraping mobile details from flipkart
My question is the same as Scraping all mobiles of Flipkart.com. I tried the solution given there, but changing the start variable does not work, and I can only scrape the first twenty mobiles. The initial value of start was 21, so I increased it to 50, but I still get the same result. Answer There
Beautifulsoup sibling structure with br tags
I’m trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Let me just give you an example. Input HTML: HTML that BeautifulSoup interprets: In the source, the spans could be considered siblings. After parsing (using the default parser), the spans are suddenly no longer siblings, as the br tags
How to read a whole file in Python? To work universally in command line
How to read a whole file in Python? I would like my script to work however it is called:

    script.py log.txt
    script.py < log2.txt
    python script.py < log2.txt
    python -i script.py logs/yesterday.txt

You get the idea. I tried But I get Answer Instead of using fileinput, open the file directly yourself:
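The answer's code is cut off in this excerpt. One way to sketch the idea, assuming the file name (if any) is the first command-line argument and standard input is the fallback, is shown below. The helper name read_all and its parameters are illustrative, not taken from the original answer.

```python
import sys


def read_all(argv=None, stdin=None):
    """Return the whole input as a single string.

    Reads the file named by the first command-line argument when one is
    given; otherwise falls back to standard input (redirected or piped).
    """
    argv = sys.argv if argv is None else argv
    stdin = sys.stdin if stdin is None else stdin
    if len(argv) > 1:
        # A path was passed on the command line: open it directly.
        with open(argv[1]) as handle:
            return handle.read()
    # No path given: consume everything from standard input.
    return stdin.read()
```

Calling read_all() with no arguments in a script would then cover both script.py log.txt and script.py < log2.txt.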
Don’t put html, head and body tags automatically, beautifulsoup
I’m using beautifulsoup with html5lib, and it adds the html, head and body tags automatically: Is there any option I can set to turn off this behavior? Answer This parses the HTML with Python’s built-in HTML parser. Quoting the docs: Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml,
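The code the answer refers to is missing from this excerpt. A minimal sketch, assuming bs4 is installed and using a made-up fragment, selects the built-in parser by name:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the question's actual input is not shown.
fragment = "<h1>FOO</h1>"

# "html.parser" is Python's built-in parser. Unlike html5lib, it does not
# wrap the fragment in <html>, <head> and <body> tags.
soup = BeautifulSoup(fragment, "html.parser")

result = str(soup)
print(result)
```

The trade-off is that html.parser does less repair work on malformed markup than html5lib, so the resulting tree can differ for broken input.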
Extracting XML Attributes
I have an XML file with several thousand records in it in the form of: How can I convert this into a CSV or tab-delimited file? I know I can hard-code it in Python using re.compile() statements, but there has to be something easier and more portable across different XML file layouts. I’ve found a couple of threads here about attribs,
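The record layout is not shown in this excerpt, so the following is only a sketch under assumptions: each record is taken to be an element whose data lives in its attributes, and the element and attribute names are invented. It uses the standard library's xml.etree.ElementTree and csv modules rather than regular expressions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records; the question's real layout is not shown.
xml_text = """<records>
  <record id="1" name="alpha" score="10"/>
  <record id="2" name="beta"/>
</records>"""


def attributes_to_csv(xml_source, tag):
    """Turn the attributes of every <tag> element into CSV text."""
    rows = [dict(el.attrib) for el in ET.fromstring(xml_source).iter(tag)]
    # Collect column names in first-seen order, so nothing about the
    # layout is hard-coded.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # missing attributes become empty cells
    return out.getvalue()


csv_text = attributes_to_csv(xml_text, "record")
print(csv_text)
```

Passing delimiter="\t" to csv.DictWriter would produce tab-delimited output instead.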
How to find tag with particular text with Beautiful Soup?
How to find text I am looking for in the following HTML (line breaks marked with \n)? The code below returns the first found value, so I need to filter by “Fixed text:” somehow. UPDATE: If I use the following code: then it returns just Fixed text:, not the <strong>-highlighted text in that same element. Answer You can pass a regular
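The answer is truncated here, but it appears to suggest matching the label with a regular expression. A minimal sketch of that idea, assuming bs4 is installed and using invented markup in place of the question's HTML, finds the label string and then steps up to its parent element:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; the question's actual HTML is not shown.
html = (
    "<div>"
    "<p>Other: <strong>skip me</strong></p>"
    "<p>Fixed text: <strong>text I am looking for</strong></p>"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")

# find(string=...) returns the matching text node, not the tag around it.
label = soup.find(string=re.compile("Fixed text:"))

# Step up to the enclosing element, then down to the highlighted part.
value = label.parent.find("strong").get_text()
print(value)
```

This also explains the behaviour described in the update: the match is only the text node, so the surrounding element has to be reached through .parent.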
Extracting an attribute value with beautifulsoup
I am trying to extract the content of a single “value” attribute in a specific “input” tag on a webpage. I use the following code: I get TypeError: list indices must be integers, not str. Even though, from the BeautifulSoup documentation, I understand that strings should not be a problem here… but I am no specialist, and I may have
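The question's code is not shown, so the cause can only be guessed. One common way to get this TypeError is indexing the list-like result of find_all() with an attribute name. A minimal sketch under that assumption, with an invented input tag, is:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the tag and attribute values are made up.
html = '<form><input name="token" type="hidden" value="abc123"/></form>'

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, so matches["value"] would
# raise "TypeError: list indices must be integers ...".
matches = soup.find_all("input", attrs={"name": "token"})

# find() returns a single Tag, which can be indexed by attribute name.
tag = soup.find("input", attrs={"name": "token"})
value = tag["value"]
print(value)
```

Indexing one element of the list first, as in matches[0]["value"], would work equally well.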