
Get rid of default text

I am trying to parse a user’s event descriptions after obtaining access to their Google Calendar through the Google Calendar API. When I input a description into my program, I want to get rid of default (and useless) text such as Zoom meeting invitations. Suppose the following is the description string:

<br>Hi, please keep this text.<br>

<br>Bob is inviting you to a scheduled Zoom meeting.<br>

<br>Topic: Bob's Personal Meeting Room<br>
<br>Join Zoom Meeting<br>
<a href="https://us04web.zoom.us/j/4487518794?pwd=SkdaTE9nV3E1M3FSaWlOHYvNGlndz09">https://us04web.zoom.us/j/4487518794?pwd=SkdaTE9nV3E1M3FSaWplOHYvNGlndz09</a><br>
<br>Meeting ID: 448 751 8794<br>
Password: 1F9W2P<br>

<br>Also not Zoom default text.

How can I parse it so that only “Hi, please keep this text. Also not Zoom default text” remains?


Answer

Methodology

I think this would be a good use for regular expressions (regex), a pattern-matching standard that lets you describe the general structure of a string. Regex is usually a poor fit for HTML and XML, since it is not designed to extract information from nested markup, but it should work if all you want to do is discard certain sections.

Explanation

If I understand correctly, you would like to be left with

<br>Hi, please keep this text.<br>

<br>Also not Zoom default text.<br>

Which means we need to come up with a pattern to match the following portion (the brackets indicate the information that will change every time):

<br>[Name] is inviting you to a scheduled Zoom meeting.<br>

<br>Topic: [Name]'s Personal Meeting Room<br>
<br>Join Zoom Meeting<br>
<a href="[Link]">[Link]</a><br>
<br>Meeting ID: [ID]<br>
Password: [Password]<br>

Important Pieces:

  • The beginning: [Name] will be some string of at least one character. To make sure you don’t also match <br>Hi, please keep this text.<br>, this part should match any run of characters that does not contain “<br>”. A character class such as [^(?:<br>)] does not do that: inside square brackets every character stands for itself, so that class rejects the single letters b and r and fails on a name like “Bob”. A negative lookahead does the job: (?:(?!<br>).)+? consumes one character at a time as long as “<br>” does not start at that position. The rest of the sentence should be matched word for word, so we’re not just matching anything.

  • The end: [Password], like [Name], is just (?:(?!<br>).)+? for the same reason.

  • This string starts and ends with “<br>”. This should be reflected in the regex.

  • Everything between that first sentence and the password portion has a format of its own, but it can be treated as a wildcard: some mix of at least one character or line break (represented in regex with (.|\n)+, or more simply with .+ when the pattern is compiled with the re.DOTALL flag).
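A quick way to see how a character class and a negative lookahead differ on a name like “Bob” is to run both against a short string. This is only a minimal sketch; the variable names and the sample string are made up for illustration.

```python
import re

# A character class negates single characters, not the sequence "<br>":
# this one rejects each of ( ? : < b r > ) individually.
char_class = re.compile(r"[^(?:<br>)]+")

# A negative lookahead tests for the whole sequence "<br>" before
# consuming each character.
not_br = re.compile(r"(?:(?!<br>).)+")

sample = "Bob<br>"

class_match = char_class.match(sample).group()   # stops at the lowercase "b"
lookahead_match = not_br.match(sample).group()   # stops at "<br>"

print(class_match)      # Bo
print(lookahead_match)  # Bob
```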

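Replacing all of the appropriate portions in the text, you get the following. Here (?:(?!<br>).)+? matches a run of characters containing no “<br>”, the period after “meeting” is escaped so that it matches a literal period, and the pattern needs the re.DOTALL flag so that .+? can span line breaks:

<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\..+?Password: (?:(?!<br>).)+?<br>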

Code

As for the Python, the re module will come in handy here as your regex aid. We want to save the above pattern into a variable and use it to cut the appropriate portion out of the string.

To “save” the pattern, the re module allows you to compile the regex into an object (the r before the string makes it a raw string, so backslashes reach the regex engine unchanged):

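import re
# re.DOTALL lets "." match line breaks, so the pattern can span the whole Zoom block
zoom_pattern = re.compile(r"<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\..+?Password: (?:(?!<br>).)+?<br>", re.DOTALL)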

The module also provides the ability to replace regex matches within strings, and we can replace our match with nothing to cut it out of the string:

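import re

s = " - string with zoom meeting stuff - "
# re.DOTALL lets "." match line breaks, so the pattern can span the whole Zoom block
zoom_pattern = re.compile(r"<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\..+?Password: (?:(?!<br>).)+?<br>", re.DOTALL)

# Replace every match with the empty string
clean_string = zoom_pattern.sub("", s)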

Since we compiled the pattern, you now have a reusable way to clean up your string!
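To end up with only the plain sentences from the question, the leftover <br> tags still need to go. Here is a minimal sketch, assuming the description always follows the Zoom layout shown above. It uses a negative lookahead plus re.DOTALL to find the Zoom block. The names clean_description, ZOOM_PATTERN and BR_TAG are illustrative choices, not part of any Google Calendar or Zoom API, and the link, meeting ID and password in the sample are placeholders.

```python
import re

# Zoom invitation block: from the "<br>" before "[Name] is inviting you ..."
# through the "<br>" after the password. re.DOTALL lets "." cross line breaks.
ZOOM_PATTERN = re.compile(
    r"<br>(?:(?!<br>).)+? is inviting you to a scheduled Zoom meeting\."
    r".+?Password: (?:(?!<br>).)+?<br>",
    re.DOTALL,
)
# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def clean_description(description: str) -> str:
    """Drop the Zoom block, then turn leftover <br> tags into single spaces."""
    without_zoom = ZOOM_PATTERN.sub("", description)
    without_tags = BR_TAG.sub(" ", without_zoom)
    # Collapse runs of whitespace (including newlines) left behind by the tags.
    return " ".join(without_tags.split())


# Sample modeled on the question; link, ID and password are placeholders.
description = """<br>Hi, please keep this text.<br>

<br>Bob is inviting you to a scheduled Zoom meeting.<br>

<br>Topic: Bob's Personal Meeting Room<br>
<br>Join Zoom Meeting<br>
<a href="https://example.invalid/j/0000000000">https://example.invalid/j/0000000000</a><br>
<br>Meeting ID: 000 000 0000<br>
Password: 000000<br>

<br>Also not Zoom default text."""

print(clean_description(description))
# Hi, please keep this text. Also not Zoom default text.
```

Descriptions without a Zoom block pass through with only the <br> tags removed, so the function should be safe to apply to every event.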

If you’d like to change your regex to match each individual thing, just adjust the “Important Pieces” from earlier to match your goal. If you want to test your ideas, an interactive regex tester is a wonderful resource!

User contributions licensed under: CC BY-SA