Python Automation Saves my Butt

in #python, last year

python-automation.jpg

There is an oft-told story about programmers: we will spend days programming a tool or utility to save ourselves an hour of manual labor.

It is true!

The thing that story won't reveal is that, when it comes down to it, the practice does come in very handy. Especially when you mess up like I did.

My Mistake

Once per month, I get sent some statistics from an extremely senior colleague which I dutifully copy and paste into a spreadsheet.

I could have automated this; in fact, I researched doing so. But instead, each month I take the manual approach.

Or at least I did until things got busy, and people stopped asking for the data.

I am sure you can see where this is headed.

Yep, people started asking for the data again. Which I had not been inputting.

Enter Python!

Reading Gmail with Python

As mentioned, I did look into reading the data directly from my email which is hosted on Gmail.

To read these messages using Python, we use the Gmail API. The Gmail API allows you to interact with Gmail in a programmatic way, exactly what we need, but I decided against opening up my work email to a lazy-ass script.

For your reference though here is how it works ...

How to Read Gmail with Gmail API and Python

First, you will need to set up a project in the Google Cloud Console and enable the Gmail API.

Next, install the Google client libraries for Python:
pip3 install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

We need to use OAuth 2.0 to authenticate our app and authorize access to the Gmail account.

Once authenticated, call the users().threads().list() method to retrieve a list of threads in the mailbox:

  • Iterate through the list of threads, and
  • for each thread, call the users().threads().get() method to retrieve the full thread.

The get() method returns a thread resource, which contains a list of message objects under its messages key. You can iterate through that list and process each message's headers and body as needed.

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Load previously authorized user credentials (saved to token.json by the
# OAuth flow) to authorize access to the Gmail API
creds = Credentials.from_authorized_user_file('token.json')

# Create a Gmail API service client
service = build('gmail', 'v1', credentials=creds)

# Call the Gmail API's users().threads().list() method to retrieve
# a list of threads in the inbox
response = service.users().threads().list(userId='me', q='label:inbox').execute()
threads = response.get('threads', [])

# Print the subject of each thread
for thread in threads:
    tdata = service.users().threads().get(userId='me', id=thread['id']).execute()
    msg = tdata['messages'][0]['payload']
    subject = ''
    for header in msg['headers']:
        if header['name'] == 'Subject':
            subject = header['value']
            break
    print(f'{thread["id"]}: {subject}')

Actual Python Solution (Read HTML)

So that would work, but I would likely not be popular if I opened up my work email to the API. I considered forwarding the messages, which made me realize there was a quicker solution ...

Save the thread as HTML!

All of the data was sent as replies to a single thread so I simply went ahead as if I wanted to print, but saved the HTML of the page instead.

Once I have the data wrapped in HTML it is a simple task to read the text, but even more convenient, we have the BeautifulSoup library to parse and interpret HTML files elegantly in Python.

from bs4 import BeautifulSoup

# Open the HTML file
with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Print the title of the HTML document
print(soup.title)

# Print the first paragraph
print(soup.p)

# Find all the links in the document
for link in soup.find_all("a"):
    print(link)

If you have an HTML string in Python and you want to parse it using BeautifulSoup, you can pass the string to the BeautifulSoup constructor like this:

from bs4 import BeautifulSoup

# The HTML string
html_string = "<html><head><title>My Title</title></head><body><p>Hello, World!</p></body></html>"

# Parse the HTML string
soup = BeautifulSoup(html_string, "html.parser")

# Print the title of the HTML document
print(soup.title)

# Print the first paragraph
print(soup.p)

While BS4 can "prettify" HTML, I want to read the numbers out of the data, so instead I want to strip out the HTML!

When extracting text, BeautifulSoup discards structural markup such as tags, attributes, and comments, including inline formatting tags like <b> and <i>.

To remove HTML formatting, you can use the get_text() method of the BeautifulSoup object. This method returns the text content of the HTML document, stripped of all HTML tags and formatting.

Here's an example of how you can use get_text():

from bs4 import BeautifulSoup

# The HTML string
html_string = "<html><head><title>My Title</title></head><body><p>Hello, <b>World</b>!</p></body></html>"

# Parse the HTML string
soup = BeautifulSoup(html_string, "html.parser")

# Print the text content of the HTML document
print(soup.get_text())

You can also use the stripped_strings generator to iterate over the text content of the HTML document, with leading and trailing white space stripped from each string:

for string in soup.stripped_strings:
    print(repr(string))

The repr() function is a built-in Python function that returns a string representation of an object. It's handy for debugging and logging because it makes quoting and invisible whitespace explicit; for simple types, the resulting string is usually a valid Python expression that can recreate the object.
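For instance, here is a quick sketch of why repr() is useful when inspecting stripped strings (the sample strings are illustrative):

```python
for s in ["  Hello  ", "World\n"]:
    stripped = s.strip()
    # repr() adds quotes, so any stray whitespace would be easy to spot
    print(repr(stripped))  # 'Hello' then 'World'
```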

Once I have stripped text I can look for the key part of the messages, a list of lines starting with '-n '.

string = "Some intro text\n-n stat one\n-n stat two\nClosing remarks"
for line in string.splitlines():
    if line.startswith("-n "):
        print(f"Data line: {line}")

Now it is a simple job to output those lines, without the superfluous formatting or words, separated by commas, ready for loading into a Google Sheets spreadsheet!
