Logging S3 Download Progress with Python and Boto3

Logging S3 Download Progress with Python and Boto3

This post demonstrates how to log the download progress of an S3 object at self-defined intervals using Python's built-in logger and without any additional third-party libraries (besides boto3).

At the time of this writing I'm using boto3 version 1.18.2 and Python version 3.9.1. The full source code for this example can be found in the following GitHub repository: boto3-download-progress-example

Introduction

At work I've been refactoring some legacy Python code that involves downloading video files from Amazon Simple Storage Service.  One of the goals of this refactor was to add a significant amount of logging so that this service would be easier to debug.

This included logging the progress of S3 file downloads since these files were usually pretty large and I wanted to make sure I could figure out how long they were taking to download, if/when the download stalled, etc.

When I searched for a solution online, I came across the boto3 documentation which only had an example for uploads (and actually contains an error). I also found a few StackOverflow answers that were helpful (such as this answer) but the problem was that it displayed the output as a loading bar on a single line, which wouldn't be compatible with the SaaS platform that this service sends logs to (Datadog in this case).

The requirements I set for my ideal solution were:

  • each log must appear on its own line
  • the percentage intervals to log are easily configurable
  • no additional third-party libraries necessary

Since none of the solutions I found covered all of these bases, I decided to write this post.

Environment Setup

In order to properly test this app, I've uploaded a ~200MB sample video file named example-file.mp4 to an S3 Bucket named logging-example. Although 200MB isn't that large for a video file, it's large enough to demonstrate the desired outcome.

The Amazon S3 dashboard.

I've also installed the boto3 Python library and configured it by exporting my AWS security credentials, which can be done using the following commands:

pip install -Iv boto3==1.18.2
export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

The App

This section covers the basic outline of the app before the S3 download progress logic is added. Create a new file called example.py and insert the following code into it:

import logging
import os

import boto3


s3 = boto3.resource(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

def download_file_from_s3(bucket, key, download_path):
    remote_file = s3.Bucket(bucket).Object(key)
    logging.info(f"Starting download for '{remote_file.key}'")
    remote_file.download_file(download_path)
    logging.info(f"Finished download for '{remote_file.key}'")


if __name__ == "__main__":
    logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
    download_file_from_s3('logging-example', 'example-file.mp4', '/tmp/downloaded-file.mp4')

The top of the file imports the necessary modules and then instantiates an s3 resource using boto3. There are several ways to authentic with boto3 (see their configuring credentials documentation) but the quick and easy way for this example is to use the environment variables set in the previous section.

Next, the download_file_from_s3 function is defined which takes as parameters an S3 Bucket name, a file name, and a download path.

Hey, Kid!

The end of the file checks if this file is being run as a standalone Python executable. If so, it configures a basic logger at the INFO level (see the logging documentation) and calls the download_file_from_s3 function, passing in a hard-coded bucket name, file name, and download destination. These names must match the bucket and file names chosen in the 'Environment Setup' section.

Run this app from the command line using python example.py. The output should look as such:

(venv) christopherhranj@~/Documents/boto3-download-progress-example (main)$ python example.py 
INFO: Starting download for 'example-file.mp4'
INFO: Finished download for 'example-file.mp4'

This is perfectly functional, but the app hangs for a minute while the large file is downloaded and there's no way to tell if download progress has stalled or how far along it is. This can be improved.

Adding the Download Progress Logger

In the same file, add the following class below the imports:

class S3DownloadLogger(object):
    def __init__(self, file_size, filename):
        self._filename = filename
        self._size = file_size
        self._seen_so_far = 0
        self._seen_percentages = dict.fromkeys([25, 50, 75], False)  # Define intervals here.

    def __call__(self, bytes_amount):
        self._seen_so_far += bytes_amount
        percentage = round((self._seen_so_far / self._size) * 100)
        if percentage in self._seen_percentages.keys() and not self._seen_percentages[percentage]:
            self._seen_percentages[percentage] = True
            logging.info(f"Download progress for '{self._filename}': {percentage}%")

This class is mostly straight-foward, but there is one important instance variable worth covering. self._seen_percentages is a dictionary that specifies the percentage intervals to log at. This is structured as a dictionary that defaults to False because, depending on the size of the file being downloaded, boto3's Callback may run more than once per every 1% downloaded. Without a boolean to keep track of the percentages that have already been seen, the same percentage could get logged multiple times. If self._seen_percentages was simply a list, the output for this example would look as such:

(venv) christopherhranj@~/Documents/boto3-download-progress-example (main)$ python example.py 
INFO: Starting download for 'example-file.mp4'
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 50%
...

The last step is to update the download_file_from_s3 to instantiate an S3DownloadLogger object and pass it into boto3's download_file method as the Callback argument. Replace the entire function with the updated snippet below:

def download_file_from_s3(bucket, key, download_path):
    remote_file = s3.Bucket(bucket).Object(key)
    logging.info(f"Starting download for '{remote_file.key}'")
    download_logger = S3DownloadLogger(remote_file.content_length, remote_file.key)
    remote_file.download_file(download_path, Callback=download_logger)
    logging.info(f"Finished download for '{remote_file.key}'")

Don't forget to save after updating the file.

Testing

It's time to test the updated code. Run the script again using python example.py and it should produce the following output:

(venv) christopherhranj@~/Documents/boto3-logging-example$ python example.py 
INFO: Starting download for 'example-file.mp4'
INFO: Download progress for 'example-file.mp4': 25%
INFO: Download progress for 'example-file.mp4': 50%
INFO: Download progress for 'example-file.mp4': 75%
INFO: Finished download for 'example-file.mp4'

Wrapping Up

That's all the code you need. I hope this post was helpful and easy to follow. If you have any questions or feedback please reach out to me @brodan_. Thank you!

Original Log Commercial | The Ren & Stimpy Show