A Bug With Caching Python Virtualenvs on Self-Hosted GitHub Runners

If you are at a company that owns its own hardware for running self-hosted GitHub runners, you know this can save a significant amount of money. However, there are a fair number of nuances to watch out for. This post covers one of them: what can go wrong when restoring a Python virtualenv from cache.

Context

For context, the specific setup I’m talking about is running multiple self-hosted GitHub runners directly on a single host, not isolated from each other in any way besides being in their own runner directories (e.g., no VM per runner). This setup can be a nice performance win because it makes it easier to share cache and data.

As a brief sidebar, many people running this type of setup don’t use explicit caching at all. While actions/cache is common on GitHub-hosted runners, it is glacially slow on self-hosted runners, so much so that alternatives exist, e.g., buildjet/cache or whywaita/action-cache-s3, which do provide good network performance.

This story is about actually using one of those caches.

We turned on buildjet/cache and were impressed with the performance. But we then found all sorts of odd bugs in our workflows that felt like environment issues (e.g., ‘module not found’ on imports that should work).

The Issue

The root issue is that when tools (e.g., pytest) are installed by poetry/pip, the generated scripts start with a shebang that contains the full path to Python. More specifically, suppose you store your virtualenv in .venv; then every entry-point script in .venv/bin/ will have this shebang. (The activate script typically hard-codes the virtualenv's absolute path too, in its VIRTUAL_ENV variable rather than in a shebang, meaning this will mess up poetry shell if you use Poetry!) For example:

./.venv/bin/pytest
1:#!/data/github/actions-runner-6/_work/repo/.venv/bin/python

We see here that actions-runner-6 wrote the cache. When a different runner, e.g., actions-runner-1, later hits the cache and restores this virtualenv, invoking pytest will likely fail, since the script's shebang runs the Python from actions-runner-6's environment rather than its own.
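To check whether a restored virtualenv is affected, a small script can list the offending files. The following is a minimal sketch, not part of our actual workflow: the function names `read_shebang` and `find_stale_shebangs` are made up for illustration, and it assumes a POSIX-style layout with scripts in `.venv/bin/`. It only flags shebangs whose interpreter is named `python*`, so wrappers starting with `#!/bin/sh` are ignored.

```python
from __future__ import annotations

from pathlib import Path


def read_shebang(script: Path) -> str | None:
    """Return the interpreter path from a script's shebang, or None if absent."""
    with script.open("rb") as fh:
        first_line = fh.readline()
    if not first_line.startswith(b"#!"):
        return None
    tokens = first_line[2:].decode("utf-8", errors="replace").split()
    return tokens[0] if tokens else None


def find_stale_shebangs(venv_dir: Path) -> list[tuple[Path, str]]:
    """List scripts in venv_dir/bin whose Python shebang points outside venv_dir/bin."""
    venv_bin = (venv_dir / "bin").resolve()
    stale = []
    for script in sorted(venv_bin.iterdir()):
        if not script.is_file():
            continue
        interpreter = read_shebang(script)
        if interpreter is None:
            continue
        interpreter_path = Path(interpreter)
        # Only consider shebangs that name a Python interpreter.
        if not interpreter_path.name.startswith("python"):
            continue
        # A script restored from another runner's cache names that runner's
        # work directory, so the interpreter's parent is not this venv's bin/.
        if interpreter_path.parent != venv_bin:
            stale.append((script, interpreter))
    return stale
```

Calling `find_stale_shebangs(Path(".venv"))` from the repository root returns an empty list when every shebang already points at the current checkout. Note that the comparison uses the resolved path of `.venv/bin`, so a work directory reached through a symlink may produce false positives.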

A Hacky Solution

There might be a better approach (and I’d love to hear it!), but one hacky solution is to simply add a step in your workflow to rewrite those shebangs if the cache was hit. One way to do so (in a more readable way than awk/sed/etc.) is to use ripgrep and rep.

For example:

...
    - name: Load cached venv
      id: cached-poetry-deps
      uses: buildjet/cache@v4
      with:
        path: .venv
        key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}
    - name: Correct the .venv/bin paths
      if: steps.cached-poetry-deps.outputs.cache-hit == 'true'
      run: |
        rg "#!" .venv/bin/ --no-ignore -n | rep "#!.+\$" "#!$(pwd)/.venv/bin/python" -w
...
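If you would rather not install ripgrep and rep on the runners, the same rewrite can be sketched in Python, which is already available wherever a virtualenv is being restored. This is an illustrative alternative under a few assumptions, not the step we actually run: `rewrite_shebangs` is a hypothetical helper name, and unlike the rg/rep pipeline it only touches the first line of a file, and only when that line is a shebang naming a `python*` interpreter.

```python
from __future__ import annotations

from pathlib import Path


def rewrite_shebangs(venv_dir: Path) -> list[Path]:
    """Point every Python shebang in venv_dir/bin at venv_dir/bin/python.

    Returns the scripts that were modified.
    """
    venv_bin = (venv_dir / "bin").resolve()
    if not venv_bin.is_dir():
        return []
    new_shebang = f"#!{venv_bin / 'python'}".encode()
    changed = []
    for script in sorted(venv_bin.iterdir()):
        # Skip directories and symlinks such as bin/python itself.
        if script.is_symlink() or not script.is_file():
            continue
        data = script.read_bytes()
        first_line, newline, rest = data.partition(b"\n")
        if not first_line.startswith(b"#!"):
            continue
        # Only rewrite shebangs whose interpreter is named python*.
        interpreter = first_line[2:].strip().split(b" ")[0]
        if not interpreter.rsplit(b"/", 1)[-1].startswith(b"python"):
            continue
        if first_line == new_shebang:
            continue  # already points at this checkout
        # Rewriting in place keeps the file's executable bit.
        script.write_bytes(new_shebang + newline + rest)
        changed.append(script)
    return changed
```

Saved to a file in the repository and called as `rewrite_shebangs(Path(".venv"))` from the same cache-hit step, this makes the restored scripts point at the current runner's work directory. Running it a second time changes nothing, so it is safe to call unconditionally.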
