The Crystal Programming Language Forum

How I migrated Athena to a Monorepo...and you can too

Athena has a fairly large ecosystem, with 9 different components at the time of writing, a number that will only increase over time. Each of these components can be used independently, and as such needs to be its own shard with its own version. As more components are added to the ecosystem, the overhead grows quickly with regard to cross-component dependency management, keeping development dependencies and CI scripts in sync, and time spent jumping between repositories during development. It is for these reasons that I decided to migrate Athena to a monorepo.

Due to Shards’ “1 repository per shard” requirement, this is a bit easier said than done. The few workarounds that do exist, mentioned in part in this PR, such as dedicated branches, Git submodules, or a singleton monorepo (a monorepo with multiple entry points), are still less than ideal as they fight against Git itself just to get things working with Shards.

A Solution

However, I recently watched a talk in which developers from the Symfony project dealt with the same challenges: 1) wanting to use a single repository for development with many independent components, and 2) the Composer package manager also requiring 1 repository per package. The tl;dr of the talk was that they use a Git feature called subtrees, which is an alternative to submodules that essentially allows nesting one repository inside of another, supporting both push and pull workflows. A key point in what they did is that they did not try to make the monorepo replace the many repositories for each component, but instead made it the source of truth for the components.

The push feature of Git subtrees is the key part of this. The idea is that you have a monorepo that contains one or more subtreed repositories within it. When a commit is pushed or a PR is merged, you run a git subtree push for each child repository that will push the changes from the monorepo to the child read-only repository. This way you can reap the benefits of a monorepo, while still having the flexibility of the many repositories.

In the end I created crystal-manyrepos/root, an example repo/org that demonstrates monorepo-to-manyrepo syncing. It shows what things would ultimately look like, and documents my steps, experiences, and findings during the migration process. I am also open to PRs to improve the documentation, or for updates to improve the overall UX of the process. Otherwise feel free to continue on to see how I handled the Athena migration.

Shout out to @watzon for the initial spark that set all of this in motion.

The Migration

Going into the migration I had a few requirements:

  1. Use the athena repository as the monorepo such that it retains its star count and links are still valid.
    1. This makes the most sense since Athena represents the ecosystem, while Athena Framework is the framework you use, created from the various components. If only I could get the athena GitHub name…
  2. Retain individual commit history from subtreed components.
    1. This ended up not being practical. It is technically possible, but would require rewriting the entire history of each sub component, which would break every release since the commit hashes would change. This is one of the gotchas I pointed out in the example repository. If you want to view the full history, you can always just go look at the child repository, as it has the commits from both before and after being added to the monorepo.
  3. Be backwards compatible with existing installations.
    1. I don’t really want to break those currently using the framework, as it would be a global breakage and not something semver could protect against.

Having this plan in mind, I set off to start the migration process.

Setting up the Monorepo

Before anything else could advance, I needed to move the current framework code that resided in the athena repository into one of its own. I ended up going with framework to keep the naming scheme consistent. This new repository is a 1:1 clone of the old one created via mirroring it. It has the same commit hashes, the same tags, the same branches, etc. The tags on the athena repo are retained for backwards compatibility, but will NOT include new tags going forward. As such users will need to update their Athena dependency to github: athena-framework/framework, with everything else being the same.

Now that we have the child repository backup created for the framework, I moved onto preparing the athena repository to become the monorepo, starting by checking out a new branch. I then essentially deleted almost everything within it. In the end its structure now looks like:

.
├── .editorconfig
├── .github
│   └── workflows
│       ├── ci.yml
│       └── sync.yml
├── .gitignore
├── LICENSE
├── README.md
├── scripts
│   └── sync.sh
├── shard.dev.yml
├── shard.yml
└── src
    ├── components
    └── website

The src/ directory also changes its structure entirely, since it needs to accommodate each component as well as leave room for more “meta” repositories, such as the Athena website.

Subtreeing Components

At this point the monorepo is ready to have the components added to it. I wrote a quick shell script to make this process a bit easier:

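# Squash-add a child repository under src/components/<name>.
# $1: component directory name, $2: child repository URL.
function migrate()
{
  git subtree add --squash --message="Add the $1 component" --prefix="src/components/$1" "$2" master
}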

migrate "config" git@github.com:athena-framework/config.git
migrate "console" git@github.com:athena-framework/console.git
migrate "dependency_injection" git@github.com:athena-framework/dependency-injection.git
migrate "event_dispatcher" git@github.com:athena-framework/event-dispatcher.git
migrate "framework" git@github.com:athena-framework/framework.git
migrate "negotiation" git@github.com:athena-framework/negotiation.git
migrate "serializer" git@github.com:athena-framework/serializer.git
migrate "spec" git@github.com:athena-framework/spec.git
migrate "validator" git@github.com:athena-framework/validator.git

git subtree add --squash --message="Add the website" --prefix="src/website" git@github.com:athena-framework/website.git master

This script defines a function that adds the provided Git repository as a squashed subtree from its master branch. The code lands under the src/components/ directory, with a nicer commit message saying which component is being added. I then did another one-off add for the website, given it goes in a different directory.

While this repository is not intended to be used as a shard, it does still have a shard.yml that defines dependencies on each component, as well as ameba as a development dependency. The main reason for this is so that we can leverage Shards overrides via shard.dev.yml. This file defines the same dependencies, but uses the path install method, pointing to the related subdirectory under src/components. The reasoning behind this is to make testing easier, as well as to remove the need to run shards install on each component.
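The path overrides described above could be generated rather than written by hand. This is a minimal sketch, not the repository's actual file: it assumes every component keeps a shard.yml with a top-level `name:` line under src/components/<dir>/, and the function name `generate_overrides` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print an override file with a `path` entry for every component.
# Assumption: components live at <components_dir>/<dir>/shard.yml.

generate_overrides() {
  local components_dir="${1:-src/components}"
  local shard_file name

  echo "dependencies:"
  for shard_file in "$components_dir"/*/shard.yml; do
    # Skip the unexpanded glob when no component exists.
    [ -f "$shard_file" ] || continue
    # Use the first top-level `name:` entry as the shard name.
    name="$(sed -n 's/^name:[[:space:]]*//p' "$shard_file" | head -n 1)"
    [ -n "$name" ] || continue
    echo "  $name:"
    echo "    path: ./$(dirname "$shard_file")"
  done
}
```

Running `generate_overrides > shard.dev.yml` from the repository root and then `SHARDS_OVERRIDE=shard.dev.yml shards install` is one way to point Shards at the result; whether the Athena workflow selects the file this way is an assumption.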

Setting up CI

CI needs to be set up a bit differently than before, given there is now more than one shard whose specs may need to be executed. I say may because there aren’t (currently?) any integration tests between them. As such, each component can have its tests executed only if changes are made within that shard. Other parts of CI, such as Ameba or the formatter, can just run against everything as they are performant enough.

Because the specs are no longer being executed from the root of the project, I needed to go through and add some __DIR__ constants such that file paths to fixture files are resolved correctly. But let me tell you, it was a joy not needing to bounce between different repositories, branches, and dependency updates while doing it!

I ended up with a pretty simple shell script to start off with that iterates over the component directories, checks if any changes happened within that component, and if so, runs that component’s specs:

#!/usr/bin/env bash

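# Find every directory under src/components/ that contains a shard.yml.
for component in $(find src/components/ -maxdepth 2 -type f -name shard.yml | xargs -I{} dirname {} | sort); do
  # git diff --quiet exits with 1 when the component has changes.
  git diff --quiet --exit-code $BASE_SHA $GITHUB_SHA -- "$component"
  HAS_COMPONENT_CHANGED=$?
  # Outside of pull requests (e.g. scheduled runs), run every component's specs.
  if [[ $GITHUB_EVENT_NAME != 'pull_request' || $HAS_COMPONENT_CHANGED == 1 ]]; then
    echo "::group::$component"
    crystal spec "$component/spec" --order random --error-on-warnings --exclude-warnings "$component/spec" || exit 1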
    echo "::endgroup::"
  fi
done

I did need to pass the BASE_SHA ENV var in to the script. The reason is that we need to diff the base branch against the latest commit of the PR, since a standard git diff wouldn’t work given the changes are already committed. You can check out .github/workflows/ci.yml for the full workflow I’m using. This workflow runs on PRs opened into master, as well as nightly to find any regressions in Crystal or upstream dependencies. The check is also set up to run all the specs in the context of non pull_request events, i.e. scheduled runs.
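An alternative to diffing once per component is to list the changed paths once and reduce them to component names. The following is a sketch of that idea, not what the script above does; the function name `changed_components` is hypothetical, and it assumes components sit directly under src/components/.

```shell
#!/usr/bin/env bash
# Sketch: read changed file paths on stdin and print each affected
# component directory name once. Paths outside src/components/ are ignored.

changed_components() {
  sed -n 's|^src/components/\([^/]*\)/.*|\1|p' | sort -u
}
```

In CI this could be fed with `git diff --name-only $BASE_SHA $GITHUB_SHA | changed_components`, and the same list could be reused to decide which components need syncing.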

Once a commit makes it into the master branch, we need to sync those changes to the child repositories. I set this up by creating another workflow that runs on pushes to the master branch, as we can be assured changes are valid if they made it that far. This workflow runs the scripts/sync.sh script, which is based on a similar script used by the Laravel project. It essentially does the following for each component:

  1. Adds it as a remote.
  2. Fetches commits from the component’s child repository.
    1. Since the components were added with --squash, this is required because Git needs the child repository’s commits in order to know how to split the components. Without this, Git would be missing the root commit and error trying to rebuild the history.
  3. Subtree pushes the changes to the child repository.
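The three steps above can be sketched as follows. This is not the actual scripts/sync.sh: the `run` and `sync_component` helpers are hypothetical names, the DRY_RUN switch is an addition for illustration, and authentication is left out entirely. By default the commands are only printed, so the sketch can be tried without touching any remote.

```shell
#!/usr/bin/env bash
# Sketch of the per-component sync. Commands are printed unless DRY_RUN=0.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: component directory under src/components/
# $2: URL of the read-only child repository
# $3: branch to push to (defaults to master)
sync_component() {
  local dir="$1" repo_url="$2" branch="${3:-master}"

  # 1. Register the child repository as a remote named after the directory.
  run git remote add "$dir" "$repo_url"
  # 2. Fetch its commits so the squashed subtree history can be resolved.
  run git fetch "$dir"
  # 3. Split out the component's directory and push it to the child.
  run git subtree push --prefix="src/components/$dir" "$dir" "$branch"
}
```

Taking the directory and the URL as separate arguments matters here because they do not always match, e.g. the dependency_injection directory maps to the dependency-injection repository.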

The sync script does make use of a Personal Access Token (PAT) from a Machine User that is part of the Athena organization, but only has write access to the child repositories. This is safer than using a PAT from a real account, as if something were to ever be leaked, the impact would be limited.

Finalizing the Components

At this point the monorepo is scaffolded, all the components are added, and CI is set up again. The last thing to do is go through each child repository and do some cleanup, both within the code itself and their GitHub settings. I ended up doing the following:

  1. Removed CI workflows.
  2. Updated README.md to point back to the monorepo.
  3. Disabled issues, moving any existing ones to the monorepo.
  4. Updated branch protection rules to remove required status checks and restrict pushing to master to the CI organization team.
  5. Disabled Actions.
  6. Deleted any leftover Secrets/Pages configurations.
  7. Deleted existing labels.
  8. Disabled/checked various other GitHub settings for each repository.

The first four are the most important/helpful; the rest I kinda just did for security/minimalistic reasons given these are now meant to be read-only, so I can really trim down permissions and such to match. In addition to this, I also made some changes to the monorepo itself:

  1. Revamped its labels.
  2. Audited its settings as well.
  3. Revamped its README.md.

Ultimately I think there is quite a bit more that could be done to make the DX that much better, especially since there is now a single centralized repository versus many disparate ones. Some things that come to mind include:

  1. Enabling GitHub Discussions.
  2. Migrating to GitHub Issues for feature tracking/planning.
  3. Making some automated process to comment on/move PRs made in child repositories to the monorepo.
    1. E.g. symfony/mailer Pull Request #5 (Add Support for email-validator v3 by ben29).
  4. Scaffolding out better issue/PR templates.
  5. Only syncing components that have changes, like what was done for the specs.

The resulting PR: athena-framework/athena Pull Request #126 (Monorepo by Blacksmoke16).

The Result

At this point I’m kinda scared to merge it. I normally squash merge everything into master to preserve the linear history. However, you CANNOT do that when adding in subtreed components, since some text within each commit is used to rebuild the history. I had to update the protected branch settings to allow me to override this, which I put back after clicking the button.
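The text in question is the pair of trailers that git subtree writes into its squash commits, git-subtree-dir and git-subtree-split, which later splits and pushes look up. A small sketch for pulling them out of a commit message follows; the helper name `subtree_trailer` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the value of a git-subtree trailer from a commit message.
# $1: trailer key, e.g. git-subtree-dir or git-subtree-split.
# The commit message is read from stdin.

subtree_trailer() {
  sed -n "s/^$1:[[:space:]]*//p"
}
```

For example, `git log -1 --format=%B <commit> | subtree_trailer git-subtree-split` prints the child repository commit that a squash commit was created from. A squash merge of the whole PR collapses those commits into one, so the trailers no longer sit on their own commits.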

Upon merging, the sync job kicked off as expected (see athena-framework/athena@85335f8, “Migrate Athena to a monorepo”). The first attempt did fail because I messed up permissions on one of the repositories, but once that was resolved it worked like a charm. In the end it took ~30s to sync everything, traversing 129 commits. As mentioned in the gotchas in the demo repository, the sync time will keep going up as the number of commits grows.

I set up the script to only sync components that had changed in order to help with that. There are also various projects that may be worth looking into. One example is splitsh, which is the open source version of the library Symfony used in their talk. However, I ran into an issue with it where the commit hashes are different between it and git subtree split. Not sure if it’s a bug or if I’m just misunderstanding how it works. Either way, I have some time to figure out what to do when/if it becomes more of an issue.

And there you have it! I hope this can help the migration of your projects so you too can enjoy the benefits of a monorepo. As usual feel free to join me in Athena’s Gitter channel if you have any suggestions, questions, or ideas. I’m also available on Discord (Blacksmoke16#0016) or via Email.

P.S. Athena now has a Discord server!


Update: I realized the git diff logic was flawed as the files would already be committed when the action runs. I updated it to leverage some GHA context values. I also added diff support to the sync script, which should help with sync times by only doing those that need it.

You can check out athena-framework/athena Pull Request #127 (Improve CI scripts by Blacksmoke16) for the full diff.

This is great. The problem of one:one shard:repo bugs me and I wish it didn’t require this much overhead to run a monorepo. A lot of the bigger Crystal projects end up managing a dozen shards at least and the release burden is significant — to release any one breaking change in a shard requires tagging and releasing shards everywhere.
