Buildbot proposal: combine portwatcher and portbuilder

Sun Mar 11 09:25:33 UTC 2018

The current buildbot setup has a number of problems that I believe could be solved by combining the currently separate portwatcher and portbuilder schedulers into a single ports scheduler.

I am not suggesting that we return to the behavior of the ports scheduler on the old macOS forge buildbot system in which a single build would build all the specified ports. We will keep the current method of building only one port (and its dependencies) per build.

The problems I want to solve are the following:

1. Currently, portwatcher is responsible for updating a copy of mpbb, MacPorts base and a ports tree that it shares with portbuilder. Having portbuilder maintain its own copy would waste a lot of time. If someone makes one commit that changes 100 ports and then no further commits occur for hours, we only want to update mpbb, MacPorts base and the ports tree once, not 100 times. But the fact that it's shared means that portwatcher must (and is configured to) wait for all triggered portbuilder builds to finish before it processes the next commit. This works fine, unless the buildmaster is stopped while portwatcher builds are pending. This has happened several times when the servers lost power during a power outage. (The servers are on a UPS, but the UPS does not provide as much instantaneous power as I expected, so if the servers are busy building, they draw more power than the UPS can instantaneously provide and the servers shut down immediately. I might remove the buildworker machines from the UPS and leave only the buildmaster, modem and router on it.) When buildmaster comes back online, it sees the portbuilder build that was in progress and starts it again, but it also sees the portwatcher build that was in progress and starts it again. Now we have a portwatcher running (updating mpbb, updating MacPorts, updating the ports tree, and updating its portindex) while portbuilder is trying to install a port. The portbuilder can fail if it is trying to install ports at the moment that portwatcher is updating the index (see https://trac.macports.org/ticket/53587).

2. If a single portwatcher build "X" triggers many portbuilder builds, and while those portbuilder builds are in progress another commit comes in that would affect those ports, it don't notice until all portbuilder builds triggered by "X" are finished. This can waste time building ports that are already superseded by newer versions or revisions. An extreme example of this is if we were to force a portwatcher build for all ports (which we might want to do when a new version of macOS is released). mpbb, MacPorts base and the ports tree would be updated once, and then it would schedule a portbuilder for each port that had not yet been built. Building all ports will take weeks. During that time, a commit may come through that updates a port to a new version. But if the build of the old version of the port was still pending at that time, the buildbot will build the old version, because it can't update the ports tree until the current portwatcher build is done waiting for its triggered portbuilder builds.

3. When there are portwatcher builds pending, we have no idea how many portbuilder builds are pending. It may say there are e.g. 3 portbuilder builds pending, but the pending portwatchers could trigger any number of additional portbuilders.

An objection to this proposal was that buildbot 0.8 does not have the capability to dynamically create scheduler steps at runtime. But that's not required and that's not what I'm proposing.

Buildbot has the ability to call a function for each step to determine if that step should run, by specifying the doStepIf property. I recently started using this feature in portwatcher to skip the two trigger steps if there are no ports in the port list:

https://github.com/macports/macports-infrastructure/commit/18135d6c75698f88b48698473c9364063fb6fba9

Here is an example of what that looks like when it runs:

https://build.macports.org/builders/ports-10.13_x86_64-watcher/builds/3989

The only port that was committed there had already been built so it was excluded from the port list, leaving the list empty, so the portbuilder trigger step was skipped (to save time) and the mirror trigger step was skipped (to prevent it from printing an error that no ports were specified). The skipped steps are still shown in the web interface, but if that's not desired, they can be hidden by also using the hideStepIf property.

So my proposed combined ports scheduler would still contain all of the steps of the current portwatcher and portbuilder schedulers, but each build would still conceptually "be" either a portwatcher or a portbuilder, and for each build, the steps that don't relate to that conceptual function would be skipped and hidden.

Buildbot has the ability to associate custom properties with a build. We use this to set a "portname" property when we trigger portbuilder builds. portwatcher builds don't have that property. (They have a "portlist" property if they were forced from the web interface, and always have a "fullportlist" property which is the portlist plus the computed list of ports that were modified by the commit.) So if a build has the "portname" property, it is a portbuilder, otherwise it is a portwatcher. We can then write some simple functions:

    def is_portbuilder(step):
        return step.hasProperty('portname')

    def is_portwatcher(step):
        return not step.hasProperty('portname')

    def is_skipped(result, step):
        return (result == buildbot.status.results.SKIPPED)

The steps of the combined ports scheduler would then be:

* update mpbb (doStepIf=is_portwatcher, hideStepIf=is_skipped)
* clean (doStepIf=is_portwatcher, hideStepIf=is_skipped)
* selfupdate MacPorts (doStepIf=is_portwatcher, hideStepIf=is_skipped)
* sync ports tree (doStepIf=is_portwatcher, hideStepIf=is_skipped)
* get the list of subports (doStepIf=is_portwatcher, hideStepIf=is_skipped)
* trigger mirror scheduler (doStepIf=is_portwatcher, hideStepIf=is_skipped, waitForFinish=True)
* trigger ports scheduler for each subport (doStepIf=is_portwatcher, hideStepIf=True, waitForFinish=False)

* install dependencies (doStepIf=is_portbuilder, hideStepIf=is_skipped)
* install port (doStepIf=is_portbuilder, hideStepIf=is_skipped)
* gather archives (doStepIf=is_portbuilder, hideStepIf=is_skipped)
* upload archives to buildmaster (doStepIf=is_portbuilder, hideStepIf=is_skipped)
* deploy archives (doStepIf=is_portbuilder, hideStepIf=is_skipped)
* clean (doStepIf=is_portbuilder, hideStepIf=is_skipped)

(I'm advocating here for always hiding the ports trigger step. I don't find its output in the waterfall to be very useful and it can take up a lot of space if a lot of builds were triggered.)

This should solve problem (1). By having portwatcher and portbuilder tasks in the same queue, they can't run simultaneously so they can't cause the problems that happen when they run simultaneously.

Solving (2) and (3) requires an additional step. Buildbot allows you to define a nextBuild property when you create the builder, and pass it a function that determines which of the pending builds should go next.

http://docs.buildbot.net/0.8.12/manual/cfg-builders.html#builder-configuration

We would write a function that looks through the pending builds in the order in which they were scheduled, and picks the first one that is a portwatcher (the first one that doesn't have a "portname" property). If none are portwatchers, it picks the first* one. (* This is where we would later improve the situation to pick the next port in the correct dependency order, but I have another email about that.)

This solves (2) because now when a new commit comes in, a new "portwatcher" gets added to the end of the ports scheduler's queue, but it will be the next build picked immediately after the current "portbuilder" finishes, no matter how many other "portbuilders" are still pending. That will update the ports tree, so any pending builds for ports that were subsequently updated will build the now-current version, not the old version.

It also solves (3) because now if 100 "portbuilder" builds are pending, we don't have to wait until all of them are finished before we know how many builds all the pending "portwatcher" builds will schedule; we only have to wait until the current "portbuilder" finishes.

One drawback, depending on how you look at it, is that we would no longer be able to send a single combined email for all of the failed builds of a single commit, or of a single forced build. We would have to send an individual email for each failed build. Personally, I would prefer that, as the subject line of the email would make clear which port failed to build, rather than requiring me to open it to see what happened.

If we give the new combined scheduler a new name like "ports", that will invalidate all old links to build logs. That's not a deal-breaker for me, but since we often paste build log URLs into tickets, it would be nice to keep them alive if we can. Maybe defining empty portbuilder and portwatcher schedulers would be enough.