Buildbot proposal: combine portwatcher and portbuilder

Mon Mar 12 20:12:14 UTC 2018

On Mar 11, 2018, at 17:48, Rainer Müller wrote:
> On 2018-03-11 10:25, Ryan Schmidt wrote:
>> The current buildbot setup has a number of problems that I believe could be solved by combining the currently separate portwatcher and portbuilder schedulers into a single ports scheduler.
>> 
>> I am not suggesting that we return to the behavior of the ports scheduler on the old macOS forge buildbot system in which a single build would build all the specified ports. We will keep the current method of building only one port (and its dependencies) per build.
>> 
>> The problems I want to solve are the following:
>> 
>> 1. Currently, portwatcher is responsible for updating a copy of mpbb, MacPorts base and a ports tree that it shares with portbuilder. Having portbuilder maintain its own copy would waste a lot of time. If someone makes one commit that changes 100 ports and then no further commits occur for hours, we only want to update mpbb, MacPorts base and the ports tree once, not 100 times. But the fact that it's shared means that portwatcher must (and is configured to) wait for all triggered portbuilder builds to finish before it processes the next commit. This works fine, unless the buildmaster is stopped while portwatcher builds are pending. This has happened several times when the servers lost power during a power outage. (The servers are on a UPS, but the UPS does not provide as much instantaneous power as I expected, so if the servers are busy building, they draw more power than the UPS can instantaneously provide and the servers shut down immediately. I might remove the buildworker machines from the UPS and leave only the buildmaster, modem and router on it.) When buildmaster comes back online, it sees the portbuilder build that was in progress and starts it again, but it also sees the portwatcher build that was in progress and starts it again. Now we have a portwatcher running (updating mpbb, updating MacPorts, updating the ports tree, and updating its portindex) while portbuilder is trying to install a port. The portbuilder can fail if it is trying to install ports at the moment that portwatcher is updating the index (see https://trac.macports.org/ticket/53587).
> 
> Do the steps for selfupdate/sync really hurt that much that cannot just
> run them on every portbuilder run? Looking at the portwatcher build you
> linked as an example, these steps only took a few seconds in total. Why
> can we not just move these steps to the portbuilder?

I stand by my proposal. I want to update the ports tree clone on the buildbot workers only when it changes. I don't want to waste cycles updating a tree that has not changed.

> At the moment, portwatcher and portbuilder are sharing resources, but
> buildbot assumes each builder is isolated and that leads to these
> problems when resuming builds. I guess we should not do that...

If you want a separate portwatcher and portbuilder that each have their own separate copy of mpbb, MacPorts, the ports tree, and the portindex, not only is that more disk space used, but then a simple single port update commit will cause the buildworkers to have to do twice as much updating, which will take longer than what we're doing now. I want to optimize, not de-optimize! :)

>> 2. If a single portwatcher build "X" triggers many portbuilder builds, and while those portbuilder builds are in progress another commit comes in that would affect those ports, it don't notice until all portbuilder builds triggered by "X" are finished. This can waste time building ports that are already superseded by newer versions or revisions. An extreme example of this is if we were to force a portwatcher build for all ports (which we might want to do when a new version of macOS is released). mpbb, MacPorts base and the ports tree would be updated once, and then it would schedule a portbuilder for each port that had not yet been built. Building all ports will take weeks. During that time, a commit may come through that updates a port to a new version. But if the build of the old version of the port was still pending at that time, the buildbot will build the old version, because it can't update the ports tree until the current portwatcher build is done waiting for its triggered portbuilder builds.
> 
> To me it seems like this is only an issue for forced builds, not for the
> builds scheduled by commits.

No, it is relevant for all situations that cause lots of ports to build. My extreme example was forcing a build of all ports, but the same problem would happen if you committed a change to many ports, such as a commit that occurred in the past to remove $Id$ lines from all ports, or commits that will happen in the future to batch-add GitHub maintainer handles and add file size to checksums.

> So maybe for this use case the force scheduler should be on a level one
> higher in the hierarchy. Then the force scheduler would get a list of
> ports and for each schedule a portwatcher with only one port name (or
> use some other way of partitioning).

I don't see any need to do this.

>> 3. When there are portwatcher builds pending, we have no idea how many portbuilder builds are pending. It may say there are e.g. 3 portbuilder builds pending, but the pending portwatchers could trigger any number of additional portbuilders.
> 
> Why does it matter how many portbuilder builds will be scheduled later?
> I do not see the problem...?

I want some kind of indication of how much work the buildbot has left to do. We've already had to turn off the feature that estimates how much time the current build will take, since we don't know that, since each port takes a different amount of time to build. The only piece of information we have left that might provide some kind of answer is how many builds are pending, and our current setup doesn't give us that information either. My proposal provides this information quicker than our current setup.

> Overall, my immediate thought was that with the current portwatcher, we
> could just not wait for the triggered builds to finish. That seems to
> solve (1) and (2), but also lose the ability to send summary emails.
> Although I did not think enough about this whether it would really work.
> 
> Am I missing something why we would definitely need to merge portwatcher
> and portbuilder?

I would rather ask you: am I missing something why we want them to be separate? The waterfall is a logfile with a fancy interface. I want to `tail -f` that logfile and see the list of events happening on the ports queue. I don't want to have to switch between two different columns of output.

The other problem I wanted to fix, that I forgot to mention, is that the waterfall is too wide. I want one column and one queue for each port builder, not two. Especially since the waterfall is about to get wider still when we add more jobs to it.

>> An objection to this proposal was that buildbot 0.8 does not have the capability to dynamically create scheduler steps at runtime. But that's not required and that's not what I'm proposing.
>> 
>> Buildbot has the ability to call a function for each step to determine if that step should run, by specifying the doStepIf property. I recently started using this feature in portwatcher to skip the two trigger steps if there are no ports in the port list:
>> 
>> https://github.com/macports/macports-infrastructure/commit/18135d6c75698f88b48698473c9364063fb6fba9
>> 
>> Here is an example of what that looks like when it runs:
>> 
>> https://build.macports.org/builders/ports-10.13_x86_64-watcher/builds/3989
>> 
>> The only port that was committed there had already been built so it was excluded from the port list, leaving the list empty, so the portbuilder trigger step was skipped (to save time) and the mirror trigger step was skipped (to prevent it from printing an error that no ports were specified). The skipped steps are still shown in the web interface, but if that's not desired, they can be hidden by also using the hideStepIf property.
> 
> I noticed this and I like this change. +1
> 
>> So my proposed combined ports scheduler would still contain all of the steps of the current portwatcher and portbuilder schedulers, but each build would still conceptually "be" either a portwatcher or a portbuilder, and for each build, the steps that don't relate to that conceptual function would be skipped and hidden.
> 
>> [...]
> 
> It sounds like a complete hack to use doStepIf this way...

You could call our entire current buildbot setup a hack. Buildbot was intended to continuously build new versions of a single piece of software, not what we're doing. That didn't stop us from doing it...

>> This should solve problem (1). By having portwatcher and portbuilder tasks in the same queue, they can't run simultaneously so they can't cause the problems that happen when they run simultaneously.
>> 
>> Solving (2) and (3) requires an additional step. Buildbot allows you to define a nextBuild property when you create the builder, and pass it a function that determines which of the pending builds should go next.
>> 
>> http://docs.buildbot.net/0.8.12/manual/cfg-builders.html#builder-configuration
>> 
>> We would write a function that looks through the pending builds in the order in which they were scheduled, and picks the first one that is a portwatcher (the first one that doesn't have a "portname" property). If none are portwatchers, it picks the first* one. (* This is where we would later improve the situation to pick the next port in the correct dependency order, but I have another email about that.)
>> 
>> This solves (2) because now when a new commit comes in, a new "portwatcher" gets added to the end of the ports scheduler's queue, but it will be the next build picked immediately after the current "portbuilder" finishes, no matter how many other "portbuilders" are still pending. That will update the ports tree, so any pending builds for ports that were subsequently updated will build the now-current version, not the old version.
> 
> But the old builds would be left in the queue and follow after the
> portwatcher, so we still have multiple builds in the queue for the same
> port...

As I said in my reply to Mojca, we would implement https://trac.macports.org/ticket/55078 to make those duplicate builds exit quickly, and later maybe find a way to remove pending duplicate builds entirely.

> As the builds were scheduled by an earlier commit (which buildbot keeps
> in the "sourcestamp"), the attribution for notifications might be
> difficult. I am afraid it could lead to reports of a build failure for
> the wrong commit unless you manage to cancel the pending builds.

I have not seen build emails in awhile because of the email delivery problem I haven't investigated yet, but I'm pretty sure we already have this "problem", if you really think it is a problem. When we first implemented the new buildbot, we always built the latest version of a port, not the version that was committed.* So if a developer commits, in rapid succession, r5 updating a port to version 1.4, r6 adding a library dependency to it, and r7 increasing its revision, and this all occurs while the buildbot is busy building r4, then, when the build for r4 finishes and portwatcher gets around to processing r5, it will actually update the git clone to r7 before scheduling the portbuilder. Assuming that succeeds, when portwatcher processes r6, it will realize that the port has already been built and exit without scheduling any portbuilders, and the same will happen again for r7. On the other hand, the build of r7 that was scheduled for r5 might still fail for whatever reason. In the email that gets sent about that, if we're not already doing so, we might want to point out if we built a different commit than the one that triggered the build.

*Larry fixed that in https://github.com/macports/macports-infrastructure/commit/588e94a585c46532585d571957634b73ce42253c, but this ended up wasting a lot of time building ports that had already been superseded. See https://trac.macports.org/ticket/54648. I reverted the fix in https://github.com/macports/macports-infrastructure/commit/f02854986a62259279dccf21b4d2bf3bed939762.

> Buildbot would normally solve this by merging build requests. If a new
> build request gets put into the queue, the old one is discarded. The
> problem is that we cannot merge portwatcher builds, because we would
> only be able to do that as soon as we know the list of ports. However,
> we can only determine that once the portwatcher build is executed, so it
> is no option for us.
> 
> https://docs.buildbot.net/0.8.12/manual/cfg-builders.html#merging-build-requests
> 
> If the portwatcher triggered the portbuilder and then exited, we could
> look into merging the portbuilder jobs.

Hmm. Well, I don't know about that. I'm still hopeful that buildbot has an API we can use to inspect and cancel upcoming jobs. This feels like a minor optimization that can be done later. It would be good to do, but implementing https://trac.macports.org/ticket/55078 would get us most of the way there.

>> It also solves (3) because now if 100 "portbuilder" builds are pending, we don't have to wait until all of them are finished before we know how many builds all the pending "portwatcher" builds will schedule; we only have to wait until the current "portbuilder" finishes.
> 
>> One drawback, depending on how you look at it, is that we would no longer be able to send a single combined email for all of the failed builds of a single commit, or of a single forced build. We would have to send an individual email for each failed build. Personally, I would prefer that, as the subject line of the email would make clear which port failed to build, rather than requiring me to open it to see what happened.
> 
> Didn't we invest a lot of effort to find a way to only send a single
> email instead of sending one per build...?

Honestly, I didn't find the combined emails helpful. As I recall, if I made a commit that scheduled many port builds, and one of them failed to build, the email that was sent contained a link to every port's log, with no indication of which of them failed; I'm not going to click 100 or 50 or 10 log links to find the one that I was meant to look at.

Of course, we could improve the content of the emails and make it more useful.

Thinking about it further, there's no reason why we can't send a single email for each batch of builds. We can store the build logs grouped by a batch id. We probably already have some property that could be used as the batch id, such as the id of the portwatcher build that triggered the builds. Since we'll be triggering the builds in a predictable order, we know which build is the last one. We can flag the last build of the batch with a unique property. When a build has that property, that means it should gather all the logs from that batch and send it out.

But personally I would rather receive individual emails, as soon as a build fails. If we schedule a build of all ports, do you want to know that your port failed to build right when it happened, so that you can maybe fix it quickly and thus prevent some dependent ports from failing to build as a result, or do you want to wait weeks for all the other ports to finish first before you're notified?

>> If we give the new combined scheduler a new name like "ports", that will invalidate all old links to build logs. That's not a deal-breaker for me, but since we often paste build log URLs into tickets, it would be nice to keep them alive if we can. Maybe defining empty portbuilder and portwatcher schedulers would be enough.
> 
> I would not care about the logs at all. They are only interesting for a
> limited time and only for failed builds. It should be no problem to just
> delete old logs.

I care, because as I said I often paste build or log URLs into tickets, both MacPorts tickets and upstream tickets.

We could keep the combined scheduler named "portbuilder" to keep the old logs; we don't care about the portwatcher logs.

Or we could name the new scheduler "ports" but keep the logs and last build id of the portbuilder scheduler, and use a web server rule to fix up old portbuilder build URLs.