MacPorts caching of distfiles
Ryan Schmidt
ryandesign at macports.org
Sat Feb 23 17:20:06 PST 2008
On Feb 23, 2008, at 18:09, Jordan K. Hubbard wrote:
> On Feb 23, 2008, at 8:29 AM, Ryan Schmidt wrote:
>
>> Is this going to be a distfiles backup in case the original site
>> goes away? Or is this going to be a primary fetch location, before
>> the original master_sites? I assume the former, since we wouldn't
>> want to stop using the lovely global distributed server network
>> that for example sourceforge provides, would we?
>
> "Depends." For the sourceforge case, I can see merit to the
> argument that it should be last in the search path. For almost
> everything else, however, there's merit to the argument that it
> should be first since few other individual distribution points can
> compare to Apple's mighty multi-gigabit bandwidth and peering points.
>
> Looking at the percentages, we see:
>
> $ find dports/ -name Portfile|xargs egrep master_sites[[:space:]]+sourceforge|wc
> 891 1822 51961
>
> 891 ports out of 4500. I'd say that's a strong argument for it
> being a primary fetch location.
There are many fetchgroups, but sourceforge is the only one I know of
that automatically does location-aware server selection. So I might
agree. A primary MacPorts mirror might also solve these issues:
There's a group of mirrors for apache software, but MacPorts is
currently downloading everything directly from www.apache.org which
is a no-no; see:
http://lists.macosforge.org/pipermail/macports-dev/2008-January/004113.html
Paul Beard has previously noted that software from the gnome group
always gets downloaded from France because that's the first server in
the gnome fetchgroup, but this is inefficient if one is not also in
France; see:
http://lists.macosforge.org/pipermail/macports-users/2007-April/002722.html
and:
http://lists.macosforge.org/pipermail/macports-users/2007-September/005596.html
>> What, if anything, will we do with MacPorts-hosted distfiles that
>> we currently have in the repository? The repository has never been
>> a great place for distfiles to live IMHO.
>
> Agreed. I see no reason they couldn't also be hosted in the same way.
The distfiles that are in our repository are there because they're
not anywhere else. One reason is that the files used to be somewhere
else (e.g. on the master site) but then they were removed because
they were old, or they moved to a different URL, or the server died,
or the domain name expired, or something. These cases will be handled
by what's been proposed in this thread, since MacPorts will fetch the
files onto its distfile mirror the instant the port is committed. If
a server moves or files are renamed and so forth, the port author
won't discover it until trying to update the port to a new version.
But I suppose that's ok. Stuff continues to work for the end user,
which is better than what we have now when projects' download
locations change.
What about distfiles which are already missing today? How do we get
those into the distfiles mirror? Do we have to add them to the
repository, so that they can be fetched from somewhere during the
post-commit, and then remove them from the repository later? That's
wasteful of repository space. I guess committers can put it on their
own webspace temporarily, put that URL in the port's master_sites,
commit it so the post-commit fetches it, then remove it from the
master_sites and commit again. But that's messy. There should be a
way to get a distfile directly onto the mirror, for those cases where
it's supposed to act as master, not mirror.
What about the distfiles currently in the repository? Is there a
migration strategy for removing them? Or do we not care about the
disk space occupied by those distfiles in the repository? I guess
since the disk space won't be reclaimed unless we do a dump and
filter and load of the repository, and since that is a big pain to do
involving possibly lengthy downtime, we probably won't care enough
about the disk space.
What about distfiles that are stealth-upgraded? For example, I
updated the ImageMagick port to 6.3.8-9 on 2008-02-18 and a day later
a modified version of the 6.3.8-9 distfile appeared on the download
site. A user reported the checksum error to me, and I found that a few
lines of the source code had been changed in the new distfile, so I
updated the port revision and the checksums, committed the change, and
closed the ticket. If there had been a MacPorts distfile mirror first
in line providing the original distfile to the user, this situation
would never have been discovered, and MacPorts users would never get
the modified distfile, which seems like a bad thing. The author of
the software obviously updated the distfile for a reason and wants
users to have that new version.
The post-commit hook would have to do not only the fetch but also the
checksum phase. If the checksums don't match, then clean --all (i.e.
remove the (possibly old) distfile) and fetch and checksum again. If
it now checksums properly, great: the distfile was old and has now
been updated. If it still doesn't match, then the port author's checksums
are wrong, and we run clean --all again (to remove the bad
distfile from the mirror) and send an automated email to the
maintainer or committer or something. This takes care of the issue of
the old outdated distfile remaining on the mirror after the port
maintainer finds out about the stealth-upgrade and updates the
portfile. It does not, however, solve the problem of how the maintainer
would discover the stealth-upgrade in the first place. And it negates
one of the benefits of the mirror listed earlier: that older
distfiles should remain available for users who haven't updated their
ports tree or who deliberately are trying out an older version.
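
The fetch-and-checksum procedure described above could be sketched
roughly as follows. This is only an illustration in Python, not an
existing MacPorts hook: the function name mirror_distfile and the four
callables (fetch, verify, clean, notify) are hypothetical stand-ins for
the port's fetch phase, its checksum phase, clean --all, and the
automated email.

```python
from typing import Callable


def mirror_distfile(
    fetch: Callable[[], None],
    verify: Callable[[], bool],
    clean: Callable[[], None],
    notify: Callable[[str], None],
) -> str:
    """Sketch of the post-commit logic: fetch, checksum, retry once.

    All four callables are hypothetical stand-ins for the real port
    phases (fetch, checksum, clean --all) and the notification email.
    """
    fetch()
    if verify():
        return "ok"

    # Mismatch: the mirrored copy may be stale, so drop it and retry.
    clean()
    fetch()
    if verify():
        return "updated"  # the old distfile has been replaced

    # Still wrong: the checksums in the port are bad. Remove the bad
    # distfile from the mirror and tell the maintainer or committer.
    clean()
    notify("checksum mismatch after refetch")
    return "failed"
```

The three return values correspond to the three outcomes in the text:
the distfile was already good, the distfile was stale and has been
replaced, or the committed checksums are wrong.
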
The loss of older distfiles is an even bigger problem for ports whose distfile
names do not contain the version number. By my rough grep estimate,
we have over 125 ports in this situation. Port authors will discover
a new version is available via the livecheck mechanism or via email
notification from the project's announce list, one would hope, but
once the update is committed, the old distfile won't be in the mirror
anymore, if it has the same name as the new file.
I believe I saw that the FreeBSD mirrors put distfiles into a
directory whose name is the md5 checksum of that file. If we managed
to do that somehow, that might solve these problems.
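
A minimal sketch of the checksum-directory idea, in Python. The
function name mirror_path and the exact layout (md5 hex digest as the
directory, original filename inside it) are assumptions made for
illustration; this is not taken from the actual FreeBSD mirrors or
from any MacPorts code.

```python
import hashlib
from pathlib import PurePosixPath


def mirror_path(filename: str, content: bytes) -> PurePosixPath:
    """Return a mirror-relative path of the form <md5 hex digest>/<filename>."""
    digest = hashlib.md5(content).hexdigest()
    return PurePosixPath(digest) / filename


# Two different distfiles with the same name land in different
# directories, so an original and a stealth-upgraded copy (or an old
# and a new unversioned distfile) can coexist on the mirror.
old = mirror_path("foo.tar.gz", b"original contents")
new = mirror_path("foo.tar.gz", b"modified contents")
assert old != new
assert old.name == new.name == "foo.tar.gz"
```

Because the port already records the checksums of the distfile it
expects, a client could in principle compute the same path and ask the
mirror for exactly that file.
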
The proposed solution does not cache or mirror distfiles that are
added as a result of selecting a variant or platform. The +doc variant
of many ports, which causes additional documentation files to be
downloaded, is one example, but there are other use cases as well; just
grep for "distfiles-append" in the portfiles and you'll get an idea.
There are ports that need to download different bootstrapping code
based on platform, ports that download extra code only needed for the
extra functionality enabled in a variant, etc.
The fetch phase honors variants too, so we could get the list of
variants with "port variants" and run the fetch phase once for each
variant (in addition to a run without any variants). For example, for
smlnj we would end up running:
port fetch smlnj
port fetch smlnj +universal
port fetch smlnj +powerpc
port fetch smlnj +i386
In this port, all but +universal would end up fetching extra files.
We would need to anticipate that selecting some variants will cause
an error message and a nonzero return code, since it is common
practice to display an error message and exit with a nonzero return
code in the pre-fetch phase if we want to prevent the port from
installing. For example, py-psyco does this if not running on an
Intel Mac. We would want to ignore these errors in the post-commit hook.
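
The per-variant fetch loop, with failures tolerated rather than
fatal, might look something like this Python sketch. The function
names fetch_commands and run_tolerant are hypothetical, and the list of
variants is assumed to be supplied by the caller (for instance,
gathered from "port variants"); no attempt is made here to parse that
command's output.

```python
import subprocess
from typing import Callable, List, Sequence


def fetch_commands(port: str, variants: Sequence[str]) -> List[List[str]]:
    """One fetch command without variants, then one per variant."""
    commands = [["port", "fetch", port]]
    commands += [["port", "fetch", port, "+" + v] for v in variants]
    return commands


def run_tolerant(
    commands: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> List[List[str]]:
    """Run every command; return the ones that exited nonzero.

    A nonzero exit (e.g. a pre-fetch error for an unsupported
    platform) is recorded but does not stop the remaining fetches.
    """
    failed = []
    for cmd in commands:
        if run(cmd) != 0:
            failed.append(list(cmd))
    return failed
```

Calling fetch_commands("smlnj", ["universal", "powerpc", "i386"])
produces the four commands listed above; run_tolerant then runs each of
them and reports the failures instead of aborting the hook.
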
There's an additional problem of ports that error out based on
platform and don't do so in a platform selector (so there's no
variant we could select to overcome it). For example, the wine port
exits in pre-fetch if you're not running on an Intel Mac, since wine
needs an Intel processor. You may tell me this is fine because the
Mac OS Forge server runs on Intel, but then you have the same problem
with the oracle-instantclient port, which exits if you're not running
on PowerPC, since the oracle instantclient currently needs a PowerPC.