MacPorts caching of distfiles

Ryan Schmidt ryandesign at macports.org
Sat Feb 23 17:20:06 PST 2008


On Feb 23, 2008, at 18:09, Jordan K. Hubbard wrote:

> On Feb 23, 2008, at 8:29 AM, Ryan Schmidt wrote:
>
>> Is this going to be a distfiles backup in case the original site  
>> goes away? Or is this going to be a primary fetch location, before  
>> the original master_sites? I assume the former, since we wouldn't  
>> want to stop using the lovely global distributed server network  
>> that for example sourceforge provides, would we?
>
> "Depends."   For the sourceforge case, I can see merit to the  
> argument that it should be last in the search path.  For almost  
> everything else, however, there's merit to the argument that it  
> should be first since few other individual distribution points can  
> compare to Apple's mighty multi-gigabit bandwidth and peering points.
>
> Looking at the percentages, we see:
>
> $ find dports/ -name Portfile|xargs egrep master_sites[[:space:]]+sourceforge|wc
>      891    1822   51961
>
> 891 ports out of 4500.  I'd say that's a strong argument for it  
> being a primary fetch location.

There are many fetchgroups, but sourceforge is the only one I know of
that automatically does location-aware server selection. So I might
agree. Making the mirror the primary fetch location might also solve
these issues:


There's a group of mirrors for apache software, but MacPorts is  
currently downloading everything directly from www.apache.org which  
is a no-no; see:

http://lists.macosforge.org/pipermail/macports-dev/2008-January/004113.html


Paul Beard has previously noted that software from the gnome group  
always gets downloaded from France because that's the first server in  
the gnome fetchgroup, but this is inefficient if one is not also in  
France; see:

http://lists.macosforge.org/pipermail/macports-users/2007-April/002722.html

and:

http://lists.macosforge.org/pipermail/macports-users/2007-September/005596.html


>> What, if anything, will we do with MacPorts-hosted distfiles that  
>> we currently have in the repository? The repository has never been  
>> a great place for distfiles to live IMHO.
>
> Agreed.  I see no reason they couldn't also be hosted in the same way.

The distfiles that are in our repository are there because they're  
not anywhere else. One reason is that the files used to be somewhere  
else (e.g. on the master site) but then they were removed because  
they were old, or they moved to a different URL, or the server died,  
or the domain name expired, or something. These cases will be handled  
by what's been proposed in this thread, since MacPorts will fetch the  
files onto its distfile mirror the instant the port is committed. If  
a server moves or files are renamed and so forth, the port author  
won't discover it until trying to update the port to a new version.  
But I suppose that's ok. Stuff continues to work for the end user,  
which is better than what we have now when projects' download  
locations change.

What about distfiles which are already missing today? How do we get  
those into the distfiles mirror? Do we have to add them to the  
repository, so that they can be fetched from somewhere during the  
post-commit, and then remove them from the repository later? That's  
wasteful of repository space. I guess committers can put the distfile
on their own webspace temporarily, put that URL in the port's
master_sites, commit the port so the post-commit hook fetches the
file, then remove the URL from master_sites and commit again. But
that's messy. There should be a
way to get a distfile directly onto the mirror, for those cases where  
it's supposed to act as master, not mirror.


What about the distfiles currently in the repository? Is there a  
migration strategy for removing them? Or do we not care about the  
disk space occupied by those distfiles in the repository? I guess  
since the disk space won't be reclaimed unless we do a dump and  
filter and load of the repository, and since that is a big pain to do  
involving possibly lengthy downtime, we probably won't care enough  
about the disk space.


What about distfiles that are stealth-upgraded? For example, I  
updated the ImageMagick port to 6.3.8-9 on 2008-02-18 and a day later  
a modified version of the 6.3.8-9 distfile appeared on the download  
site. A user reported the checksum error to me and I found that a few  
lines of the source code had been changed in the new distfile, so I
updated the port revision and the checksums and committed it and  
closed the ticket. If there had been a MacPorts distfile mirror first  
in line providing the original distfile to the user, this situation  
would never have been discovered, and MacPorts users would never get  
the modified distfile, which seems like a bad thing. The author of  
the software obviously updated the distfile for a reason and wants  
users to have that new version.

The post-commit hook would have to do not only the fetch but also the  
checksum phase. If the checksums don't match, then clean --all (i.e.  
remove the (possibly old) distfile) and fetch and checksum again. If  
it now checksums properly, great: the distfile was old and has now  
been updated. If it still doesn't match then the author's checksums  
are wrong and and we run clean --all again (to remove the bad  
distfile from the mirror) and send an automated email to the  
maintainer or committer or something. This takes care of the issue of  
the old outdated distfile remaining on the mirror after the port  
maintainer finds out about the stealth-upgrade and updates the  
portfile. It does not however solve the problem of how the maintainer  
would discover the stealth-upgrade in the first place. And it negates  
one of the benefits of the mirror listed earlier: that older  
distfiles should remain available for users who haven't updated their  
ports tree or who deliberately are trying out an older version.
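
To make the fetch / checksum / clean / retry sequence above concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the mirror is modelled as an in-memory dict, SHA-256 stands in for whatever checksum types the Portfile lists, and refresh_mirror and fetch_upstream are hypothetical names, not anything that exists in MacPorts.

```python
import hashlib
from typing import Callable, Dict


def digest(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the Portfile's checksum types.
    return hashlib.sha256(data).hexdigest()


def refresh_mirror(mirror: Dict[str, bytes], name: str, expected: str,
                   fetch_upstream: Callable[[str], bytes]) -> str:
    """Return 'kept', 'fetched', 'updated' or 'mismatch'."""
    cached = mirror.get(name)
    if cached is not None and digest(cached) == expected:
        return "kept"  # mirror copy already matches the Portfile

    # Rough equivalent of "clean --all": drop the (possibly old) copy.
    mirror.pop(name, None)

    # Fetch again from the original master site and re-check.
    fresh = fetch_upstream(name)
    if digest(fresh) == expected:
        mirror[name] = fresh
        return "updated" if cached is not None else "fetched"

    # Still wrong: leave nothing on the mirror; the caller would send
    # the automated email to the maintainer or committer.
    return "mismatch"
```

Note that the "updated" branch is exactly where the old distfile disappears from the mirror, which is the drawback described above.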

This latter problem is even worse for ports whose distfile
names do not contain the version number. By my rough grep estimate,
we have over 125 ports in this situation. Port authors will discover  
a new version is available via the livecheck mechanism or via email  
notification from the project's announce list, one would hope, but  
once the update is committed, the old distfile won't be in the mirror  
anymore, if it has the same name as the new file.

I believe I saw that the FreeBSD mirrors put distfiles into a  
directory whose name is the md5 checksum of that file. If we managed  
to do that somehow that might solve the problems.
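
A minimal sketch of what such a checksum-named layout could look like, assuming a hypothetical mirror root and a helper I'm calling mirror_path (neither is an existing MacPorts or FreeBSD interface, and I haven't verified how FreeBSD actually lays out its mirrors):

```python
import hashlib
from pathlib import PurePosixPath


def mirror_path(root: str, distname: str, content: bytes) -> PurePosixPath:
    # The directory name is the md5 of the file's content, so two
    # different files with the same name never collide on the mirror.
    checksum = hashlib.md5(content).hexdigest()
    return PurePosixPath(root) / checksum / distname


# Hypothetical example: an original distfile and its stealth-upgraded
# replacement share a name but land in different directories.
old = mirror_path("/distfiles", "foo.tar.gz", b"original contents")
new = mirror_path("/distfiles", "foo.tar.gz", b"modified contents")
```

With a layout like this, a port's checksum would select the directory, so an older ports tree would keep finding the older file while an updated Portfile finds the new one.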


The proposed solution does not cache / mirror distfiles which are  
added as a result of selecting a variant or platform. Consider the  
+doc variant of many ports which causes additional documentation  
files to be downloaded, but there are other use cases as well; just  
grep for "distfiles-append" in the portfiles and you'll get an idea.  
There are ports that need to download different bootstrapping code  
based on platform, ports that download extra code only needed for the  
extra functionality enabled in a variant, etc.

The fetch phase honors variants too, so we could get the list of  
variants with "port variants" and run the fetch phase once for each  
variant (in addition to a run without any variants). e.g. for smlnj  
we would end up running:

port fetch smlnj
port fetch smlnj +universal
port fetch smlnj +powerpc
port fetch smlnj +i386

In this port, all but +universal would end up fetching extra files.

We would need to anticipate that selecting some variants will cause  
an error message and a nonzero return code, since it is common  
practice to display an error message and exit with a nonzero return  
code in the pre-fetch phase if we want to prevent the port from  
installing. For example, py-psyco does this if not running on an  
Intel Mac. We would want to ignore these errors in the post-commit hook.
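
Here is a rough sketch of how a post-commit hook might run the fetch once per variant and carry on past failures. It assumes the list of variant names has already been obtained somehow (I'm not parsing the output of "port variants" here), and fetch_commands and fetch_all are hypothetical helper names:

```python
import subprocess
from typing import Callable, Iterable, List, Sequence


def fetch_commands(port: str, variants: Iterable[str]) -> List[List[str]]:
    # One run without variants, then one run per variant.
    commands = [["port", "fetch", port]]
    commands += [["port", "fetch", port, "+" + v] for v in variants]
    return commands


def fetch_all(port: str, variants: Iterable[str],
              run: Callable[[Sequence[str]], int] = subprocess.call) -> List[str]:
    """Run every fetch; return the commands that exited nonzero."""
    failed = []
    for command in fetch_commands(port, variants):
        # A nonzero exit is expected for some variants (e.g. a pre-fetch
        # check refusing the platform), so note it and keep going.
        if run(command) != 0:
            failed.append(" ".join(command))
    return failed
```

Calling fetch_all("smlnj", ["universal", "powerpc", "i386"]) would issue the four commands listed earlier. The returned list could be logged rather than treated as a hook failure.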

There's an additional problem of ports that error out based on  
platform and don't do so in a platform selector (so there's no  
variant we could select to overcome it). For example, the wine port  
exits in pre-fetch if you're not running on an Intel Mac, since wine  
needs an Intel processor. You may tell me this is fine because the  
Mac OS Forge server runs on Intel, but then you have the same problem  
with the oracle-instantclient port, which exits if you're not running  
on PowerPC, since the oracle instantclient currently needs a PowerPC.



