GSoC 2019 [Collect build statistics]

Mojca Miklavec mojca at macports.org
Sun Mar 24 16:31:49 UTC 2019


Hi,

(Sorry, this email got so long that I'll answer the others separately.)

On Sat, 23 Mar 2019 at 11:26, Arjun Salyan wrote:
> On Sat, Mar 23, 2019 at 3:15 PM Mojca Miklavec wrote:
>>
>> I would use the first definition: number of users currently having the
>> port installed. It might be pretty common to have to reinstall the
>> same port multiple times (maybe just for debugging / development
>> reasons) and we don't want to count the port developer 20 times. If
>> the user uninstalled the port, it's equivalent to me as never having
>> it installed in the first place.
>
>
> Thanks. But in that case what would be considered as number of installations in a particular month? Suppose, the first weekly submission contains port P in active_ports, but during second submission(in the same month), the port is uninstalled.
>
> One way would be to have it consider the number of users having it in active ports on the last day of the month or on 15th.

Short answer: I would consider the port as installed by a particular
user if it was reported as installed at least once in that month (if
it was installed during the first report, then uninstalled, count it
as installed; it will not be counted next month anyway if the user
just made a mistake / changed their mind).


Long answer:

I would say that there is no single correct answer (I'll try to give a
few examples below), but I find it quite important not to do any
"lossy data import" at the time of importing the statistics. Non-lossy
import allows you to change the representation of data (what to show
and how) at any given point in the future.

The existing statistics page discards a lot of information at the time
of import. For example, it just counts the overall number of users of
each macOS version, which turned out to be a completely useless piece
of information because it is not correlated with time. We want to know
how many users of 10.8 we have today, not counting the users who have
migrated since.

A big mistake we made in the early days of GSoC is that we didn't try
to deploy the solutions early enough (this was properly deployed only
long after the GSoC was over), so the student only ever worked with
made-up data and nobody ever noticed that this would be a problem. But
even putting that late deployment aside ... if the data hadn't been
lost during the statistics submission, we could still recalculate
historical data and change the representation to the exact form in
which we want it now (after months or years of experience and
feedback). If we still had raw data in the form of
    (uuid, timestamp, os_version)
we could still experiment with various data representations and draw
the desired graphs. Now we only keep
    (uuid, os_version)
in the database. Granted, it's much easier to draw the graph from the
second representation than from the first one, but the first one
carries a lot more information. With proper database indexing and some
non-trivial SQL queries you could easily draw "any graph you want"
from the first table.
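
To make the point about the raw table concrete, here is a minimal
sketch using Python's sqlite3 module with an in-memory database. Only
the (uuid, timestamp, os_version) shape comes from the text above; the
table name "submissions", the index name, the function name and the
sample rows are made up for illustration. It counts each user, per
month, under the OS version of their latest report in that month, so a
user who has migrated no longer shows up under the old version.

```python
import sqlite3

# Hypothetical raw table: one row per statistics submission.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (uuid TEXT, timestamp TEXT, os_version TEXT)"
)
# Index so that per-user, per-time lookups stay cheap as the table grows.
conn.execute("CREATE INDEX submissions_uuid_ts ON submissions (uuid, timestamp)")

# Made-up sample data; timestamps are ISO dates so they sort as text.
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?)",
    [
        ("user-a", "2019-01-07", "10.8"),
        ("user-a", "2019-03-04", "10.14"),  # user-a migrated to 10.14
        ("user-b", "2019-01-10", "10.8"),
        ("user-b", "2019-01-17", "10.8"),   # second report in the same month
        ("user-b", "2019-03-11", "10.8"),
        ("user-c", "2019-03-12", "10.14"),
    ],
)


def os_counts_per_month(connection):
    """Return (month, os_version, users) rows.

    Each user is counted once per month, under the OS version of
    their latest submission in that month.
    """
    query = """
        SELECT substr(s.timestamp, 1, 7) AS month,
               s.os_version,
               COUNT(DISTINCT s.uuid) AS users
        FROM submissions AS s
        WHERE s.timestamp = (
            SELECT MAX(t.timestamp)
            FROM submissions AS t
            WHERE t.uuid = s.uuid
              AND substr(t.timestamp, 1, 7) = substr(s.timestamp, 1, 7)
        )
        GROUP BY substr(s.timestamp, 1, 7), s.os_version
        ORDER BY month, s.os_version
    """
    return connection.execute(query).fetchall()


for row in os_counts_per_month(conn):
    print(row)
```

With only (uuid, os_version) stored, the January numbers in this
example could no longer be reconstructed once user-a has migrated.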

Ideally the database should contain only raw data, and then some views
to assist with further statistics. Certain pages could be cached, so
that the database would not need to recalculate the same data over and
over again even when the underlying data hasn't changed at all. Only if
we ran into serious performance issues would I start doing some
pre-calculations and storing them back to the database, maybe run
nightly, hourly or so.



Here are some examples of why I don't see a single correct answer to
your initial question. Let's assume that you know absolutely
everything about all MacPorts installations (exact timestamp of when
each port was installed or uninstalled, exact timestamp of MacPorts
installations / upgrades / removals ...) and you want to know the
answer to
    "How many users have port Foo installed on each OS version in March 2019?"

1.) Assume I have it installed on a computer in the office, but I was
on vacation or a business trip all March, so the computer was not even
online to submit its monthly statistics. Does that computer count? It
won't count now as it would not submit the statistics, but it could
count if you knew everything about that computer. If you recorded the
event when I installed the port and didn't see any uninstallation /
deactivation events since, you could still count it as active
(maybe). Well, you could argue that I didn't use that computer for a
month anyway, so it has every right not to be counted, which is a
fair argument, but ...

2.) I also have that port on my laptop and I used it actively during
that time. But since I was travelling, I hardly ever had access to
the internet from the laptop (as good as never), so there would be no
statistics sent either.

3.) I have that port on my old laptop, which I haven't turned on for
the last few months (but the software is still there). Even if you knew
everything about the history of MacPorts installations on that laptop:
would you count that port? Probably not, since you cannot even know
whether that computer ended up in recycling in the meantime. Then I open it
again next month, the installation is still there, ports are reported
as present. You could potentially interpolate the missing months and
count the port as present in those months as well (you probably don't
want to actually do that, I'm just providing some border-case
examples).

4.) You may know that the user installed the port on the 5th of March,
uninstalled it again five days later, then installed it again on the
25th. I assume you could in theory count this as "days_installed /
days_in_month" (or seconds_installed / seconds_in_month), but that
would be overdoing it; I would say that if the user reported the port
as installed at least once in that month, count it as installed. The
only thing that you really need to be careful about is not to count a
certain port as installed 10 times in case one user upgraded that port
9 times.
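
The "reported at least once in that month" rule can be sketched in a
few lines of Python. The input format, a list of (uuid, date, port)
tuples taken from the weekly reports, and the function name are
assumptions for illustration, not the actual submission format. Using
a set of uuids per (month, port) is what keeps a developer who
reinstalls or upgrades the port many times from being counted more
than once.

```python
from collections import defaultdict


def monthly_port_users(reports):
    """Count distinct users per (month, port).

    'reports' is an iterable of (uuid, 'YYYY-MM-DD', port_name) tuples,
    one per port listed as installed in a submission (hypothetical
    format). A user counts for a month if the port appeared in at
    least one of their reports that month.
    """
    users = defaultdict(set)
    for uuid, date, port in reports:
        month = date[:7]  # 'YYYY-MM'
        users[(month, port)].add(uuid)
    return {key: len(uuids) for key, uuids in users.items()}


# Made-up sample: a developer reporting Foo in three weekly submissions,
# and a user whose reports from the 7th and the 28th contain Foo while
# the one in between does not (installed, uninstalled, installed again).
sample_reports = [
    ("dev", "2019-03-03", "Foo"),
    ("dev", "2019-03-10", "Foo"),
    ("dev", "2019-03-17", "Foo"),
    ("u2", "2019-03-07", "Foo"),
    ("u2", "2019-03-28", "Foo"),
    ("u2", "2019-04-04", "Bar"),
]
print(monthly_port_users(sample_reports))
```

Because the raw reports are kept, a different rule (last report of the
month, fraction of the month, ...) could be swapped in later without
re-importing anything.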


Additional points to bear in mind (not with a high priority):
- This requires modification of base, but we might want to add
statistics submission at each port install / uninstall / activate /
deactivate command. Not something to implement right now, but maybe
something to keep in the back of our minds when designing the database
representation.
- I'm not sure how the current submission works; would statistics even
be submitted if I'm offline at that one time in the week when I was
supposed to send the statistics?

Mojca

