GSoC 18 Project | Collect build statistics

Tue Apr 3 15:34:06 UTC 2018

Dear Vishnu,

Thank you very much for sharing the document. The purpose of this HTML
was two-fold:
- demonstrating your skills
- first step of the planning phase for the actual implementation

Below I'm providing some feedback, but I would suggest concentrating
on a simple Django app for now and returning to this HTML once you
are "done" with Django to address (some of) the comments. In short:
if you are selected, I'll insist on making this document "perfect"
before proceeding (and on addressing all the feedback, plus more that
I didn't yet bother writing), but there's no point in asking you to
spend a week making this document ten times longer and fixing tiny
unimportant details that don't really demonstrate the skillset :)

On 2 April 2018 at 23:32, Mojca Miklavec wrote:
> On Mon, 2 Apr 2018 at 19:49, Vishnu wrote:
>>
>> In the database.
>> Because then it would be very easy to count the number of os for that
>> port.
>
> I'll explain tomorrow why this is suboptimal. (But there's no need to
> further optimise the database design right now.)

There are probably better resources that explain this, but here are
the first hits from Google:

https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
https://en.wikipedia.org/wiki/Database_normalization

In an extreme case, imagine that we decide to send a questionnaire to our
participants of statistics collection, asking them some 100 optional
questions, including anything from gender, age, country of origin,
country of current residence, education, favourite animal, ... Then we
decide that we would want to compare the age distribution of users of
package A vs. age distribution of users of package B.

Your idea that allows "very easy number counting" would mean the
following.

At the moment you only have (submission id, port, port version,
variants) in the table. You would need to extend the table to contain
    (submission id, submission time, user id, port, port version,
    variants, os version, stdlib, xcode version, age, gender, country,
    education, favourite animal, ...)
And if the user has 1000 ports installed, you would need to store
100x1000 cells (repeating the same information one thousand times, and
then again in every subsequent submission from the same user) instead
of having a single copy in a separate "questionnaire" table. Multiply
that by 10,000 users submitting statistics and you end up with tens
of gigabytes of data each month, just to store the results of that
one-time questionnaire.
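
As a rough back-of-the-envelope check of those numbers, here is a small
sketch. The 10 bytes per cell and the four submissions per month are
assumptions made only for this illustration, not measured values:

```python
# Rough storage estimate for the one-time questionnaire.
# Assumptions (illustrative only): about 10 bytes per stored cell and
# 4 submissions per user per month.
ANSWERS = 100               # questionnaire answers per user
PORTS = 1000                # installed ports per user
USERS = 10_000              # users submitting statistics
BYTES_PER_CELL = 10         # assumed average size of one answer
SUBMISSIONS_PER_MONTH = 4   # assumed submission frequency


def single_table_bytes_per_month() -> int:
    """Answers repeated on every port row of every submission."""
    return ANSWERS * PORTS * USERS * SUBMISSIONS_PER_MONTH * BYTES_PER_CELL


def separate_table_bytes() -> int:
    """Answers stored once per user in a separate 'questionnaire' table."""
    return ANSWERS * USERS * BYTES_PER_CELL


print(single_table_bytes_per_month() / 1e9, "GB per month in a single table")
print(separate_table_bytes() / 1e6, "MB in total in a separate table")
```

Under these assumptions the single table grows by about 40 GB per
month, while the separate table stays at about 10 MB in total.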

On top of that, once the user submits a questionnaire, if you keep
those answers in a separate table and use proper SQL queries, you
could easily get the answer to the question "what was the prevailing
gender of users of package A" even for submissions that were made many
months ago. If you store everything in a single monstrous table, you
would either need to modify plenty of old submissions or you would not
be able to get that information for old submissions at all.

Additionally, it could happen that while you are updating old
submissions, the database crashes. You could end up with half of the
entries updated and the other half left at their old value, in an
inconsistent state. There are plenty of problems if you don't make
sure that you keep your database design in good shape from the very
beginning.

That's a super common use case in databases that has already been
solved. One should use table joins and views. Random link (I'm sure
there are better ones):
    https://db.grussell.org/sql3.html
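
To make the join/view idea a bit more concrete, here is a minimal
sketch using Python's built-in sqlite3 module. All table and column
names (questionnaire, submission, installed_port, port_user) and all
data are made up for illustration; this is not the actual schema of
the project, just one possible shape of it:

```python
import sqlite3

# Hypothetical schema: all names below are invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questionnaire (
        user_id INTEGER PRIMARY KEY,
        gender  TEXT,
        age     INTEGER
    );
    CREATE TABLE submission (
        submission_id INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL,
        submitted     TEXT NOT NULL
    );
    CREATE TABLE installed_port (
        submission_id INTEGER NOT NULL REFERENCES submission(submission_id),
        port          TEXT NOT NULL,
        version       TEXT NOT NULL
    );
    -- The view hides the joins; it stores no data of its own.
    CREATE VIEW port_user AS
        SELECT DISTINCT p.port, s.user_id, q.gender, q.age
        FROM installed_port AS p
        JOIN submission AS s ON s.submission_id = p.submission_id
        JOIN questionnaire AS q ON q.user_id = s.user_id;
""")

# Old submissions, made before anyone answered the questionnaire.
conn.executemany("INSERT INTO submission VALUES (?, ?, ?)", [
    (1, 1, "2018-01-05"), (2, 2, "2018-01-06"), (3, 3, "2018-01-07"),
])
conn.executemany("INSERT INTO installed_port VALUES (?, ?, ?)", [
    (1, "A", "1.0"), (1, "B", "2.1"),
    (2, "A", "1.0"),
    (3, "A", "1.1"), (3, "B", "2.1"),
])
# The answers arrive months later: one row per user, nothing else changes.
conn.executemany("INSERT INTO questionnaire VALUES (?, ?, ?)", [
    (1, "female", 31), (2, "female", 45), (3, "male", 27),
])


def gender_counts(port):
    """Count distinct users of a port per gender, most common first."""
    rows = conn.execute(
        "SELECT gender, COUNT(*) AS n FROM port_user "
        "WHERE port = ? GROUP BY gender ORDER BY n DESC, gender",
        (port,),
    )
    return rows.fetchall()


print(gender_counts("A"))
```

Note that the old rows in submission and installed_port are never
modified: the questionnaire answers exist exactly once per user, and
the view picks them up for old and new submissions alike.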

I don't know how Django handles joins and views (some hints I skimmed
through are here https://stackoverflow.com/a/1281051/585897), but one
should certainly make sure that the database design is done well.
Learning more about that topic is part of the process.

On 2 April 2018 at 23:50, Vishnu wrote:
>
> Please go through this https://jsfiddle.net/vishnum98/3r4vL4L3/21/
>
> I did some changes.

Thank you very much. The chart looks ok. For the remaining (missing)
charts just add a section (and optionally an empty box) and describe
what kind of chart goes there (no need for a long paragraph, just make
it clear what's on the Y axis).

I don't think we need a drop-down to select a version, but now that
you put it there, what I think would be helpful to have there is
something to switch between:
- absolute number of installations in that month
- number of installations of that port divided by total number of
submissions in that month
That is: having both absolute and relative numbers available.

To make it clear: don't bother actually implementing this now. You can
add a placeholder to remind you about that later (or just change the
contents of that drop-down to do this instead), nothing else.
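
Just as a note for later, the two numbers could be computed roughly
like this. This is only a sketch: the function name, the input format
(one dictionary per measure, keyed by month) and the numbers are
assumptions for illustration:

```python
def monthly_numbers(installs, submissions):
    """Return {month: (absolute, relative)} for one port.

    installs    -- {month: number of submissions that include the port}
    submissions -- {month: total number of submissions in that month}
    """
    result = {}
    for month, total in sorted(submissions.items()):
        absolute = installs.get(month, 0)
        # Avoid dividing by zero for months without any submissions.
        relative = absolute / total if total else 0.0
        result[month] = (absolute, relative)
    return result


# Made-up numbers for illustration.
installs = {"2018-01": 120, "2018-02": 150}
submissions = {"2018-01": 1000, "2018-02": 1500, "2018-03": 0}
print(monthly_numbers(installs, submissions))
```

In this made-up example the absolute number goes up from 120 to 150,
while the relative share drops from 12% to 10%, which is exactly why
having both is useful.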

We are mainly interested in the cumulative number of installations of
a particular port. The version does tell us something, but not *that*
much, except that the user did not update the ports for at least a
month. We could potentially make a cumulative diagram listing all
versions, random example:
    https://kanbanize.com/blog/wp-content/uploads/2014/01/Cumulativeflowfinal.png
but I would worry about that *at the very end*.

A much better *global* measure would be the time since the user last
updated PortIndex, but I have no clue how to get that information in
a reliable way (and it's certainly not your task to worry about it).

Further comments:

* Some more items from the proposal are still missing, like whether
the package is outdated, latest commits, link to tickets, ... No need
to do anything fancy, just put some placeholder there.

* Build statistics will need more work. I mean: the table as it is
looks nice. But we'll probably want to represent the information in
two different ways: one listing all builds the way you did now, and
the other one approximately like this:
    https://trac.macports.org/ticket/55978#Viewnr.3:Overviewofhistoryofbuildsofaparticularport

* I'll save more nitpicking for later :)

Mojca