Speed up trace mode

Clemens Lang cal at macports.org
Mon Mar 9 13:29:40 PDT 2015


Hi Shashwat,

----- On 9 Mar, 2015, at 00:36, Shashwat Pandey devshashwatpandey at gmail.com wrote:

> 'Speed up trace mode' suggests implementation of a cache data structure to
> improve performance of trace mode.
> 
> I have been trying to understand the code related to trace mode. The
> darwintrace shared library gets linked to the process and the sandbox is
> defined in porttrace.tcl. The tracelib.c and .h files in pextlib1.0 set up
> the unix sockets to communicate with the darwintrace.dylib and build the
> filemap for darwintrace. Any violations of the sandbox are reported by
> darwintrace. The tracemode also respects recursively collected port
> dependencies by not counting them as violations which is done in
> portutil.tcl where the trace procedure actually executes.

Correct, that's a very good analysis of the code.


> I also read the gsoc2007 wiki for the project in which trace mode was
> implemented. The code has grown a lot in complexity since then.

Yes, there were a couple of things that either did not work correctly or were
not handled at all, such as filtering the output of readdir(3) or access(2).
I've added those and refactored some of the code. Eventually the code
became too large to maintain in a single file and was split across multiple
files. The rest of the complexity is due to symlink handling and path
normalization, where we cannot rely on realpath(3) because it calls some
of the functions we hook. Nor can we let the system handle symlink
resolution and path normalization, because symlinks might be registered
to packages that are not in the transitive dependency graph of a port.
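
To illustrate the kind of work darwintrace has to do by hand, here is a
minimal sketch of lexical path normalization (collapsing "//", "." and "..");
the function name is hypothetical, and the real code additionally has to
resolve each symlink component itself, since it cannot call realpath(3)
without re-entering its own hooks:

```c
#include <assert.h>
#include <string.h>

/* Lexically normalize an absolute path in place: collapse duplicate
 * slashes, "." and ".." components.  Hypothetical helper for
 * illustration only; it does not touch the filesystem. */
static void normalize_path(char *path) {
    char *out = path;
    const char *in = path;
    while (*in) {
        while (*in == '/') in++;          /* skip duplicate slashes */
        const char *start = in;
        while (*in && *in != '/') in++;   /* find end of component */
        size_t len = (size_t)(in - start);
        if (len == 0 || (len == 1 && start[0] == '.'))
            continue;                     /* empty or "." component */
        if (len == 2 && start[0] == '.' && start[1] == '.') {
            /* ".." pops the previous component, if any */
            while (out > path && *--out != '/') { }
            continue;
        }
        *out++ = '/';
        memcpy(out, start, len);
        out += len;
    }
    if (out == path) *out++ = '/';        /* path reduced to the root */
    *out = '\0';
}
```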

The symlink handling is still not theoretically correct and has known
flaws, but those are intentional trade-offs for speed. In its current
state, trace mode adds roughly 80% overhead compared to a normal build,
which is manageable, but still a lot.


> After reading all this I am still not able to understand the project. Is
> the cache data structure (trie) supposed to store the file paths of the
> sandbox? How does it improve the performance? What are the bottlenecks to
> the performance of trace mode?

The main bottleneck of trace mode is the socket communication required to
find out whether a file can be accessed or should be hidden. To outline why
that's slow, here's what it does:
 for each file or symlink (possibly multiple times until all symlinks have
 been followed):
   (1) write to the unix socket to request a file -> port lookup from the
       server in pextlib1.0/tracelib.c, one context switch
   (2) block the thread and wait for the answer, another context switch,
       plus time to wait for the scheduler and re-scheduling of the process
       after the result is available
   (3) the server process needs to read and start processing the request
       (depending on whether the server was currently running or not, that's
       another wait-for-scheduler cycle here)
   (4) the server queries the sqlite database on disk, which costs a
       couple of read operations from disk
   (5) send the result back to the client and block
   (6) the result is available; wake up and continue
We currently do this for each file inside /opt/local, even if the same file
was looked up a couple of milliseconds earlier, and even if the same thread
or a different thread in the same process queried the information before.
I would expect caching this lookup to speed up the process considerably;
since reading the opened file from disk is usually slower than the lookup
itself, trace mode's runtime could then be dominated by I/O performance
rather than by our code.
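
As a sketch of the idea, here is a toy in-process cache in front of the
server round trip; `query_server` is a hypothetical stand-in for the real
socket request to tracelib.c, and its toy policy just checks for the
/opt/local prefix:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_SLOTS 1024

struct cache_entry { char path[256]; int answer; bool used; };
static struct cache_entry cache[CACHE_SLOTS];
static int slow_lookups;   /* counts simulated socket round trips */

/* Hypothetical stand-in for the socket request to the server; the toy
 * policy allows anything under /opt/local. */
static int query_server(const char *path) {
    slow_lookups++;
    return strncmp(path, "/opt/local/", 11) == 0;
}

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Cached lookup: only falls back to the server on a miss.  The table
 * must stay sparser than CACHE_SLOTS for linear probing to terminate. */
static int cached_lookup(const char *path) {
    unsigned i = hash(path) % CACHE_SLOTS;
    while (cache[i].used) {
        if (strcmp(cache[i].path, path) == 0)
            return cache[i].answer;        /* hit: no context switches */
        i = (i + 1) % CACHE_SLOTS;         /* linear probing */
    }
    int answer = query_server(path);
    strncpy(cache[i].path, path, sizeof cache[i].path - 1);
    cache[i].answer = answer;
    cache[i].used = true;
    return answer;
}
```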

There's nothing in the answers the server computes that would differ across
processes or threads; ideally, I would thus propose a cache that is shared
among all processes and threads in the current build phase. In addition, it
is probably a bad idea to use locking in this code, because the hooked
functions can be called in contexts, such as signal handlers, where taking
a lock could lead to deadlocks.

So basically, I'm thinking about a cache that
 - uses shared memory with multiple processes reading and writing
   concurrently
 - uses lock-free synchronization to avoid deadlocks
 - uses a data structure that has quick key -> value lookups in practice
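
A minimal sketch of such a lock-free publish/lookup scheme, using C11
atomics in static storage for illustration; the real cache would live in
an mmap(2)'d shared memory segment, and all names here are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define SLOTS 64

/* One cache slot: 'ready' is set last, with release semantics, so
 * readers never observe a half-written entry. */
struct slot {
    char path[128];
    int answer;
    atomic_int ready;
};

static struct slot slots[SLOTS];
static atomic_uint next_free;

/* Lock-free insert: claim a slot with an atomic increment, fill it,
 * then publish it by flipping 'ready'. */
static int cache_insert(const char *path, int answer) {
    unsigned i = atomic_fetch_add(&next_free, 1);
    if (i >= SLOTS) return -1;             /* cache full */
    strncpy(slots[i].path, path, sizeof slots[i].path - 1);
    slots[i].answer = answer;
    atomic_store_explicit(&slots[i].ready, 1, memory_order_release);
    return 0;
}

/* Lock-free lookup: scans only entries whose 'ready' flag is visible. */
static int cache_find(const char *path, int *answer) {
    unsigned n = atomic_load(&next_free);
    if (n > SLOTS) n = SLOTS;
    for (unsigned i = 0; i < n; i++) {
        if (atomic_load_explicit(&slots[i].ready, memory_order_acquire)
                && strcmp(slots[i].path, path) == 0) {
            *answer = slots[i].answer;
            return 1;
        }
    }
    return 0;
}
```

The linear scan is only there to keep the sketch short; this is where a
trie or hash structure with the same publish-last discipline would go.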

Since the cache is going to grow, there needs to be a method to atomically
grow the shared memory segment, which, I think, is possible using
ftruncate(2), a variable indicating the current size, and a compare-and-swap
loop. Shrinking the cache may be hard, but I don't think we'd actually need
that; not deleting any data would have the added benefit of leaving us with
a representation of all files the build accessed at the end.

A trie is only one example of a data structure that could be used to achieve
this. There might be others, such as a collection of linked lists (and those
might even turn out to be fast enough in practice, although I wouldn't bet
on it).
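
For illustration, here is a minimal byte-wise trie with pointer links and
hypothetical names; a shared-memory version would have to use offsets into
the segment instead of pointers, and would insert nodes with the same
publish-last discipline described above:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal byte-wise trie mapping a path string to an answer.  256-way
 * nodes keep the code short at the cost of memory; a real implementation
 * would compress paths or key on path components. */
struct trie {
    struct trie *child[256];
    int has_value;
    int value;
};

static struct trie *trie_new(void) {
    return calloc(1, sizeof(struct trie));
}

static void trie_put(struct trie *t, const char *key, int value) {
    for (; *key; key++) {
        unsigned char c = (unsigned char)*key;
        if (!t->child[c]) t->child[c] = trie_new();
        t = t->child[c];
    }
    t->has_value = 1;
    t->value = value;
}

static int trie_get(const struct trie *t, const char *key, int *value) {
    for (; *key && t; key++)
        t = t->child[(unsigned char)*key];
    if (!t || !t->has_value) return 0;
    *value = t->value;
    return 1;
}
```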

So basically, we're looking for somebody who knows about system-level
programming against a POSIX API, lock-free synchronization using
compare-and-swap, efficient data structures, and IPC using shared memory,
and who ideally has a good understanding of the pitfalls of concurrency
involving both threads and signal handlers.


> Kindly explain this project idea in greater depth.
> 
> I thoroughly enjoyed working for MacPorts last GSoC. I hope I will get a
> chance to contribute to it again for this year's GSoC.

If you want to apply again, by all means, do so. I'm looking forward to it.

-- 
Clemens Lang

