Graceful Recovery From malloc Failure
S. Gilles
2017-08-04
Background
An argu^Wpolite discussion I've seen play out a few times so far
goes like this:
A: I have created a new library/operating system/programming
language for you to use. Please enjoy.
B: I have a bug report: the memory allocations are not checked
for failure. Very bad things will happen on line N + k if
the memory allocation on line N failed.
A: Thank you for the report - I've made a fix.
B: Your fix is to call abort(3) if malloc(3) ever fails. This
is better, but could you instead clean up the heap and
report failure to the user?
A: That would be really hard, and what are you going to do
besides abort() anyway?
In my experience, by this point in the conversation negotiations
have broken down a little, and no progress is made.
Here's a perfect example of what could be done besides abort():
Background (topology)
For my research, I'm interested in generalizing (very slightly)
Thurston's Gluing Equations: for any hyperbolic 3-manifold M, if
you are interested in listing all possible “nice” maps
f : { loops in M } → { Some Lie group G }
it turns out that it suffices to cut M up into a (finite) number
of tetrahedra, and then associate to each tetrahedron a (finite)
set of (non-zero) complex numbers: the A-coordinates. Performing
some mechanical computations, you get a second (finite) set of
(non-zero) complex numbers: the X-coordinates.
Not only that, but the {A,X}-coordinates are associated to
particular geometric locations on those tetrahedra, and that f(γ)
can be calculated by (roughly) multiplying together the X-coordinates
that lie along the loop γ.
For the manifold M = S³ \ 4₁ (the complement of the figure-eight
knot, which has a hyperbolic structure), for example, it turns
out that there are only 2 tetrahedra, and considering maps into
SO(5,ℝ) requires 23 X-coordinates for each tetrahedron
(according to one convention, if I've counted correctly). So if
you know 46 (non-zero) complex numbers, you can compute f.
It turns out 46 numbers are too many. Some always occur in pairs
of (x, 1/x), some may not be used in recovering f, etc. The
question then arises: “What is the exact set of restrictions we
need to put on these 46 numbers to guarantee that they correspond
to some f?”
My research is to try to answer this question (using specific
“quivers” developed recently, and independently, by Zickert and
Le) when G is a Lie group of type Bₙ, Cₙ or (hopefully) G₂. My approach
is not very clever, but it will give context to the situation in
which I care about memory errors.
I fix a canonical set of A-coordinates as a basis/super-basis,
then force the assumption that the A-coordinates belong to a
triangulation (these equations aren't difficult). Then I represent
all X-coordinates as rational polynomials in terms of that basis.
Next, I evaluate those polynomials for some pseudorandom assignments
of complex numbers to A-coordinates. This allows me to make some
guesses about which products of X-coordinates will be 1. When I
symbolically confirm a guess, I obtain a restriction on the
X-coordinates, which must be satisfied for an X-coordinate
assignment to correspond to a triangulation.
Once I have enough necessary restrictions, I'll test whether they
are sufficient through other means; for now, the important thing
is to multiply together as many different combinations of
polynomials as possible and list all the ways to get 1.
The code in question:
X := { the rational polynomials representing the X-coordinates }
C := make_guesses_via_numerical_approximation(X)
for c in C, (c is a list with length n) {
    if (Prod over i=1,2,…,n of X[c[i]] is 1) { (*)
        output({ X[c[i]] : i=1,2,…,n })
    }
}
Line (*) is performed using the PARI library[0]. Regardless of
whatever optimizations are applied to obtain C, some products
must be computed to check if some variables cancel. For reference,
one of the worst elements of X, in one computation, is
x'_{1'} = ((a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁³·a₀₃·a₂₃²·a₀² + ((a'₁³·a'₃₂²·a₁³
+ a'₂·a'₁³·a'₃₂·a₂·a₁)·a₀₃·a₂₃² + (a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁⁵ +
a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁³)·a₃₂)·a₀ + (a'₁³·a'₃₂²·a₁⁵ +
2·a'₂·a'₁³·a'₃₂·a₂·a₁³ + a'₂²·a'₁³·a₂²·a₁)·a₃₂)·aₛ² +
(((a'₂·a'₁³·a'₃₂·a₁·a₀₃·a₃₀ + a'₂·a'₁²·a'₃₀·a'₃₂·a₁²·a₀₃)·a₂₃³
+ (a'₂·a'₁²·a'₃₂·a'₂₃·a₁²·a₀₃·a₃₀ +
a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁³·a₀₃)·a₂₃²)·a₀² + (((a'₂·a'₁³·a'₃₂·a₁³
+ a'₂²·a'₁³·a₂·a₁)·a₃₀ + (a'₂·a'₁²·a'₃₀·a'₃₂·a₁⁴ +
a'₂²·a'₁²·a'₃₀·a₂·a₁²))·a₃₂·a₂₃ + ((a'₂·a'₁²·a'₃₂·a'₂₃·a₁⁴ +
a'₂²·a'₁²·a'₂₃·a₂·a₁²)·a₃₀ + (a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁⁵ +
a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁³))·a₃₂)·a₀)·aₛ)/((a'₀₃²·a'₂₃³·a₁²·a₀₃·a₂₃³·a₀³
+ (a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁²·a₀₃·a₂₃³ +
2·a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁·a₀₃·a₂₃² + a'₀₃²·a'₂₃³·a₁⁴·a₃₂·a₂₃)·a₀²
+ ((a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁⁴ + a'₂·a'₁²·a'₀₃·a'₂₃·a₂·a₁²)·a₃₂
+ (a'₂·a'₁²·a'₃₂·a'₂₃·a₂·a₁² + a'₂²·a'₁²·a'₂₃·a₂²)·a₀₃)·a₂₃·a₀)·aₛ²
+ ((2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁·a₀₃·a₃₀ +
2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁²·a₀₃)·a₂₃³·a₀³ +
((2·a'₂²·a'₁²·a'₂₃·a₂·a₀₃·a₃₀ + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁·a₀₃)·a₂₃²
+ (2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁³·a₃₀ +
2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁⁴)·a₃₂·a₂₃)·a₀²)·aₛ +
((a'₂²·a'₁²·a'₂₃·a₀₃·a₃₀² + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁·a₀₃·a₃₀
+ a'₂²·a'₃₀²·a'₂₃·a₁²·a₀₃)·a₂₃³·a₀³ + (a'₂²·a'₁²·a'₂₃·a₁²·a₃₀²
+ 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁³·a₃₀ + a'₂²·a'₃₀²·a'₂₃·a₁⁴)·a₃₂·a₂₃·a₀²))
This takes, I estimate, somewhere on the order of 10KiB for PARI
to store in a way that can be used efficiently.
If a few elements of that complexity are multiplied together, and
the product doesn't cancel to 1, it's entirely possible that
PARI's internal representation of that result will take more than
4GiB of memory (which appears to be the maximum size I can make
PARI's stack).
Unfortunately, PARI's stack-based operations kill the program on
failure. In my particular situation, overflowing PARI's stack is
something I could recover from: I'd just output a warning that
that particular product needs to be checked by hand.
This is precisely the sort of example that I think B should be giving:
programs performing memory-intensive operations, some of which
are optional. I am content with a small number of warnings on
stderr regarding products which are difficult to compute. I am
not content with a 12 hour computation aborting 7 hours in.
Fixes
In my particular case, there are fixes. PARI primarily works off
its own internal stack, but objects can be moved to main memory
if needed. This removes the 4GiB limit.
More importantly, however, PARI exposes the top, bottom, and
current position of its internal stack. This allows me to monitor
memory usage in (*), and cleanly skip the computation if too much
of the stack memory starts being used.
It also turns out that continual memory growth is an excellent
heuristic for detecting that a computation is probably not going
to cancel to 1. If
o all factors have roughly the same initial size,
o “unlucky” multiplications increase memory usage exponentially,
and
o such growth has been observed after multiplying the first
  ⌈n/2⌉ of the n factors,
it's very improbable that the last few factors have enough
complexity to cancel the intermediate product, so we can abandon
the computation, skipping the last (and probably the most intensive)
steps. This also allows me to use low 10s of MiB, not GiB, for
PARI's stack.
[0] http://pari.math.u-bordeaux.fr