# Graceful Recovery From malloc Failure

## Found at: sdf.org:70/users/sgilles/phlog/2017-08-04_graceful_recovery_from_malloc_failure.txt


S. Gilles

2017-08-04

Background

An argu^Wpolite discussion I've seen play out a few times so far

goes like this:

A: I have created a new library/operating system/programming

language for you to use. Please enjoy.

B: I have a bug report: the memory allocations are not checked

for failure. Very bad things will happen on line N + k if

the memory allocation on line N failed.

A: Thank you for the report - I've made a fix.

B: Your fix is to call abort(3) if malloc(3) ever fails. This

is better, but could you instead clean up the heap and

report failure to the user?

A: That would be really hard, and what are you going to do

besides abort() anyway?

In my experience, by this point in the conversation negotiations

have broken down a little, and no progress is made.
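For concreteness, B's request (clean up and report failure instead of aborting) can look like the following C sketch. The structure and names here are hypothetical, not taken from any particular library; the point is only the shape of the error path.

```c
#include <stdlib.h>

/* Hypothetical example: a structure whose two buffers must both
 * exist, or neither. */
struct pair {
    double *xs;
    double *ys;
};

/* Returns a fully-initialized pair, or NULL after releasing any
 * partial allocation.  The caller decides how to degrade; the
 * library never calls abort(). */
struct pair *pair_create(size_t n)
{
    struct pair *p = malloc(sizeof *p);
    if (!p)
        return NULL;
    p->xs = malloc(n * sizeof *p->xs);
    p->ys = malloc(n * sizeof *p->ys);
    if (!p->xs || !p->ys) {
        free(p->xs);    /* free(NULL) is a harmless no-op */
        free(p->ys);
        free(p);
        return NULL;
    }
    return p;
}

void pair_destroy(struct pair *p)
{
    if (p) {
        free(p->xs);
        free(p->ys);
        free(p);
    }
}
```

On failure the caller gets NULL and an intact heap, and can report, retry with a smaller n, or skip the work, rather than having the decision made for it.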

Here's a perfect example of what could be done besides abort():

Background (topology)

For my research, I'm interested in generalizing (very slightly)

Thurston's Gluing Equations: for any hyperbolic 3-manifold M, if

you are interested in listing all possible “nice” maps

f : { loops in M } → { Some Lie group G }

it turns out that it suffices to cut M up into a (finite) number

of tetrahedra, and then associate to each tetrahedron a (finite)

set of (non-zero) complex numbers: the A-coordinates. Performing

some mechanical computations, you get a second (finite) set of

(non-zero) complex numbers: the X-coordinates.

Not only that, but the {A,X}-coordinates are associated to

particular geometric locations on those tetrahedra, and that f(γ)

can be calculated by (roughly) multiplying together the X-coordinates

that lie along the loop γ.

For the manifold M = S³ \ 4₁ (the complement of the figure-eight

knot, which has a hyperbolic structure), for example, it turns

out that there are only 2 tetrahedra, and to consider maps into

SO(5,ℝ) requires 23 X-coordinates for each tetrahedron

(according to one convention, if I've counted correctly). So if

you know 46 (non-zero) complex numbers, you can compute f.

It turns out 46 numbers are too many. Some are always in pairs

of (x, 1/x), some may not be used in recovering f, etc. The

question then arises: “What is the exact set of restrictions we

need to put on these 46 numbers to guarantee that they correspond

to some f?”

My research is to (using specific “quivers” developed recently,

independently, by Zickert and Le) try and answer this question

when G is a Lie group of type Bₙ, Cₙ or (hopefully) G₂. My approach

is not very clever, but it will give context to the situation in

which I care about memory errors.

I fix a canonical set of A-coordinates as a basis/super-basis,

then force the assumption that the A-coordinates belong to a

triangulation (these equations aren't difficult). Then I represent

all X-coordinates as rational polynomials in terms of that basis.

Next, I evaluate those polynomials for some pseudorandom assignments

of complex numbers to A-coordinates. This allows me to make some

guesses about which products of X-coordinates will be 1. When I

symbolically confirm a guess, I obtain a restriction on the

X-coordinates, which must be satisfied for an X-coordinate

assignment to correspond to a triangulation.

Once I have enough necessary restrictions, I'll test whether they

are sufficient through other means, so for now the important thing

is to multiply together as many different combinations of

polynomials as possible and list all the ways to get 1.

The code in question:

X := { the rational polynomials representing the X-coordinates }

C := make_guesses_via_numerical_approximation(X)

for c in C, (c is a list with length n) {

    if (Prod over i=1,2,…,n of X[c[i]] is 1) {    (*)

        output({ X[c[i]] : i=1,2,…,n })

    }

}

Line (*) is performed using the PARI library[0]. Regardless of

whatever optimizations are applied to obtain C, some products

must be computed to check if some variables cancel. For reference,

one of the worst elements of X, in one computation, is

x'_{1'} = ((a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁³·a₀₃·a₂₃²·a₀² + ((a'₁³·a'₃₂²·a₁³

+ a'₂·a'₁³·a'₃₂·a₂·a₁)·a₀₃·a₂₃² + (a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁⁵ +

a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁³)·a₃₂)·a₀ + (a'₁³·a'₃₂²·a₁⁵ +

2·a'₂·a'₁³·a'₃₂·a₂·a₁³ + a'₂²·a'₁³·a₂²·a₁)·a₃₂)·aₛ² +

(((a'₂·a'₁³·a'₃₂·a₁·a₀₃·a₃₀ + a'₂·a'₁²·a'₃₀·a'₃₂·a₁²·a₀₃)·a₂₃³

+ (a'₂·a'₁²·a'₃₂·a'₂₃·a₁²·a₀₃·a₃₀ +

a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁³·a₀₃)·a₂₃²)·a₀² + (((a'₂·a'₁³·a'₃₂·a₁³

+ a'₂²·a'₁³·a₂·a₁)·a₃₀ + (a'₂·a'₁²·a'₃₀·a'₃₂·a₁⁴ +

a'₂²·a'₁²·a'₃₀·a₂·a₁²))·a₃₂·a₂₃ + ((a'₂·a'₁²·a'₃₂·a'₂₃·a₁⁴ +

a'₂²·a'₁²·a'₂₃·a₂·a₁²)·a₃₀ + (a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁⁵ +

a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁³))·a₃₂)·a₀)·aₛ)/((a'₀₃²·a'₂₃³·a₁²·a₀₃·a₂₃³·a₀³

+ (a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁²·a₀₃·a₂₃³ +

2·a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁·a₀₃·a₂₃² + a'₀₃²·a'₂₃³·a₁⁴·a₃₂·a₂₃)·a₀²

+ ((a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁⁴ + a'₂·a'₁²·a'₀₃·a'₂₃·a₂·a₁²)·a₃₂

+ (a'₂·a'₁²·a'₃₂·a'₂₃·a₂·a₁² + a'₂²·a'₁²·a'₂₃·a₂²)·a₀₃)·a₂₃·a₀)·aₛ²

+ ((2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁·a₀₃·a₃₀ +

2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁²·a₀₃)·a₂₃³·a₀³ +

((2·a'₂²·a'₁²·a'₂₃·a₂·a₀₃·a₃₀ + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁·a₀₃)·a₂₃²

+ (2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁³·a₃₀ +

2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁⁴)·a₃₂·a₂₃)·a₀²)·aₛ +

((a'₂²·a'₁²·a'₂₃·a₀₃·a₃₀² + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁·a₀₃·a₃₀

+ a'₂²·a'₃₀²·a'₂₃·a₁²·a₀₃)·a₂₃³·a₀³ + (a'₂²·a'₁²·a'₂₃·a₁²·a₃₀²

+ 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁³·a₃₀ + a'₂²·a'₃₀²·a'₂₃·a₁⁴)·a₃₂·a₂₃·a₀²))

This takes, I estimate, somewhere on the order of 10KiB for PARI

to store in a way that can be used efficiently.

If a few elements of that complexity are multiplied together, and

the product doesn't cancel to 1, it's entirely possible that

PARI's internal representation of that result will take more than

4GiB of memory (which appears to be the maximum size I can make

PARI's stack).

Unfortunately, PARI's stack-based operations kill the program on

failure. In my particular situation, overflowing PARI's stack is

something I could recover from: I'd just output a warning that

that particular product needs to be checked by hand.

This is precisely the sort of example that I think B should be giving:

programs performing memory-intensive operations, some of which

are optional. I am content with a small number of warnings on

stderr regarding products which are difficult to compute. I am

not content with a 12 hour computation aborting 7 hours in.

Fixes

In my particular case, there are fixes. PARI primarily works off

its own internal stack, but objects can be moved to main memory

if needed. This removes the 4GiB limit.

More importantly, however, PARI exposes the top, bottom, and

current position of its internal stack. This allows me to monitor

memory usage in (*), and cleanly skip the computation if too much

of the stack memory starts being used.
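A sketch of that guard, with the PARI-specific probe abstracted into a callback so the skeleton stands alone. With libpari the probe would compare the stack pointer `avma` against the exposed stack bounds (in recent versions, roughly `(pari_mainstack->top - avma)` out of `(pari_mainstack->top - pari_mainstack->bot)`); the names and thresholds below are illustrative, not the author's actual code.

```c
#include <stddef.h>

/* Abstracted probe: fraction of the working stack currently in use.
 * With libpari this would be computed from avma and the stack bounds
 * (assumption: pari_mainstack field names as in PARI >= 2.9). */
typedef double (*usage_fn)(void);

/* Multiply n factors via the caller-supplied step function, but give
 * up cleanly if stack usage passes the threshold.  Returns 1 if the
 * full product was computed, 0 if it was skipped. */
static int guarded_product(int n,
                           void (*multiply_next)(int i),
                           usage_fn usage,
                           double threshold)
{
    for (int i = 0; i < n; i++) {
        if (usage() > threshold)
            return 0;   /* warn; check this product by hand later */
        multiply_next(i);
    }
    return 1;
}

/* --- simulated environment, standing in for PARI, for demonstration --- */
static double sim_usage_frac = 0.0;
static double sim_usage(void) { return sim_usage_frac; }
static void   sim_multiply(int i)   /* each factor "costs" 25% of stack */
{
    (void)i;
    sim_usage_frac += 0.25;
}
```

A skipped product costs one line on stderr instead of an abort hours into the run.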

It also turns out that checking for continual memory growth is

an excellent heuristic for predicting that a computation is not going

to cancel to 1. If

o all factors have roughly the same initial size,

o “unlucky” multiplications increase memory usage exponentially,

and

o such growth has already been observed after multiplying the

first ceil(n/2) of the n factors,

it's very improbable that the last few factors have enough

complexity to cancel the intermediate product, so we can abandon

the computation, skipping the last (and probably the most intensive)

steps. This also allows me to use low 10s of MiB, not GiB, for

PARI's stack.
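That early-abandon heuristic might be sketched as follows. Sizes are abstract units standing in for the PARI-object size of the running product; the growth factor and the half-way rule are the illustrative criteria from the text, not tuned constants.

```c
#include <stddef.h>

/* Decide whether to abandon a product of n_total factors after
 * `steps` multiplications, given the recorded size of the running
 * product after each step.  Abandon only when we are at least half
 * way through AND every step so far grew the size by growth_factor:
 * sustained growth makes a late cancellation to 1 very improbable. */
static int should_abandon(const size_t *sizes, int steps, int n_total,
                          double growth_factor)
{
    if (2 * steps < n_total)    /* fewer than ceil(n/2) factors so far */
        return 0;
    for (int i = 1; i < steps; i++)
        if ((double)sizes[i] < growth_factor * (double)sizes[i - 1])
            return 0;           /* growth was not sustained: keep going */
    return 1;                   /* skip the last, most expensive steps */
}
```

With, say, sizes 10, 25, 70, 200 units after four of six factors and a growth factor of 2, the product is abandoned; a flat size history is carried through to the end.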

[0] http://pari.math.u-bordeaux.fr