N-Body benchmark and the refFlatten and the deepFlatten optimizations

View: New views
2 Messages — Rating Filter:   Alert me  

N-Body benchmark and the refFlatten and the deepFlatten optimizations

by Vesa Karvonen-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

While changing my n-body benchmark implementation in mltonlib

  http://mlton.org/cgi-bin/viewsvn.cgi/mltonlib/trunk/org/mlton/vesak/toys/n-body/

to use a separate library for 3d vector arithmetic, I realized that there
was basically no reason why the benchmark should be allocating memory.
Yet, gc-summary showed that it allocated a total of about 3G bytes.

So, I spent some time reading the generated assembly code (actually, I
first tried allocation profiling, but, in retrospect, failed to interpret
the results properly to find the exact place of unwanted allocation) and
noticed that the pos field of a body (see the benchmark code) seemed to be
using the traditional "pointer to mutable pointer to value"
representation.  This meant that the line

   ; #pos a := pos a |+| vel a |* dt

allocated one vector per iteration of the advance function.  This also
meant that neither the refFlatten (http://mlton.org/RefFlatten) nor the
deepFlatten (http://mlton.org/DeepFlatten) optimizations were applied.
Either one of these optimizations should have eliminated the allocation,
AFAIU.

I'm not sure what exactly causes the optimizations to be missed.  I would
assume that refFlatten is trickier to apply and requires a more complex
analysis.  The deepFlatten page does not explain the associated analysis
in any way (and I haven't looked at the implementation of deepFlatten).  I
find it slightly surprising that the deepFlatten optimization is not
applied in this case.  The objects are finite sized and not very large (3
reals (+ overhead)).  Shouldn't this make it safe (for space) to apply
deepFlatten?  Of course there are various trade-offs here (between
allocation, copying, sharing), and it is impossible to make optimal
decisions in every case.

The comments on the refFlatten page led me to think that the problem (not
applying either optimization) might be caused by the code that creates the
system (see the benchmark code) by mapping over a list during which ref
cells are being allocated.  So, I manually eliminated the map call.  This
obviously changed the results of some analysis, eliminated the allocation
and improved performance considerably from about 32 to about 26
seconds.  If I interpret the generated code correctly, then the
deepFlatten optimization was now applied, but not refFlatten, which
should give even better results by avoiding one more indirection.

It would seem that improving the analyses of the refFlatten and deepFlatten
optimizations, so that they could be applied in more cases, could have
significant performance benefits.

Below is a patch to replace the system definition in the benchmark, which
improves performance significantly by elimination allocation as discussed.

-Vesa Karvonen


Index: n-body.sml
===================================================================
--- n-body.sml (revision 6665)
+++ n-body.sml (working copy)
@@ -36,41 +36,37 @@
 fun vel (b : body) = ! (#vel b)

 val system =
-    map (fn {pos, vel, mass} =>
-            {pos = ref pos,
-             vel = ref (vel |* daysPerYear),
-             mass = mass * solarMass})
-        [{pos = {x = 0.0, y = 0.0, z = 0.0},
-          vel = {x = 0.0, y = 0.0, z = 0.0},
-          mass = 1.0},
-         {pos = {x = 4.84143144246472090,
+    [{pos = ref {x = 0.0, y = 0.0, z = 0.0},
+      vel = ref {x = 0.0, y = 0.0, z = 0.0},
+      mass = solarMass},
+     {pos = ref {x = 4.84143144246472090,
                  y = ~1.16032004402742839,
                  z = ~1.03622044471123109e~1},
-          vel = {x = 1.66007664274403694e~3,
-                 y = 7.69901118419740425e~3,
-                 z = ~6.90460016972063023e~5},
-          mass = 9.54791938424326609e~4},
-         {pos = {x = 8.34336671824457987,
+      vel = ref ({x = 1.66007664274403694e~3,
+                  y = 7.69901118419740425e~3,
+                  z = ~6.90460016972063023e~5} |* daysPerYear),
+      mass = 9.54791938424326609e~4 * solarMass},
+     {pos = ref {x = 8.34336671824457987,
                  y = 4.12479856412430479,
                  z = ~4.03523417114321381e~1},
-          vel = {x = ~2.76742510726862411e~3,
-                 y = 4.99852801234917238e~3,
-                 z = 2.30417297573763929e~5},
-          mass = 2.85885980666130812e~4},
-         {pos = {x = 1.28943695621391310e1,
+      vel = ref ({x = ~2.76742510726862411e~3,
+                  y = 4.99852801234917238e~3,
+                  z = 2.30417297573763929e~5} |* daysPerYear),
+      mass = 2.85885980666130812e~4 * solarMass},
+     {pos = ref {x = 1.28943695621391310e1,
                  y = ~1.51111514016986312e1,
                  z = ~2.23307578892655734e~1},
-          vel = {x = 2.96460137564761618e~3,
-                 y = 2.37847173959480950e~3,
-                 z = ~2.96589568540237556e~5},
-          mass = 4.36624404335156298e~5},
-         {pos = {x = 1.53796971148509165e1,
+      vel = ref ({x = 2.96460137564761618e~3,
+                  y = 2.37847173959480950e~3,
+                  z = ~2.96589568540237556e~5} |* daysPerYear),
+      mass = 4.36624404335156298e~5 * solarMass},
+     {pos = ref {x = 1.53796971148509165e1,
                  y = ~2.59193146099879641e1,
                  z = 1.79258772950371181e~1},
-          vel = {x = 2.68067772490389322e~3,
-                 y = 1.62824170038242295e~3,
-                 z = ~9.51592254519715870e~5},
-          mass = 5.15138902046611451e~5}]
+      vel = ref ({x = 2.68067772490389322e~3,
+                  y = 1.62824170038242295e~3,
+                  z = ~9.51592254519715870e~5} |* daysPerYear),
+      mass = 5.15138902046611451e~5 * solarMass}]

 fun advance dt =
  fn []    => ()

_______________________________________________
MLton mailing list
MLton@...
http://mlton.org/mailman/listinfo/mlton

Re: N-Body benchmark and the refFlatten and the deepFlatten optimizations

by Matthew Fluet-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, 2 Jul 2008, Vesa Karvonen wrote:

> I'm not sure what exactly causes the optimizations to be missed.  I would
> assume that refFlatten is trickier to apply and requires a more complex
> analysis.  The deepFlatten page does not explain the associated analysis
> in any way (and I haven't looked at the implementation of deepFlatten).  I
> find it slightly surprising that the deepFlatten optimization is not
> applied in this case.  The objects are finite sized and not very large (3
> reals (+ overhead)).  Shouldn't this make it safe (for space) to apply
> deepFlatten?  Of course there are various trade-offs here (between
> allocation, copying, sharing), and it is impossible to make optimal
> decisions in every case.
>
> The comments on the refFlatten page led me to think that the problem (not
> applying either optimization) might be caused by the code that creates the
> system (see the benchmark code) by mapping over a list during which ref
> cells are being allocated.  So, I manually eliminated the map call.  This
> obviously changed the results of some analysis, eliminated the allocation
> and improved performance considerably from about 32 to about 26
> seconds.  If I interpret the generated code correctly, then the
> deepFlatten optimization was now applied, but not refFlatten, which
> should give even better results by avoiding one more indirection.
>
> It would seem that improving the analyses of the refFlatten and deepFlatten
> optimizations, so that they could be applied in more cases, could have
> significant performance benefits.

A while back, Vesa and I chatted about this issue, and I thought it would
be worth recording the conclusions in the mail archive.

For Vesa's code, the interesting bit is how the 'system' value is
represented:

   type body = {pos : Vec3D.t Ref.t, vel : Vec3D.t Ref.t, mass : Real.t}
   val system : body list =
      let
         val f = fn {pos, vel, mass} =>
            {pos = ref pos,
             vel = ref (vel |* daysPerYear),
             mass = mass * solarMass}
      in
         map f
         [{pos = {x = 0.0, y = 0.0, z = 0.0},
           vel = {x = 0.0, y = 0.0, z = 0.0},
           mass = 1.0}, ...]
      end

For the original implementation above, at the end of the SSA2 optimization
passes, the 'body list' type is represented by the SSA2 type:

   list_11 = nil_7 of (unit)
           | ::_11 of ((real64 ref * real64 ref * real64 ref)
                       * ((real64 * real64 * real64) ref)
                       * real64
                       * list_11)

Thus, a cons cell is represented by an object with 4 fields; the first
field is a pointer to a 3-field object (each field of which is a mutable
real64); the second field is a pointer to a mutable pointer to a three
field object (each field of which is a real64); the third field is a
real64; the fourth field is a list_11.
Hence, the body record type has been 'inlined' into the specific list
type.  Also, the deepFlatten pass has 'pushed' the mutability of the vel
field into the three components of a Vec3D.t.  [MLton initially orders a
record type alphabetically by field name and various passes might reverse
the order of the entire tuple; so, we know that the field next to the mass
field will be the pos field, leaving the other field to be the vel field.]

Vesa noted that the line

   ; #pos a := pos a |+| vel a |* dt

ended up allocating a tuple, leading to a high allocation total.


Vesa gave an alternate implementation, equivalent to the following:

   val system : body list =
      let
         val f = fn {pos, vel, mass} =>
            {pos = ref pos,
             vel = ref (vel |* daysPerYear),
             mass = mass * solarMass}
      in
         [f {pos = {x = 0.0, y = 0.0, z = 0.0},
             vel = {x = 0.0, y = 0.0, z = 0.0},
             mass = 1.0}, ...]
      end

For this implementation, at the end of the SSA2 optimization passes, the
'body list' type is represented by the SSA2 type:

   list_10 = nil_6 of (unit)
           | ::_10 of ((real64 ref * real64 ref * real64 ref)
                       * (real64 ref * real64 ref * real64 ref)
                       * real64
                       * list_10)

Thus, a cons cell is represented by an object with 4 fields; the first
field is a pointer to a 3-field object (each field of which is a mutable
real64); the second field is a pointer to a 3-field object (each field of
which is a mutable real64); the third field is a real64; the fourth field
is a list_11.
Hence, the body record type has been 'inlined' into the specific list
type.  Also, the deepFlatten pass has 'pushed' the mutability of the vel
field into the three components of a Vec3D.t and 'pushed' the mutability
of the pos field into the three components of a Vec3D.t.

With this implementation, the line

   ; #pos a := pos a |+| vel a |* dt

ends up not allocating anything, simply mutating each of the three
components.

Both of these SSA2 type representations are established after the
deepFlatten pass (but before the refFlatten pass).  In both, the
refFlatten pass does nothing (with respect to the 'body list' type). This
raises a couple of issues, and I don't have a complete understanding of
either of them.

First, it isn't clear why, in the original implementation, the 'vel' field
was deepFlatten-ed but the 'pos' field was not deepFlatten-ed.  There
doesn't seem to be a significant difference in the way the fields are
initialized and used in the program.  [That this was an issue didn't occur
to me earlier, Vesa and I didn't discuss it, and I don't have anything
more to say on it, other than it is an interesting question, and perhaps
the best way of achieving the desired allocation behavior.]

Second, it wasn't clear why, in the original implementation, the 'pos'
field wasn't subject to being refFlatten-ed (given that the 'pos' field
wasn't deepFlatten-ed).  That is, we might expect the refFlatten pass to
take

   list_11 = nil_7 of (unit)
           | ::_11 of ((real64 ref * real64 ref * real64 ref)
                       * ((real64 * real64 * real64) ref)
                       * real64
                       * list_11)

to

   list_11 = nil_7 of (unit)
           | ::_11 of ((real64 ref * real64 ref * real64 ref)
                       * (real64 * real64 * real64) ref
                       * real64
                       * list_11)

In retrospect, this wouldn't actually change the allocation behavior of
the program significantly.  It would save one indirection (and one
allocation at the allocation of the 'system' value), but it would still
require one tuple allocation for each iteration.  Nonetheless, it is
interesting to note why the reference isn't flattened into the containing
data structure.

One requirement for the refFlatten optimization is that the 'ref' cell
allocation occurs in the same block as the containing object allocation.
With regards to the original code above, this means that for each ::_11
allocation, there should be a 'ref' cell allocation.  Looking at the SSA2
IL code for the program into the refFlatten pass, we see two ::_11
allocations:

   L_271 (x_339: list_11,
  x_338: (real64 * real64 * real64),
  x_337: (real64 * real64 * real64),
  x_336: real64,
  x_335: list_10)
     x_342: ((real64 * real64 * real64) ref) = (x_337)
     x_2396: real64 = #2 x_338
     x_2397: real64 = #1 x_338
     x_2398: real64 = #0 x_338
     x_347: real64 = Real64_mul (global_171, x_2398)
     x_346: real64 = Real64_mul (global_171, x_2397)
     x_345: real64 = Real64_mul (global_171, x_2396)
     x_343: (real64 ref * real64 ref * real64 ref) = (x_347, x_346, x_345)
     x_341: real64 = Real64_mul (x_333, x_336)
     x_2158: ::_11 of ((real64 ref * real64 ref * real64 ref)
       * ((real64 * real64 * real64) ref)
       * real64
       * list_11) = ::_11 (x_343, x_342, x_341, x_339)
     x_340: list_11 = x_2158: list_11
     case x_335 of
       nil_6 => L_273 | ::_3 => L_1326

and

   L_274 (x_355: list_11,
  x_354: (real64 ref * real64 ref * real64 ref),
  x_353: ((real64 * real64 * real64) ref),
  x_352: real64,
  x_351: list_11)
     x_2164: ::_11 of ((real64 ref * real64 ref * real64 ref)
       * ((real64 * real64 * real64) ref)
       * real64
       * list_11) = ::_11 (x_354, x_353, x_352, x_355)
     x_356: list_11 = x_2164: list_11
     case x_351 of
       nil_7 => L_276 | ::_11 => L_1327

The first ::_11 allocation is the obvious one, that includes the ref cell
allocation (x_342 = (x_337)).  The second ::_11 allocation does not
include a ref cell allocation, so the 'pos' field is not susceptible to
refFlatten-ing.  The second ::_11 allocation corresponds to a list
reversal that is part of the implementation of map.  [A list reversal is
used so that map is implemented by two tail-recursive loops, rather than
one non-tail-recursive loop.]  In general, one can't perform the
refFlatten-ing optimization in this case.  Indeed, the ref cells are
copied from the old list to the new list.  Unless we 'know' that the old
list is dead, we can't flatten for fear of breaking sharing. (Someone
might have a copy of the old list, and we need to see the same
references.)

We can change the implementation to the following:

   val system : body list =
      let
         val f = fn {pos, vel, mass} =>
            {pos = ref pos,
             vel = ref (vel |* daysPerYear),
             mass = mass * solarMass}
      in
         foldr (fn (x,l) => (f x)::l) []
         [{pos = {x = 0.0, y = 0.0, z = 0.0},
           vel = {x = 0.0, y = 0.0, z = 0.0},
           mass = 1.0}, ...]
      end

There is still a list reversal as part of the implementation of foldr, but
now the input list is reversed.  (One can't take this as the
implementation of map, because it would apply the mapped function to the
elements in the wrong order.)  The deepFlatten pass still yields the same
list_11 representation for the 'body list' type, but now the SSA2 program
into the refFlatten pass has exactly one ::_11 allocation (nearly
alpha-equivalent to the L_271 block above).  However, the 'pos'
field is still not subject to being refFlatten-ed.

Another (syntactic) requirement of the refFlatten optimization is that the
ref cell updates occur in the same block as the selection of the ref cell
from its containing data-structure.  This requirement is violated in the
SSA2 program:

   L_389 (x_567: (real64 ref * real64 ref * real64 ref),
  x_566: ((real64 * real64 * real64) ref),
  x_565: list_10)
     x_2456: (real64 * real64 * real64) = #0 x_566
     x_2457: real64 = #2 x_567
     x_2458: real64 = #1 x_567
     x_2459: real64 = #0 x_567
     x_576: real64 = Real64_mul (x_2459, global_175)
     x_575: real64 = Real64_mul (x_2458, global_175)
     x_573: real64 = Real64_mul (x_2457, global_175)
     x_2453: real64 = #2 x_2456
     x_2454: real64 = #1 x_2456
     x_2455: real64 = #0 x_2456
     x_571: real64 = Real64_add (x_2455, x_576)
     x_570: real64 = Real64_add (x_575, x_2454)
     x_569: real64 = Real64_add (x_573, x_2453)
     x_568: (real64 * real64 * real64) = (x_571, x_570, x_569)
     x_566 := x_568
     case x_565 of
       nil_7 => L_391 | ::_11 => L_1357

The 'pos' field reference is x_566, which is passed as a first-class value
into the L_389 block, rather than selected from the cons cell within the
block.  The motivation for this syntactic requirement is to both make it
easy to find the containing data structure and to ensure that the
flattening of the ref cell into the containing data structure does not
violate space safety.

Neither of the syntactic requirements are semantic requirements -- there
exist programs that violate that the syntactic requirements that could
still validly optimized.  For example, consider the following common
idiom:

   val r = {a = ref (f x), b = ref (g y)}

If the code for g is not inlined (or includes a loop), then the ref cell
allocated for the a field will be separated from the allocation of the
record, because, in A-normal form, this code looks like:

   val r = let val fx = f x
               val ar = ref fx
               val gy = g y  (* point of possible block breaking *)
               val br = ref gy
               val r = {a = ar, b = br}
           in r end

One can restore the syntactic requirement in the above code as:

   val r = let val fx = f x
               val gy = g y
           in {a = ref fx, b = ref gy}
           end

The latter requirement is also commonly violated by programs, because, as
in the above code, components of a datatype variant are passed as
first-class values and, thus, become separated from their selection from
the containing data structure.

In any case, it seems clear that there are ways of improving the
deepFlatten and refFlatten optimizations.  Not to mention the fact that
there have been other reports of deepFlatten and refFlatten requiring
inordinate amounts of memory for large input programs.

_______________________________________________
MLton mailing list
MLton@...
http://mlton.org/mailman/listinfo/mlton
LightInTheBox - Buy quality products at wholesale price!