|
View:
New views
2 Messages
—
Rating Filter:
Alert me
|
|
|
N-Body benchmark and the refFlatten and the deepFlatten optimizationsWhile changing my n-body benchmark implementation in mltonlib
http://mlton.org/cgi-bin/viewsvn.cgi/mltonlib/trunk/org/mlton/vesak/toys/n-body/ to use a separate library for 3d vector arithmetic, I realized that there was basically no reason why the benchmark should be allocating memory. Yet, gc-summary showed that it allocated a total of about 3G bytes. So, I spent some time reading the generated assembly code (actually, I first tried allocation profiling, but, in retrospect, failed to interpret the results properly to find the exact place of unwanted allocation) and noticed that the pos field of a body (see the benchmark code) seemed to be using the traditional "pointer to mutable pointer to value" representation. This meant that the line ; #pos a := pos a |+| vel a |* dt allocated one vector per iteration of the advance function. This also meant that neither the refFlatten (http://mlton.org/RefFlatten) nor the deepFlatten (http://mlton.org/DeepFlatten) optimizations were applied. Either one of these optimizations should have eliminated the allocation, AFAIU. I'm not sure what exactly causes the optimizations to be missed. I would assume that refFlatten is trickier to apply and requires a more complex analysis. The deepFlatten page does not explain the associated analysis in any way (and I haven't looked at the implementation of deepFlatten). I find it slightly surprising that the deepFlatten optimization is not applied in this case. The objects are finite sized and not very large (3 reals (+ overhead)). Shouldn't this make it safe (for space) to apply deepFlatten? Of course there are various trade-offs here (between allocation, copying, sharing), and it is impossible to make optimal decisions in every case. The comments on the refFlatten page led me to think that the problem (not applying either optimization) might be caused by the code that creates the system (see the benchmark code) by mapping over a list during which ref cells are being allocated. So, I manually eliminated the map call. This obviously changed the results of some analysis, eliminated the allocation and improved performance considerably from about 32 to about 26 seconds. If I interpret the generated code correctly, then the deepFlatten optimization was now applied, but not refFlatten, which should give even better results by avoiding one more indirection. It would seem that improving the analyses of the refFlatten and deepFlatten optimizations, so that they could be applied in more cases, could have significant performance benefits. Below is a patch to replace the system definition in the benchmark, which improves performance significantly by elimination allocation as discussed. -Vesa Karvonen Index: n-body.sml =================================================================== --- n-body.sml (revision 6665) +++ n-body.sml (working copy) @@ -36,41 +36,37 @@ fun vel (b : body) = ! (#vel b) val system = - map (fn {pos, vel, mass} => - {pos = ref pos, - vel = ref (vel |* daysPerYear), - mass = mass * solarMass}) - [{pos = {x = 0.0, y = 0.0, z = 0.0}, - vel = {x = 0.0, y = 0.0, z = 0.0}, - mass = 1.0}, - {pos = {x = 4.84143144246472090, + [{pos = ref {x = 0.0, y = 0.0, z = 0.0}, + vel = ref {x = 0.0, y = 0.0, z = 0.0}, + mass = solarMass}, + {pos = ref {x = 4.84143144246472090, y = ~1.16032004402742839, z = ~1.03622044471123109e~1}, - vel = {x = 1.66007664274403694e~3, - y = 7.69901118419740425e~3, - z = ~6.90460016972063023e~5}, - mass = 9.54791938424326609e~4}, - {pos = {x = 8.34336671824457987, + vel = ref ({x = 1.66007664274403694e~3, + y = 7.69901118419740425e~3, + z = ~6.90460016972063023e~5} |* daysPerYear), + mass = 9.54791938424326609e~4 * solarMass}, + {pos = ref {x = 8.34336671824457987, y = 4.12479856412430479, z = ~4.03523417114321381e~1}, - vel = {x = ~2.76742510726862411e~3, - y = 4.99852801234917238e~3, - z = 2.30417297573763929e~5}, - mass = 2.85885980666130812e~4}, - {pos = {x = 1.28943695621391310e1, + vel = ref ({x = ~2.76742510726862411e~3, + y = 4.99852801234917238e~3, + z = 2.30417297573763929e~5} |* daysPerYear), + mass = 2.85885980666130812e~4 * solarMass}, + {pos = ref {x = 1.28943695621391310e1, y = ~1.51111514016986312e1, z = ~2.23307578892655734e~1}, - vel = {x = 2.96460137564761618e~3, - y = 2.37847173959480950e~3, - z = ~2.96589568540237556e~5}, - mass = 4.36624404335156298e~5}, - {pos = {x = 1.53796971148509165e1, + vel = ref ({x = 2.96460137564761618e~3, + y = 2.37847173959480950e~3, + z = ~2.96589568540237556e~5} |* daysPerYear), + mass = 4.36624404335156298e~5 * solarMass}, + {pos = ref {x = 1.53796971148509165e1, y = ~2.59193146099879641e1, z = 1.79258772950371181e~1}, - vel = {x = 2.68067772490389322e~3, - y = 1.62824170038242295e~3, - z = ~9.51592254519715870e~5}, - mass = 5.15138902046611451e~5}] + vel = ref ({x = 2.68067772490389322e~3, + y = 1.62824170038242295e~3, + z = ~9.51592254519715870e~5} |* daysPerYear), + mass = 5.15138902046611451e~5 * solarMass}] fun advance dt = fn [] => () _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
|
|
Re: N-Body benchmark and the refFlatten and the deepFlatten optimizationsOn Wed, 2 Jul 2008, Vesa Karvonen wrote:
> I'm not sure what exactly causes the optimizations to be missed. I would > assume that refFlatten is trickier to apply and requires a more complex > analysis. The deepFlatten page does not explain the associated analysis > in any way (and I haven't looked at the implementation of deepFlatten). I > find it slightly surprising that the deepFlatten optimization is not > applied in this case. The objects are finite sized and not very large (3 > reals (+ overhead)). Shouldn't this make it safe (for space) to apply > deepFlatten? Of course there are various trade-offs here (between > allocation, copying, sharing), and it is impossible to make optimal > decisions in every case. > > The comments on the refFlatten page led me to think that the problem (not > applying either optimization) might be caused by the code that creates the > system (see the benchmark code) by mapping over a list during which ref > cells are being allocated. So, I manually eliminated the map call. This > obviously changed the results of some analysis, eliminated the allocation > and improved performance considerably from about 32 to about 26 > seconds. If I interpret the generated code correctly, then the > deepFlatten optimization was now applied, but not refFlatten, which > should give even better results by avoiding one more indirection. > > It would seem that improving the analyses of the refFlatten and deepFlatten > optimizations, so that they could be applied in more cases, could have > significant performance benefits. A while back, Vesa and I chatted about this issue, and I thought it would be worth recording the conclusions in the mail archive. For Vesa's code, the interesting bit is how the 'system' value is represented: type body = {pos : Vec3D.t Ref.t, vel : Vec3D.t Ref.t, mass : Real.t} val system : body list = let val f = fn {pos, vel, mass} => {pos = ref pos, vel = ref (vel |* daysPerYear), mass = mass * solarMass} in map f [{pos = {x = 0.0, y = 0.0, z = 0.0}, vel = {x = 0.0, y = 0.0, z = 0.0}, mass = 1.0}, ...] end For the original implementation above, at the end of the SSA2 optimization passes, the 'body list' type is represented by the SSA2 type: list_11 = nil_7 of (unit) | ::_11 of ((real64 ref * real64 ref * real64 ref) * ((real64 * real64 * real64) ref) * real64 * list_11) Thus, a cons cell is represented by an object with 4 fields; the first field is a pointer to a 3-field object (each field of which is a mutable real64); the second field is a pointer to a mutable pointer to a three field object (each field of which is a real64); the third field is a real64; the fourth field is a list_11. Hence, the body record type has been 'inlined' into the specific list type. Also, the deepFlatten pass has 'pushed' the mutability of the vel field into the three components of a Vec3D.t. [MLton initially orders a record type alphabetically by field name and various passes might reverse the order of the entire tuple; so, we know that the field next to the mass field will be the pos field, leaving the other field to be the vel field.] Vesa noted that the line ; #pos a := pos a |+| vel a |* dt ended up allocating a tuple, leading to a high allocation total. Vesa gave an alternate implementation, equivalent to the following: val system : body list = let val f = fn {pos, vel, mass} => {pos = ref pos, vel = ref (vel |* daysPerYear), mass = mass * solarMass} in [f {pos = {x = 0.0, y = 0.0, z = 0.0}, vel = {x = 0.0, y = 0.0, z = 0.0}, mass = 1.0}, ...] end For this implementation, at the end of the SSA2 optimization passes, the 'body list' type is represented by the SSA2 type: list_10 = nil_6 of (unit) | ::_10 of ((real64 ref * real64 ref * real64 ref) * (real64 ref * real64 ref * real64 ref) * real64 * list_10) Thus, a cons cell is represented by an object with 4 fields; the first field is a pointer to a 3-field object (each field of which is a mutable real64); the second field is a pointer to a 3-field object (each field of which is a mutable real64); the third field is a real64; the fourth field is a list_11. Hence, the body record type has been 'inlined' into the specific list type. Also, the deepFlatten pass has 'pushed' the mutability of the vel field into the three components of a Vec3D.t and 'pushed' the mutability of the pos field into the three components of a Vec3D.t. With this implementation, the line ; #pos a := pos a |+| vel a |* dt ends up not allocating anything, simply mutating each of the three components. Both of these SSA2 type representations are established after the deepFlatten pass (but before the refFlatten pass). In both, the refFlatten pass does nothing (with respect to the 'body list' type). This raises a couple of issues, and I don't have a complete understanding of either of them. First, it isn't clear why, in the original implementation, the 'vel' field was deepFlatten-ed but the 'pos' field was not deepFlatten-ed. There doesn't seem to be a significant difference in the way the fields are initialized and used in the program. [That this was an issue didn't occur to me earlier, Vesa and I didn't discuss it, and I don't have anything more to say on it, other than it is an interesting question, and perhaps the best way of achieving the desired allocation behavior.] Second, it wasn't clear why, in the original implementation, the 'pos' field wasn't subject to being refFlatten-ed (given that the 'pos' field wasn't deepFlatten-ed). That is, we might expect the refFlatten pass to take list_11 = nil_7 of (unit) | ::_11 of ((real64 ref * real64 ref * real64 ref) * ((real64 * real64 * real64) ref) * real64 * list_11) to list_11 = nil_7 of (unit) | ::_11 of ((real64 ref * real64 ref * real64 ref) * (real64 * real64 * real64) ref * real64 * list_11) In retrospect, this wouldn't actually change the allocation behavior of the program significantly. It would save one indirection (and one allocation at the allocation of the 'system' value), but it would still require one tuple allocation for each iteration. Nonetheless, it is interesting to note why the reference isn't flattened into the containing data structure. One requirement for the refFlatten optimization is that the 'ref' cell allocation occurs in the same block as the containing object allocation. With regards to the original code above, this means that for each ::_11 allocation, there should be a 'ref' cell allocation. Looking at the SSA2 IL code for the program into the refFlatten pass, we see two ::_11 allocations: L_271 (x_339: list_11, x_338: (real64 * real64 * real64), x_337: (real64 * real64 * real64), x_336: real64, x_335: list_10) x_342: ((real64 * real64 * real64) ref) = (x_337) x_2396: real64 = #2 x_338 x_2397: real64 = #1 x_338 x_2398: real64 = #0 x_338 x_347: real64 = Real64_mul (global_171, x_2398) x_346: real64 = Real64_mul (global_171, x_2397) x_345: real64 = Real64_mul (global_171, x_2396) x_343: (real64 ref * real64 ref * real64 ref) = (x_347, x_346, x_345) x_341: real64 = Real64_mul (x_333, x_336) x_2158: ::_11 of ((real64 ref * real64 ref * real64 ref) * ((real64 * real64 * real64) ref) * real64 * list_11) = ::_11 (x_343, x_342, x_341, x_339) x_340: list_11 = x_2158: list_11 case x_335 of nil_6 => L_273 | ::_3 => L_1326 and L_274 (x_355: list_11, x_354: (real64 ref * real64 ref * real64 ref), x_353: ((real64 * real64 * real64) ref), x_352: real64, x_351: list_11) x_2164: ::_11 of ((real64 ref * real64 ref * real64 ref) * ((real64 * real64 * real64) ref) * real64 * list_11) = ::_11 (x_354, x_353, x_352, x_355) x_356: list_11 = x_2164: list_11 case x_351 of nil_7 => L_276 | ::_11 => L_1327 The first ::_11 allocation is the obvious one, that includes the ref cell allocation (x_342 = (x_337)). The second ::_11 allocation does not include a ref cell allocation, so the 'pos' field is not susceptible to refFlatten-ing. The second ::_11 allocation corresponds to a list reversal that is part of the implementation of map. [A list reversal is used so that map is implemented by two tail-recursive loops, rather than one non-tail-recursive loop.] In general, one can't perform the refFlatten-ing optimization in this case. Indeed, the ref cells are copied from the old list to the new list. Unless we 'know' that the old list is dead, we can't flatten for fear of breaking sharing. (Someone might have a copy of the old list, and we need to see the same references.) We can change the implementation to the following: val system : body list = let val f = fn {pos, vel, mass} => {pos = ref pos, vel = ref (vel |* daysPerYear), mass = mass * solarMass} in foldr (fn (x,l) => (f x)::l) [] [{pos = {x = 0.0, y = 0.0, z = 0.0}, vel = {x = 0.0, y = 0.0, z = 0.0}, mass = 1.0}, ...] end There is still a list reversal as part of the implementation of foldr, but now the input list is reversed. (One can't take this as the implementation of map, because it would apply the mapped function to the elements in the wrong order.) The deepFlatten pass still yields the same list_11 representation for the 'body list' type, but now the SSA2 program into the refFlatten pass has exactly one ::_11 allocation (nearly alpha-equivalent to the L_271 block above). However, the 'pos' field is still not subject to being refFlatten-ed. Another (syntactic) requirement of the refFlatten optimization is that the ref cell updates occur in the same block as the selection of the ref cell from its containing data-structure. This requirement is violated in the SSA2 program: L_389 (x_567: (real64 ref * real64 ref * real64 ref), x_566: ((real64 * real64 * real64) ref), x_565: list_10) x_2456: (real64 * real64 * real64) = #0 x_566 x_2457: real64 = #2 x_567 x_2458: real64 = #1 x_567 x_2459: real64 = #0 x_567 x_576: real64 = Real64_mul (x_2459, global_175) x_575: real64 = Real64_mul (x_2458, global_175) x_573: real64 = Real64_mul (x_2457, global_175) x_2453: real64 = #2 x_2456 x_2454: real64 = #1 x_2456 x_2455: real64 = #0 x_2456 x_571: real64 = Real64_add (x_2455, x_576) x_570: real64 = Real64_add (x_575, x_2454) x_569: real64 = Real64_add (x_573, x_2453) x_568: (real64 * real64 * real64) = (x_571, x_570, x_569) x_566 := x_568 case x_565 of nil_7 => L_391 | ::_11 => L_1357 The 'pos' field reference is x_566, which is passed as a first-class value into the L_389 block, rather than selected from the cons cell within the block. The motivation for this syntactic requirement is to both make it easy to find the containing data structure and to ensure that the flattening of the ref cell into the containing data structure does not violate space safety. Neither of the syntactic requirements are semantic requirements -- there exist programs that violate that the syntactic requirements that could still validly optimized. For example, consider the following common idiom: val r = {a = ref (f x), b = ref (g y)} If the code for g is not inlined (or includes a loop), then the ref cell allocated for the a field will be separated from the allocation of the record, because, in A-normal form, this code looks like: val r = let val fx = f x val ar = ref fx val gy = g y (* point of possible block breaking *) val br = ref gy val r = {a = ar, b = br} in r end One can restore the syntactic requirement in the above code as: val r = let val fx = f x val gy = g y in {a = ref fx, b = ref gy} end The latter requirement is also commonly violated by programs, because, as in the above code, components of a datatype variant are passed as first-class values and, thus, become separated from their selection from the containing data structure. In any case, it seems clear that there are ways of improving the deepFlatten and refFlatten optimizations. Not to mention the fact that there have been other reports of deepFlatten and refFlatten requiring inordinate amounts of memory for large input programs. _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
| Free Forum Powered by Nabble | Forum Help |