|
View:
New views
5 Messages
—
Rating Filter:
Alert me
|
|
|
Constant Folding of FP OperationsAttached is an experimental patch that improves MLton's constant folding
of floating point operations. The problem with constant folding FP operations in SML is that the FP ops are subject to rounding mode settings: http://www.standardml.org/Basis/ieee-float.html#SIG:IEEE_REAL.setRoundingMode:VAL The workaround used in the patch is to evaluate the operations in all rounding modes (actually in only TO_NEGINF and TO_POSINF) and check that the results agree. This ensures that constant folding is correct in all rounding modes. Personally I'm inclined to think that the inclusion of get/setRoundingMode in the Basis library is a design mistake, because it makes some FP optimizations very difficult. An alternative design would be to have multiple modules, say Real[ToNearest], RealToNegInf, RealToPosInf, and RealToZero, each corresponding to a specific rounding mode. FP ops could then be treated as functional and it would then likely become straightforward to make the same kind of aggressive constant folding and simplification of FP ops as is done for integer ops. Of course, this would introduce the need for the compiler to insert code to switch the rounding modes when necessary, but this doesn't seem extremely difficult. About the patch. The compiler changes do not compile with SML/NJ, which does not support 32-bit floating point (Real32), and it should be made so that the FP simplifications aren't performed (and would be replaced with dummy functions) when the compiler has been compiled with SML/NJ. This is the reason why I consider this an experimental patch and haven't committed it. Some of the floating point simplifications should probably also be disabled when cross compiling. These should not be very difficult to do, but it seems that I'm unlikely get around to implementing them. The "decon" datatype has some constructors that aren't currently used (although they are produced by the decon function) in the patch. I planned those constructors for some optimizations (e.g. replace divide by a power of 2 with a multiply) that aren't currently possible with MLton's Prim.apply, because it does not allow the introduction of an operation with constants in the result of Prim.apply. This restriction of Prim.apply could probably be lifted without huge problems. The limitation is probably due to the design of the SSA and SSA2 intermediate languages of MLton that do not allow constants as operands. The purpose of the changes to real.sml in the basis library implementation is to expose opportunities for constant folding. Calls to the "class" function cannot be optimized away (by the current compiler). -Vesa Karvonen [real-cf.patch] Index: mlton/atoms/real-x.sig =================================================================== --- mlton/atoms/real-x.sig (revision 6625) +++ mlton/atoms/real-x.sig (working copy) @@ -10,6 +10,7 @@ signature REAL_X_STRUCTS = sig structure RealSize: REAL_SIZE + structure WordX: WORD_X end signature REAL_X = @@ -19,11 +20,46 @@ (* reals of all RealSize.t sizes. *) type t + datatype decon = + NAN + | ZERO of {signBit: bool} + | ONE of {signBit: bool} + | POW2 of {signBit: bool, exp: int} (* man = 0.5 *) + | FIN of {signBit: bool, exp: int, man: t} + | INF of {signBit: bool} + + val abs: t -> t + val acos: t -> t option + val add: t * t -> t option + val asin: t -> t option + val atan2: t * t -> t option + val atan: t -> t option + val castFromWord: WordX.t -> t + val castToWord: t -> WordX.t + val cos: t -> t option + val decon: t -> decon + val div: t * t -> t option + val equal: t * t -> bool val equals: t * t -> bool + val exp: t -> t option + val fromIntInf: IntInf.t * RealSize.t -> t option val hash: t -> word val layout: t -> Layout.t + val le: t * t -> bool + val ln: t -> t option + val log10: t -> t option + val lt: t * t -> bool val make: string * RealSize.t -> t option + val mul: t * t -> t option + val muladd: t * t * t -> t option + val mulsub: t * t * t -> t option + val neg: t -> t + val qequal: t * t -> bool + val sin: t -> t option val size: t -> RealSize.t + val sqrt: t -> t option + val sub: t * t -> t option + val tan: t -> t option val toString: t -> string val zero: RealSize.t -> t end Index: mlton/atoms/atoms.fun =================================================================== --- mlton/atoms/atoms.fun (revision 6625) +++ mlton/atoms/atoms.fun (working copy) @@ -24,8 +24,9 @@ structure Con = Con () structure CType = CType (structure RealSize = RealSize structure WordSize = WordSize) - structure RealX = RealX (structure RealSize = RealSize) structure WordX = WordX (structure WordSize = WordSize) + structure RealX = RealX (structure RealSize = RealSize + structure WordX = WordX) structure WordXVector = WordXVector (structure WordSize = WordSize structure WordX = WordX) structure Func = Index: mlton/atoms/real-x.fun =================================================================== --- mlton/atoms/real-x.fun (revision 6625) +++ mlton/atoms/real-x.fun (working copy) @@ -70,4 +70,188 @@ val hash = String.hash o toString +local + fun make (o32, o64) = + fn Real32 x => Real32 (o32 x) + | Real64 x => Real64 (o64 x) +in + val neg = make (Real32.~, Real64.~) + val abs = make (Real32.abs, Real64.abs) end + +structure P = Pervasive +structure PR32 = P.Real32 +structure PR64 = P.Real64 +structure PIR = P.IEEEReal + +datatype 'r r = + R of {zero: 'r, half: 'r, one: 'r, inf: 'r, abs: 'r -> 'r, + signBit: 'r -> bool, isNan: 'r -> bool, + toManExp: 'r -> {exp: int, man: 'r}, + compareReal: 'r * 'r -> PIR.real_order, + bits: Bits.t, + subVec: P.Word8Vector.vector * int -> P.LargeWord.word, + update: P.Word8Array.array * int * P.LargeWord.word -> unit, + toBytes: 'r -> P.Word8Vector.vector, + subArr: P.Word8Array.array * int -> 'r, + tag: 'r -> t} + +val r32 = + R {zero = 0.0, half = 0.5, one = 1.0, inf = PR32.posInf, + abs = PR32.abs, signBit = PR32.signBit, isNan = PR32.isNan, + toManExp = PR32.toManExp, compareReal = PR32.compareReal, + bits = Bits.inWord32, + subVec = P.PackWord32Little.subVec, + update = P.PackWord32Little.update, + toBytes = P.PackReal32Little.toBytes, + subArr = P.PackReal32Little.subArr, + tag = Real32} +val r64 = + R {zero = 0.0, half = 0.5, one = 1.0, inf = PR64.posInf, + abs = PR64.abs, signBit = PR64.signBit, isNan = PR64.isNan, + toManExp = PR64.toManExp, compareReal = PR64.compareReal, + bits = Bits.inWord64, + subVec = P.PackWord64Little.subVec, + update = P.PackWord64Little.update, + toBytes = P.PackReal64Little.toBytes, + subArr = P.PackReal64Little.subArr, + tag = Real64} + +local + fun doit (R {compareReal, signBit, isNan, tag, ...}) (f, arg) = + case !Control.target of + Control.Cross _ => NONE + | Control.Self => let + val old = PIR.getRoundingMode () + val () = PIR.setRoundingMode PIR.TO_NEGINF + val min = f arg + val () = PIR.setRoundingMode PIR.TO_POSINF + val max = f arg + val () = PIR.setRoundingMode old + in + if PIR.EQUAL = compareReal (min, max) + andalso signBit min = signBit max + orelse isNan min andalso isNan max + then SOME (tag min) + else NONE + end + + fun make1 (o32, o64) = + fn Real32 x => doit r32 (o32, x) + | Real64 x => doit r64 (o64, x) + + fun make2 (o32, o64) = + fn (Real32 x, Real32 y) => doit r32 (o32, (x, y)) + | (Real64 x, Real64 y) => doit r64 (o64, (x, y)) + | _ => Error.bug "impossible" + + fun make3 (o32, o64) = + fn (Real32 x, Real32 y, Real32 z) => doit r32 (o32, (x, y, z)) + | (Real64 x, Real64 y, Real64 z) => doit r64 (o64, (x, y, z)) + | _ => Error.bug "impossible" +in + val acos = make1 (PR32.Math.acos, PR64.Math.acos) + val asin = make1 (PR32.Math.asin, PR64.Math.asin) + val atan = make1 (PR32.Math.atan, PR64.Math.atan) + val atan2 = make2 (PR32.Math.atan2, PR64.Math.atan2) + val cos = make1 (PR32.Math.cos, PR64.Math.cos) + val exp = make1 (PR32.Math.exp, PR64.Math.exp) + val ln = make1 (PR32.Math.ln, PR64.Math.ln) + val log10 = make1 (PR32.Math.log10, PR64.Math.log10) + val sin = make1 (PR32.Math.sin, PR64.Math.sin) + val sqrt = make1 (PR32.Math.sqrt, PR64.Math.sqrt) + val tan = make1 (PR32.Math.tan, PR64.Math.tan) + + val add = make2 (PR32.+, PR64.+) + val op div = make2 (PR32./, PR64./) + val mul = make2 (PR32.*, PR64.* ) + val sub = make2 (PR32.-, PR64.-) + + val muladd = make3 (PR32.*+, PR64.*+) + val mulsub = make3 (PR32.*-, PR64.*-) + + fun fromIntInf (i, s) = + case s of + R32 => doit r32 (Real32.fromIntInf, i) + | R64 => doit r64 (Real64.fromIntInf, i) +end + +local + fun make (o32, o64) = + fn (Real32 r1, Real32 r2) => o32 (r1, r2) + | (Real64 r1, Real64 r2) => o64 (r1, r2) + | _ => Error.bug "impossible" +in + val equal = make (PR32.==, PR64.==) + val le = make (PR32.<=, PR64.<=) + val lt = make (PR32.<, PR64.<) + val qequal = make (PR32.?=, PR64.?=) +end + +datatype decon = + NAN + | ZERO of {signBit: bool} + | ONE of {signBit: bool} + | POW2 of {signBit: bool, exp: Int.t} (* man = 0.5 *) + | FIN of {signBit: bool, exp: Int.t, man: t} + | INF of {signBit: bool} + +local + fun doit (R {zero, half, one, inf, abs, signBit, isNan, toManExp, + compareReal, tag, ...}) + value = + if isNan value + then NAN + else let + val signBit = signBit value + val absValue = abs value + in + if PIR.EQUAL = compareReal (zero, absValue) + then ZERO {signBit = signBit} + else if PIR.EQUAL = compareReal (one, absValue) + then ONE {signBit = signBit} + else if PIR.EQUAL = compareReal (inf, absValue) + then INF {signBit = signBit} + else let + val {man, exp} = toManExp absValue + in + if PIR.EQUAL = compareReal (half, man) + then POW2 {signBit = signBit, exp = exp} + else FIN {signBit = signBit, exp = exp, man = tag man} + end + end +in + val decon = + fn Real32 x => doit r32 x + | Real64 x => doit r64 x +end + +local + fun doit (R {bits, toBytes, subVec, ...}) x = + WordX.fromIntInf + (P.LargeWord.toLargeInt (subVec (toBytes x, 0)), + WordX.WordSize.fromBits bits) +in + val castToWord = + fn Real32 x => doit r32 x + | Real64 x => doit r64 x +end + +local + fun doit (R {bits, update, subArr, tag, ...}) w = let + val a = P.Word8Array.array (Bytes.toInt (Bits.toBytes bits), 0w0) + in + update (a, 0, P.LargeWord.fromLargeInt (WordX.toIntInf w)) + ; tag (subArr (a, 0)) + end +in + fun castFromWord w = + if WordX.WordSize.bits (WordX.size w) = Bits.inWord32 then + doit r32 w + else if WordX.WordSize.bits (WordX.size w) = Bits.inWord64 then + doit r64 w + else + Error.bug "Invalid word size" +end + +end Index: mlton/atoms/sources.cm =================================================================== --- mlton/atoms/sources.cm (revision 6625) +++ mlton/atoms/sources.cm (working copy) @@ -52,10 +52,10 @@ (* Windows doesn't like files named con, so use con- instead. *) con-.sig con-.fun +word-x.sig +word-x.fun real-x.sig real-x.fun -word-x.sig -word-x.fun word-x-vector.sig word-x-vector.fun c-type.sig Index: mlton/atoms/sources.mlb =================================================================== --- mlton/atoms/sources.mlb (revision 6625) +++ mlton/atoms/sources.mlb (working copy) @@ -16,10 +16,10 @@ (* Windows doesn't like files named con, so use con- instead. *) con-.sig con-.fun + word-x.sig + word-x.fun real-x.sig real-x.fun - word-x.sig - word-x.fun word-x-vector.sig word-x-vector.fun c-type.sig Index: mlton/atoms/prim.fun =================================================================== --- mlton/atoms/prim.fun (revision 6625) +++ mlton/atoms/prim.fun (working copy) @@ -21,6 +21,7 @@ local open Const in + structure RealX = RealX structure WordX = WordX structure WordXVector = WordXVector end @@ -1277,6 +1278,12 @@ else ApplyResult.Const (Const.intInf ii) val intInfConst = intInf o IntInf.fromInt val null = ApplyResult.Const Const.null + fun real (r: RealX.t): ('a, 'b) ApplyResult.t = + ApplyResult.Const (Const.real r) + fun realNeg (s, x): ('a, 'b) ApplyResult.t = + ApplyResult.Apply (Real_neg s, [x]) + fun realOpt (ro: RealX.t option): ('a, 'b) ApplyResult.t = + case ro of NONE => ApplyResult.Unknown | SOME r => real r fun word (w: WordX.t): ('a, 'b) ApplyResult.t = ApplyResult.Const (Const.word w) val f = ApplyResult.falsee @@ -1394,6 +1401,38 @@ seqIndexConst (IntInf.fromInt (WordXVector.length v)) | (Vector_sub, [WordVector v, Word i]) => word (WordXVector.sub (v, WordX.toInt i)) + | (Real_neg _, [Real r]) => real (RealX.neg r) + | (Real_abs _, [Real r]) => real (RealX.abs r) + | (Real_Math_acos _, [Real r]) => realOpt (RealX.acos r) + | (Real_Math_asin _, [Real r]) => realOpt (RealX.asin r) + | (Real_Math_atan _, [Real r]) => realOpt (RealX.atan r) + | (Real_Math_atan2 _, [Real r1, Real r2]) => + realOpt (RealX.atan2 (r1, r2)) + | (Real_Math_cos _, [Real r]) => realOpt (RealX.cos r) + | (Real_Math_exp _, [Real r]) => realOpt (RealX.exp r) + | (Real_Math_ln _, [Real r]) => realOpt (RealX.ln r) + | (Real_Math_log10 _, [Real r]) => realOpt (RealX.log10 r) + | (Real_Math_sin _, [Real r]) => realOpt (RealX.sin r) + | (Real_Math_sqrt _, [Real r]) => realOpt (RealX.sqrt r) + | (Real_Math_tan _, [Real r]) => realOpt (RealX.tan r) + | (Real_add _, [Real r1, Real r2]) => realOpt (RealX.add (r1, r2)) + | (Real_div _, [Real r1, Real r2]) => realOpt (RealX.div (r1, r2)) + | (Real_mul _, [Real r1, Real r2]) => realOpt (RealX.mul (r1, r2)) + | (Real_sub _, [Real r1, Real r2]) => realOpt (RealX.sub (r1, r2)) + | (Real_muladd _, [Real r1, Real r2, Real r3]) => + realOpt (RealX.muladd (r1, r2, r3)) + | (Real_mulsub _, [Real r1, Real r2, Real r3]) => + realOpt (RealX.mulsub (r1, r2, r3)) + | (Real_equal _, [Real r1, Real r2]) => bool (RealX.equal (r1, r2)) + | (Real_le _, [Real r1, Real r2]) => bool (RealX.le (r1, r2)) + | (Real_lt _, [Real r1, Real r2]) => bool (RealX.lt (r1, r2)) + | (Real_qequal _, [Real r1, Real r2]) => bool (RealX.qequal (r1, r2)) + | (Real_castToWord _, [Real r]) => word (RealX.castToWord r) + | (Word_castToReal _, [Word w]) => real (RealX.castFromWord w) + | (Word_rndToReal (_, s, {signed}), [Word w]) => + realOpt + (RealX.fromIntInf + (if signed then WordX.toIntInfX w else WordX.toIntInf w, s)) | (Word_add _, [Word w1, Word w2]) => word (WordX.add (w1, w2)) | (Word_addCheck s, [Word w1, Word w2]) => wcheck (op +, s, w1, w2) | (Word_andb _, [Word w1, Word w2]) => word (WordX.andb (w1, w2)) @@ -1490,6 +1529,38 @@ else Unknown | _ => Unknown end handle Exn.Overflow => Unknown + fun varReal (x, r, inOrder) = + let + datatype z = datatype RealX.decon + datatype z = datatype ApplyResult.t + fun negIf (s, signBit) = + if signBit then realNeg (s, x) else Var x + in + case RealX.decon r of + ZERO _ => Unknown + | ONE {signBit} => + (case p of + Real_mul s => negIf (s, signBit) + | Real_div s => if inOrder + then negIf (s, signBit) + else Unknown + | _ => Unknown) + | NAN => + (case p of + Real_Math_atan2 _ => real r + | Real_add _ => real r + | Real_div _ => real r + | Real_mul _ => real r + | Real_sub _ => real r + | Real_equal _ => bool false + | Real_qequal _ => bool true + | Real_le _ => bool false + | Real_lt _ => bool false + | _ => Unknown) + | POW2 _ => Unknown + | INF _ => Unknown + | FIN _ => Unknown + end fun varWord (x, w, inOrder) = let val zero = word o WordX.zero @@ -1627,6 +1698,8 @@ else bool true else bool false else Unknown + | (_, [Var x, Const (Real r)]) => varReal (x, r, true) + | (_, [Const (Real r), Var x]) => varReal (x, r, false) | (_, [Var x, Const (Word i)]) => varWord (x, i, true) | (_, [Const (Word i), Var x]) => varWord (x, i, false) | (_, [Const (IntInf i1), Const (IntInf i2), _]) => Index: mlton/atoms/const.sig =================================================================== --- mlton/atoms/const.sig (revision 6625) +++ mlton/atoms/const.sig (working copy) @@ -13,7 +13,7 @@ structure RealX: REAL_X structure WordX: WORD_X structure WordXVector: WORD_X_VECTOR - sharing WordX = WordXVector.WordX + sharing WordX = RealX.WordX = WordXVector.WordX end signature CONST = Index: mlton/codegen/c-codegen/c-codegen.fun =================================================================== --- mlton/codegen/c-codegen/c-codegen.fun (revision 6625) +++ mlton/codegen/c-codegen/c-codegen.fun (working copy) @@ -45,12 +45,19 @@ fun toC (r: t): string = let - (* The only difference between SML reals and C floats/doubles is that + (* The main difference between SML reals and C floats/doubles is that * SML uses "~" while C uses "-". *) val s = String.translate (toString r, fn #"~" => "-" | c => String.fromChar c) + (* Also, inf is spelled INFINITY and nan is NAN in C. *) + val s = + case s of + "-inf" => "-INFINITY" + | "inf" => "INFINITY" + | "nan" => "NAN" + | other => other in case size r of R32 => concat ["(Real32)", s] Index: basis-library/real/real.sml =================================================================== --- basis-library/real/real.sml (revision 6625) +++ basis-library/real/real.sml (working copy) @@ -116,10 +116,7 @@ | _ => if signBit x then ~x else x fun isFinite r = - case class r of - INF => false - | NAN => false - | _ => true + abs r <= maxFinite val op == = Prim.== @@ -153,10 +150,10 @@ else y fun sign (x: real): int = - case class x of - NAN => raise Domain - | ZERO => 0 - | _ => if x > zero then 1 else ~1 + if x > zero then 1 + else if x < zero then ~1 + else if x == zero then 0 + else raise Domain fun sameSign (x, y) = signBit x = signBit y @@ -266,10 +263,9 @@ val realMod = #frac o split fun checkFloat x = - case class x of - INF => raise Overflow - | NAN => raise Div - | _ => x + if isFinite x then x + else if isNan x then raise Div + else raise Overflow fun roundReal (x: real, m: rounding_mode): real = IEEEReal.withRoundingMode (m, fn () => R.round x) @@ -623,61 +619,72 @@ minInt': 'a, precision': int}} = (fromIntUnsafe, - if Int.< (precision, #precision' other) - then let - val maxInt' = #maxInt' other - val minInt' = #minInt' other - (* maxInt can't be represented exactly. *) - (* minInt can be represented exactly. *) - val (maxInt,minInt) = - IEEEReal.withRoundingMode - (TO_ZERO, fn () => (fromIntUnsafe maxInt', - fromIntUnsafe minInt')) - in - fn (m: rounding_mode) => fn x => - case class x of - INF => raise Overflow - | NAN => raise Domain - | _ => if minInt <= x andalso x <= maxInt - then toIntUnsafe (roundReal (x, m)) - else raise Overflow - end - else let - val maxInt' = #maxInt' other - val minInt' = #minInt' other - val maxInt = fromIntUnsafe maxInt' - val minInt = fromIntUnsafe minInt' - in - fn (m: rounding_mode) => fn x => - case class x of - INF => raise Overflow - | NAN => raise Domain - | _ => if minInt <= x - then if x <= maxInt - then toIntUnsafe (roundReal (x, m)) - else if x < maxInt + one - then (case m of - TO_NEGINF => maxInt' - | TO_POSINF => raise Overflow - | TO_ZERO => maxInt' - | TO_NEAREST => - (* Depends on maxInt being odd. *) - if x - maxInt >= half - then raise Overflow - else maxInt') - else raise Overflow - else if x > minInt - one - then (case m of - TO_NEGINF => raise Overflow - | TO_POSINF => minInt' - | TO_ZERO => minInt' - | TO_NEAREST => - (* Depends on minInt being even. *) - if x - minInt < ~half - then raise Overflow - else minInt') - else raise Overflow - end) + if Int.< (precision, #precision' other) then + let + val maxInt' = #maxInt' other + val minInt' = #minInt' other + (* maxInt can't be represented exactly. *) + (* minInt can be represented exactly. *) + val (maxInt,minInt) = + IEEEReal.withRoundingMode + (TO_ZERO, fn () => (fromIntUnsafe maxInt', + fromIntUnsafe minInt')) + in + fn (m: rounding_mode) => fn x => + if minInt <= x then + if x <= maxInt then + toIntUnsafe (roundReal (x, m)) + else + raise Overflow + else + if x < minInt then + raise Overflow + else + raise Domain (* NaN *) + end + else + let + val maxInt' = #maxInt' other + val minInt' = #minInt' other + val maxInt = fromIntUnsafe maxInt' + val minInt = fromIntUnsafe minInt' + in + fn (m: rounding_mode) => fn x => + if minInt <= x then + if x <= maxInt then + toIntUnsafe (roundReal (x, m)) + else + if x < maxInt + one then + (case m of + TO_NEGINF => maxInt' + | TO_POSINF => raise Overflow + | TO_ZERO => maxInt' + | TO_NEAREST => + (* Depends on maxInt being odd. *) + if x - maxInt >= half then + raise Overflow + else + maxInt') + else + raise Overflow + else + if x < minInt then + if minInt - one < x then + (case m of + TO_NEGINF => raise Overflow + | TO_POSINF => minInt' + | TO_ZERO => minInt' + | TO_NEAREST => + (* Depends on minInt being even. *) + if x - minInt < ~half then + raise Overflow + else + minInt') + else + raise Overflow + else + raise Domain (* NaN *) + end) in val (fromInt8,toInt8) = make {fromIntUnsafe = R.fromInt8Unsafe, Index: lib/mlton/pervasive/pervasive.sml =================================================================== --- lib/mlton/pervasive/pervasive.sml (revision 6625) +++ lib/mlton/pervasive/pervasive.sml (working copy) @@ -31,6 +31,10 @@ structure Math = Math structure Option = Option structure OS = OS + structure PackReal32Little = PackReal32Little + structure PackReal64Little = PackReal64Little + structure PackWord32Little = PackWord32Little + structure PackWord64Little = PackWord64Little structure Position = Position structure Posix = Posix structure Real = Real @@ -47,9 +51,11 @@ structure Vector = Vector structure Word = Word structure Word32 = Word32 + structure Word64 = Word64 structure Word8 = Word8 structure Word16 = Word16 structure Word8Array = Word8Array + structure Word8Vector = Word8Vector type unit = General.unit type int = Int.int _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
|
|
Re: Constant Folding of FP OperationsOn Fri, 23 May 2008, Vesa Karvonen wrote:
> Attached is an experimental patch that improves MLton's constant folding > of floating point operations. The problem with constant folding FP > operations in SML is that the FP ops are subject to rounding mode > settings: > > http://www.standardml.org/Basis/ieee-float.html#SIG:IEEE_REAL.setRoundingMode:VAL > > The workaround used in the patch is to evaluate the operations in all > rounding modes (actually in only TO_NEGINF and TO_POSINF) and check that > the results agree. This ensures that constant folding is correct in all > rounding modes. Seems like a good optimization, and a sound one. Did you run the benchmarks and observe any speedups? > Personally I'm inclined to think that the inclusion of get/setRoundingMode > in the Basis library is a design mistake, because it makes some FP > optimizations very difficult. FP optimizations seem to be very difficult in general. It is particularly difficult because IEEE 754 says nothing about many floating-point operations; for example, sin/cos/tan are not specified to follow an implicit or explicit rounding mode. > An alternative design would be to have > multiple modules, say Real[ToNearest], RealToNegInf, RealToPosInf, and > RealToZero, each corresponding to a specific rounding mode. FP ops could > then be treated as functional and it would then likely become > straightforward to make the same kind of aggressive constant folding and > simplification of FP ops as is done for integer ops. Of course, this > would introduce the need for the compiler to insert code to switch the > rounding modes when necessary, but this doesn't seem extremely difficult. The C-- specification of floating-point expressions makes the rounding mode an explicit argument; e.g.: fsub \forall \alpha. bits \alpha * bits \alpha * bits 2 -> bits \alpha Floating-point subtract with explicit rounding mode. round_down bits2 IEEE 754 rounding mode for rounding down. The C-- specification of floating-point expressions also does not include any operations not specified by the IEEE 754 standard; that is, C-- does not provide trancendentals, etc. Instead, one must call the system's math library. > About the patch. The compiler changes do not compile with SML/NJ, which > does not support 32-bit floating point (Real32), and it should be made so > that the FP simplifications aren't performed (and would be replaced with > dummy functions) when the compiler has been compiled with SML/NJ. You might consider dynamically checking the value of Real32.precision, which would allow you to distinguish a "real" Real32 from a "dummy" Real32 that is bound to Real64. > This is > the reason why I consider this an experimental patch and haven't committed > it. Some of the floating point simplifications should probably also be > disabled when cross compiling. These should not be very difficult to do, > but it seems that I'm unlikely get around to implementing them. There also seems to be a little-endian assumption for constant-folding castToWord and castFromWord. It should be easy enough to use: val r64 = R {... subVec = fn arg => if Control.Target.bigEndian then P.PackWord64Big.subVec arg else P.PackWord64Little.subVec arg, ...} > The "decon" datatype has some constructors that aren't currently used > (although they are produced by the decon function) in the patch. I > planned those constructors for some optimizations (e.g. replace divide by > a power of 2 with a multiply) that aren't currently possible with MLton's > Prim.apply, because it does not allow the introduction of an operation > with constants in the result of Prim.apply. This restriction of > Prim.apply could probably be lifted without huge problems. The limitation > is probably due to the design of the SSA and SSA2 intermediate languages > of MLton that do not allow constants as operands. It shouldn't be difficult to allow Prim.apply to return more complicated expressions. There aren't that many (static) uses of Prim.apply in the compiler; you would simply need to allow uses of Prim.apply to yield a sequence of IL statements, so that you could bind constants to variables before using them in the IL primitive. With the exception of the caveats noted above (SML/NJ builds, cross-compiling, and target endiannes), I think the patch looks good. _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
|
|
Re: Constant Folding of FP OperationsOn Sun, 1 Jun 2008, Matthew Fluet wrote:
> On Fri, 23 May 2008, Vesa Karvonen wrote: >> Attached is an experimental patch that improves MLton's constant folding >> of floating point operations. The problem with constant folding FP >> operations in SML is that the FP ops are subject to rounding mode >> settings: >> >> http://www.standardml.org/Basis/ieee-float.html#SIG:IEEE_REAL.setRoundingMode:VAL >> >> The workaround used in the patch is to evaluate the operations in all >> rounding modes (actually in only TO_NEGINF and TO_POSINF) and check that >> the results agree. This ensures that constant folding is correct in all >> rounding modes. > > Seems like a good optimization, and a sound one. Did you run the benchmarks > and observe any speedups? On amd64-linux, I get the following: MLton0 -- ~/devel/mlton/mlton.svn.trunk/build/bin/mlton -codegen amd64 MLton1 -- ~/devel/mlton/mlton.svn.trunk/build/bin/mlton -codegen c MLton2 -- ~/devel/mlton/mlton.svn.trunk/build.real-cf/bin/mlton -codegen amd64 MLton3 -- ~/devel/mlton/mlton.svn.trunk/build.real-cf/bin/mlton -codegen c run time ratio benchmark MLton0 MLton1 MLton2 MLton3 barnes-hut 1.00 1.04 0.84 0.88 boyer 1.00 1.12 1.03 1.12 checksum 1.00 5.72 1.01 5.72 count-graphs 1.00 0.92 1.01 0.92 DLXSimulator 1.00 1.08 0.98 1.06 fft 1.00 1.05 1.00 1.06 fib 1.00 1.40 1.04 1.40 flat-array 1.00 1.60 0.99 1.61 hamlet 1.00 1.65 1.00 1.68 imp-for 1.00 1.46 0.99 1.47 knuth-bendix 1.00 1.36 1.00 1.36 lexgen 1.00 1.05 1.00 1.04 life 1.00 1.00 1.00 1.00 logic 1.00 1.08 1.00 1.09 mandelbrot 1.00 1.36 0.96 1.35 matrix-multiply 1.00 0.91 1.00 0.91 md5 1.00 6.94 1.00 6.94 merge 1.00 1.05 1.00 1.05 mlyacc 1.00 1.08 1.00 1.08 model-elimination 1.00 1.25 0.99 1.26 mpuz 1.00 1.93 1.00 1.93 nucleic 1.00 1.00 0.99 0.99 output1 1.00 1.23 1.00 1.24 peek 1.00 0.91 1.05 0.92 psdes-random 1.00 0.83 0.99 0.83 ratio-regions 1.00 1.21 1.00 1.33 ray 1.00 1.06 0.94 1.05 raytrace 1.00 1.13 0.96 1.05 simple 1.00 1.40 1.06 1.42 smith-normal-form 1.00 1.01 1.01 1.01 tailfib 1.00 1.87 1.06 1.87 tak 1.00 1.14 1.00 1.13 tensor 1.00 1.57 1.01 1.57 tsp 1.00 1.01 1.00 1.01 tyan 1.00 1.18 1.02 1.18 vector-concat 1.00 1.08 1.00 1.07 vector-rev 1.00 1.40 1.00 1.47 vliw 1.00 1.35 1.01 1.38 wc-input1 1.00 1.00 1.00 1.00 wc-scanStream 1.00 1.21 0.99 1.21 zebra 1.00 0.79 1.00 0.79 zern 1.00 1.62 1.00 1.64 This seems to only show a non-noise speedup on barnes-hut (and maybe ray). I'd really like to know a good way of cutting down the noise in the benchmarks; consider that fib and tailfib (which use no FP operations, and so yield identical assembly code) show a 1.04 and 1.06 slowdown, respectively. The rest of the benchmark data follows: size benchmark MLton0 MLton1 MLton2 MLton3 barnes-hut 165,614 170,237 167,663 172,286 boyer 218,529 220,105 218,529 220,105 checksum 98,257 105,473 98,257 105,473 count-graphs 124,401 127,073 124,401 127,073 DLXSimulator 201,324 210,004 201,324 210,004 fft 120,687 127,772 120,655 127,708 fib 98,225 97,321 98,225 97,321 flat-array 97,681 96,913 97,681 96,913 hamlet 1,509,177 1,542,601 1,508,809 1,545,585 imp-for 97,969 97,105 97,969 97,105 knuth-bendix 177,004 186,044 177,004 186,044 lexgen 291,003 318,683 291,003 318,683 life 122,257 118,777 122,257 118,777 logic 182,497 182,665 182,497 182,665 mandelbrot 97,857 100,545 97,841 100,529 matrix-multiply 99,969 102,225 99,969 102,225 md5 132,252 142,588 132,252 142,588 merge 99,601 106,417 99,601 106,417 mlyacc 663,259 704,187 663,259 704,187 model-elimination 865,986 953,682 866,002 953,666 mpuz 104,241 112,273 104,241 112,273 nucleic 273,760 256,196 273,760 256,196 output1 141,056 148,688 141,056 148,688 peek 137,804 143,212 137,804 143,212 psdes-random 101,169 99,665 101,169 99,665 ratio-regions 125,649 135,905 125,649 135,905 ray 249,400 257,839 248,728 258,223 raytrace 378,114 397,078 374,530 392,054 simple 347,593 377,012 346,777 376,548 smith-normal-form 276,332 292,884 276,332 292,884 tailfib 97,713 96,897 97,713 96,897 tak 98,273 97,273 98,273 97,273 tensor 167,507 174,971 167,507 174,971 tsp 144,827 151,658 144,363 151,354 tyan 217,644 229,268 217,644 229,268 vector-concat 99,617 98,457 99,617 98,457 vector-rev 99,217 98,281 99,217 98,281 vliw 528,426 616,042 526,874 614,490 wc-input1 164,522 169,826 164,522 169,826 wc-scanStream 175,258 184,914 175,258 184,914 zebra 217,196 219,948 217,196 219,948 zern 135,302 140,707 135,318 140,403 compile time benchmark MLton0 MLton1 MLton2 MLton3 barnes-hut 9.78 12.15 10.89 12.87 boyer 10.80 23.17 10.32 22.40 checksum 7.70 7.94 7.66 8.25 count-graphs 8.29 9.43 8.93 9.58 DLXSimulator 10.81 15.09 10.92 15.61 fft 8.19 9.04 8.53 9.21 fib 7.74 7.89 8.22 8.56 flat-array 7.56 7.58 8.18 8.13 hamlet 44.43 111.25 48.09 112.88 imp-for 7.75 7.79 7.71 7.96 knuth-bendix 9.51 13.01 9.86 13.41 lexgen 12.25 18.86 12.61 19.02 life 8.31 9.84 8.27 9.29 logic 10.09 13.89 9.90 14.41 mandelbrot 7.77 7.80 8.04 8.39 matrix-multiply 8.33 8.52 7.82 8.13 md5 8.64 10.02 8.74 10.50 merge 7.60 7.82 7.86 8.20 mlyacc 27.15 44.67 26.66 44.90 model-elimination 25.60 57.05 25.57 57.73 mpuz 7.92 8.23 8.05 8.34 nucleic 11.75 23.32 11.74 23.16 output1 8.57 10.34 8.72 10.81 peek 8.45 10.05 8.94 10.31 psdes-random 7.68 8.16 8.04 8.34 ratio-regions 8.97 10.33 9.42 10.98 ray 11.82 17.10 11.64 17.38 raytrace 15.24 26.74 14.96 26.36 simple 13.48 22.12 13.04 21.70 smith-normal-form 11.55 54.59 12.06 52.01 tailfib 7.63 7.94 7.85 8.10 tak 7.64 8.04 7.88 8.17 tensor 10.38 13.41 10.48 13.56 tsp 9.25 10.70 9.55 11.00 tyan 10.77 15.96 10.89 16.04 vector-concat 7.44 7.76 7.92 8.20 vector-rev 7.59 7.80 8.10 7.89 vliw 19.42 35.39 19.30 35.81 wc-input1 9.85 12.02 10.13 12.06 wc-scanStream 9.60 12.95 9.90 12.64 zebra 11.04 15.44 11.02 15.17 zern 8.28 9.75 8.85 9.81 run time benchmark MLton0 MLton1 MLton2 MLton3 barnes-hut 18.06 18.74 15.15 15.94 boyer 53.59 59.96 55.25 60.11 checksum 18.73 107.23 18.83 107.20 count-graphs 26.00 24.04 26.24 24.05 DLXSimulator 28.01 30.34 27.39 29.75 fft 14.61 15.34 14.54 15.45 fib 37.68 52.67 39.24 52.68 flat-array 29.27 46.93 29.02 47.00 hamlet 50.41 82.93 50.26 84.74 imp-for 26.97 39.41 26.73 39.56 knuth-bendix 24.96 34.07 24.90 34.01 lexgen 22.16 23.22 22.11 23.01 life 26.40 26.39 26.45 26.39 logic 23.22 25.07 23.31 25.41 mandelbrot 21.64 29.39 20.71 29.21 matrix-multiply 35.93 32.61 35.81 32.74 md5 33.22 230.39 33.24 230.60 merge 48.85 51.19 49.00 51.41 mlyacc 26.21 28.28 26.19 28.23 model-elimination 38.80 48.56 38.28 48.76 mpuz 23.48 45.41 23.44 45.35 nucleic 18.37 18.41 18.18 18.25 output1 37.22 45.94 37.34 46.09 peek 21.91 19.99 23.06 20.05 psdes-random 16.05 13.26 15.88 13.28 ratio-regions 124.21 150.12 124.00 164.76 ray 14.96 15.84 14.09 15.66 raytrace 17.23 19.49 16.49 18.15 simple 27.70 38.85 29.36 39.24 smith-normal-form 8.48 8.59 8.59 8.59 tailfib 22.48 41.96 23.79 41.95 tak 32.33 36.78 32.45 36.44 tensor 22.72 35.64 22.98 35.59 tsp 25.33 25.69 25.28 25.59 tyan 27.55 32.38 28.06 32.51 vector-concat 27.96 30.27 27.97 29.98 vector-rev 37.32 52.38 37.26 54.74 vliw 23.96 32.38 24.31 33.13 wc-input1 34.72 34.87 34.85 34.61 wc-scanStream 28.70 34.74 28.31 34.84 zebra 30.52 24.01 30.54 24.12 zern 22.54 36.43 22.59 36.90 _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
|
|
Re: Constant Folding of FP OperationsOn Sun, Jun 1, 2008 at 10:29 PM, Matthew Fluet <fluet@...> wrote:
> On Fri, 23 May 2008, Vesa Karvonen wrote: [...] > Seems like a good optimization, and a sound one. Did you run the > benchmarks and observe any speedups? No, I don't think I ran the benchmarks with this optimization, but I see that you did. However, I did try the optimization with a slightly modified version (note the parentheses in "store + (...)"): fun eval_real_poly n = let val msg = n div 10 val x1 = 4.0 val x2 = x1 val x3 = x1 val x4 = x1 val x5 = x1 val x6 = x1 fun eval (0,store) = store | eval (n,store) = let in if n mod msg = 0 then print ("iter: " ^ Int.toString n ^ "\n") else (); eval (n-1, store + ((~x1)*x4 + x2*x5 +(~x3)*x5 +(~x2)*x6 + x3*x6 + x4*(x2 + x3 + x5 + x6) + x4*(~x1)+x4*(~x4))) end in eval (n,0.0) end val _ = print (Real.toString (eval_real_poly 500000000) ^ "\n") of Sean McLauglin's example: http://mlton.org/pipermail/mlton-user/2007-December/001294.html After changing the SML expression to match the C expression (as shown above), MLton performed constant folding on the right-hand side expression and performance was much closer to C. > On amd64-linux, I get the following: > > MLton0 -- ~/devel/mlton/mlton.svn.trunk/build/bin/mlton -codegen amd64 > MLton1 -- ~/devel/mlton/mlton.svn.trunk/build/bin/mlton -codegen c > MLton2 -- ~/devel/mlton/mlton.svn.trunk/build.real-cf/bin/mlton -codegen > amd64 > MLton3 -- ~/devel/mlton/mlton.svn.trunk/build.real-cf/bin/mlton -codegen c > run time ratio > benchmark MLton0 MLton1 MLton2 MLton3 > barnes-hut 1.00 1.04 0.84 0.88 > ray 1.00 1.06 0.94 1.05 [...] > > This seems to only show a non-noise speedup on barnes-hut That would match my expectation that this optimization applies only rarely. I'm actually surprised to see such a large improvement on barnes-hut. > (and maybe ray). Probably. As we know, gcc already performs constant folding of FP, so I would not be surprised if the optimization would not necessarily improve performance with the C backend (as seems to be that case with ray). > I'd really like to know a good way of cutting down the noise in the > benchmarks; consider that fib and tailfib (which use no FP operations, > and so yield identical assembly code) show a 1.04 and 1.06 slowdown, > respectively. How are the times measured? Looking very briefly at benchmark/main.sml (line 79) it would seem to use an average of times over a 60 second sampling period. Using the average probably does not give the best results. For a deterministic benchmark it should be best to run the benchmark multiple times and only take the lowest time. Any time above the lowest possible time for a deterministic benchmark should be noise from the system (switches to other processes trashing caches, etc.). So, by taking an average, you are actually letting the noise enter the measurement. (This was mentioned in the paper on dominators http://www.cs.rice.edu/~keith/EMBED/dom.pdf based on which I implemented the new dominators algorithm for MLton a while back.) > The C-- specification of floating-point expressions makes the rounding mode > an explicit argument; e.g.: That sounds reasonable for an intermediate (or target) language. >> About the patch. The compiler changes do not compile with SML/NJ, >> which does not support 32-bit floating point (Real32), [...] > > You might consider dynamically checking the value of Real32.precision, > which would allow you to distinguish a "real" Real32 from a "dummy" > Real32 that is bound to Real64. Hmm... That approach would probably make it reasonably easy to make it compile with SML/NJ. The approaches I had in mind were somewhat more complicated. I'll have to try this. > There also seems to be a little-endian assumption for constant-folding > castToWord and castFromWord. [...] Not exactly. The assumption is that the endiannes of reals and words agree in the sense that if you convert a real/word to little endian bytes and then convert the little endian bytes to a word/real, you'll get the desired cast. The operations are (in pseudo code): castFromWord = RealFromLittle o WordToLittle castToWord = WordFromLittle o RealToLittle That should be identical to: castFromWord = RealFromBig o WordToBig castToWord = WordFromBig o RealToBig (I think that the Pack structures are also missing from SML/NJ, so they have to be faked as well.) > It shouldn't be difficult to allow Prim.apply to return more complicated > expressions. There aren't that many (static) uses of Prim.apply in the > compiler; you would simply need to allow uses of Prim.apply to yield a > sequence of IL statements, so that you could bind constants to variables > before using them in the IL primitive. Right. I was thinking about returning an expression (tree), but a sequence of expressions probably fits the ILs better. However, a difficulty with the sequence of statements approach would seem to be that Prim.apply is polymorphic in the type of variables, so something needs to be done to make it possible to effectively create new variables in Prim.apply. -Vesa Karvonen _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
|
|
Re: Constant Folding of FP OperationsOn Mon, 2 Jun 2008, Vesa Karvonen wrote:
> On Sun, Jun 1, 2008 at 10:29 PM, Matthew Fluet <fluet@...> wrote: >> On Fri, 23 May 2008, Vesa Karvonen wrote: > [...] >> Seems like a good optimization, and a sound one. Did you run the >> benchmarks and observe any speedups? > > No, I don't think I ran the benchmarks with this optimization, but I see > that you did. However, I did try the optimization with a slightly > modified version (note the parentheses in "store + (...)"): > > of Sean McLauglin's example: > > http://mlton.org/pipermail/mlton-user/2007-December/001294.html > > After changing the SML expression to match the C expression (as shown > above), MLton performed constant folding on the right-hand side expression > and performance was much closer to C. Good to see that the optimization kicked in, even if it is an artificial example. > Probably. As we know, gcc already performs constant folding of FP, so I > would not be surprised if the optimization would not necessarily improve > performance with the C backend (as seems to be that case with ray). Except that after considering Sean McLaughlin's example, we don't allow gcc to constant fold any floating-point options (in the C-codegen): http://mlton.org/cgi-bin/viewsvn.cgi?view=rev&rev=5799 This is accomplished by making all real constants globals, and accessed via an indirect load, so gcc can't see the constant value. So, since the C-codegen didn't show any improvement with FP constant-folding on the ray benchmark, I suspect that the improvement in the native codegen was just system noise. >> I'd really like to know a good way of cutting down the noise in the >> benchmarks; consider that fib and tailfib (which use no FP operations, >> and so yield identical assembly code) show a 1.04 and 1.06 slowdown, >> respectively. > > How are the times measured? Looking very briefly at benchmark/main.sml > (line 79) it would seem to use an average of times over a 60 second > sampling period. Using the average probably does not give the best > results. For a deterministic benchmark it should be best to run the > benchmark multiple times and only take the lowest time. Any time above > the lowest possible time for a deterministic benchmark should be noise > from the system (switches to other processes trashing caches, etc.). So, > by taking an average, you are actually letting the noise enter the > measurement. (This was mentioned in the paper on dominators > > http://www.cs.rice.edu/~keith/EMBED/dom.pdf > > based on which I implemented the new dominators algorithm for MLton a > while back.) That's an interesting view, to take the minimum time, rather than the average time. >>> About the patch. The compiler changes do not compile with SML/NJ, >>> which does not support 32-bit floating point (Real32), [...] >> >> You might consider dynamically checking the value of Real32.precision, >> which would allow you to distinguish a "real" Real32 from a "dummy" >> Real32 that is bound to Real64. > > Hmm... That approach would probably make it reasonably easy to make it > compile with SML/NJ. The approaches I had in mind were somewhat more > complicated. I'll have to try this. In the long term, it would be good to avoid reliance on Real32 for any purposes other than optimization. For example, we could make datatype RealX.t = T of {size: RealSize.t, value: IEEEReal.decimal_approx} which mimics the treatment of WordX.t. We currently require Real32.{fromString,isFinite} for elaborating real constants, Real32.{==,signBit} for equality on real constants, and Real32.format for emitting real constants. If Real32 is bound to Real64, then we might accept Real32.real floating-point constants that are finite in Real64 but would not be finite in Real32.[*] If Real32 is bound to Real64, then we might judge two floating-point constants to be unequal when they are equal at Real32.real; this is benign, since we only use equality of floating-point constants combine equal constants, so a conservative equality is fine. If Real32 is bound to Real64, then we might emit more digits than necessary to represent the value; again, this is benign. [*] We currently use Real<N>.isFinite to give a compile-time error for floating-point constants that do not denote finite values. The argument for this is that The Definition states: "it is a compile-time error if the constant does not make sense or does not denote a value within the machine representation chosen for the type." Of course, one might argue that "1.0e123" denotes a specific 32-bit IEEE 754 value (namely, +inf) just as much as a "1.0e~1" does -- for neither does the mathematical value of the IEEE value equal the mathematical value of the string. >> There also seems to be a little-endian assumption for constant-folding >> castToWord and castFromWord. [...] > > Not exactly. The assumption is that the endiannes of reals and words > agree in the sense that if you convert a real/word to little endian bytes > and then convert the little endian bytes to a word/real, you'll get the > desired cast. The operations are (in pseudo code): > > castFromWord = RealFromLittle o WordToLittle > castToWord = WordFromLittle o RealToLittle > > That should be identical to: > > castFromWord = RealFromBig o WordToBig > castToWord = WordFromBig o RealToBig Ah, I see. _______________________________________________ MLton mailing list MLton@... http://mlton.org/mailman/listinfo/mlton |
| Free Forum Powered by Nabble | Forum Help |