|
View:
New views
3 Messages
—
Rating Filter:
Alert me
|
|
|
Some notes concening VM :)Hello , i'm new here and have particular interests in developping a Slate VM.
I didn't decided yet if i will participate in this project in nearest future, so i just want to discuss some of my ideas with you for now. Concerning OMM (object memory model) and stack organization. I'm refer to tables which you can see on this page: http://slate.tunes.org/doc/mobius/ 5.1.1 stack format. Table 3. Block format. I see little reason why selectors must be kept apart from literals. The only place i see where they handled differently is at interpretation stage to use different opcode 'Load Selector'. If there's no extra processing except that, we can merge selectors with literals slot and remove an opcode 'Load Selector'. This is the code that shows that they handled in same manner.. static INLINE void PSInterpreter_loadLiteral_(struct Interpreter * i, unsigned long int n) { PSInterpreter_stackPush_(i, (((i -> method) -> literals) -> elements)[n]); } static INLINE void PSInterpreter_loadSelector_(struct Interpreter * i, unsigned long int n) { PSInterpreter_stackPush_(i, (((i -> method) -> selectors) -> elements)[n]); } Maybe i don't see potential optimizing patterns which may utilise such distinction? Please let me know. Same about 'Push Integer' opcode. Encoding integer constant in bytecode may be somewhat faster than extra lookup in literals array, but only in cases when your constant between 0 and 14, otherwise it will require extra processing: loading next byte from opcode array and adding its value with 15, also checking that its higher bit is 0... this extra processing may be even more costly than simply loading literal from literals array. You may say that merging all these opcodes to use single literal array will lead to higher number of literals in methods, which in turn leads to higher probability having a literal index greater than 14 in opcodes, and thus, will increase overall time of instruction processing as well as bytecode size. But we have 2 unused bytecodes, yes? Then why we can't reuse them to load literals with index higher than 14 , which are already do not fit in single byte opcode. So, for high literal indexes we can introduce Load Literal2 , and even Load Literal3 - first takes index from next byte of instruction stream, second takes next two bytes from instruction stream for insanely big indexes higher than 255. Both of these instructions can be placed under single opcode index but with different Immediate value, which will instruct interpreter how many bytes in instruction stream is reserved for index. Honestly i'm not very care about compact bytecode representation, i'm care about speed and uniform data organization. Using immediate value in opcode which if > 14 must be added to next byte requires additional processing - adding 15 to next byte instead of just loading next byte(s) and use them directly as final index value. I'd rather use 4 bits of 'Immediate value' for indicating exact number of bytes in bytecode stream reserved for index value instead of using immediate value directly and also checking following bytes for higher bit set (0x80) and so on. Why interpreter must spend its time to calculate index value (see the PSInterpreter_decodeImmediate function) , when its value already known at compilation time and we already can say if it fits in 1 byte code or 2 bytes or 3? The next things is concerning stack organization. For those of you who knows the needs of Slate better than me, may find this unsiutable or less effective or more complex. But first, read to the end. :) Lets review the Frame Format (in Table 2) The need in stack frame is the need of tracking current context, return point, current instruction pointer and previous stack frame where we must return from local return. Other values on stack, like arguments and locals is pushed by opcode instructions and they number depends from current execution point and context e.t.c. But frames always have fixed number of variables. At some moment i asked myself: what if organize separate stack for frames? Can it improve an execution speed as well as reduce complexity of some sort? So, lets divide our stack on two parts: in first part we store all variables used by methods/blocks while in second we store only stack frames, which have fixed sizes. First that is evident: we have much simplified methods for traversing between stack frames - since they having fixed size we can traverse the frames stack by simply adding/subtracting contstant value from frame stack pointer. Second: we don't need a Pop instruction in bytecode. Pop is used to remove values on stack when we returning from method/block closure, but our stack frame already knows its return point, so by returning from method we discarding all values up to point which stored in frame's return point. Simple. Third: accessing lexical scope for nested blocks and/or checking expired context requires much less processing. Fourth: most variables which responsible for current interpretation state can be moved from Interpreter struct into Frame struct leaving only pointer to current Frame and then, we don't need to reload these variables each time we enter/return from method. We simply initialize them for the new Frame on entering and decrementing Frame pointer on exiting method. No extra processing. --- there are so many things left to discuss, but i must go sleep :) I hope my ideas was worth reading them :) regards, sig ( Stasenko IGor ) |
|
|
Re: Some notes concening VM :)On Apr 28, 2007, at 7:25 PM, sig wrote: > Hello , i'm new here and have particular interests in developping a > Slate VM. > I didn't decided yet if i will participate in this project in nearest > future, so i just want to discuss some of my ideas with you for now. > > Concerning OMM (object memory model) and stack organization. > I'm refer to tables which you can see on this page: > http://slate.tunes.org/doc/mobius/ > > 5.1.1 stack format. > > Table 3. Block format. > > I see little reason why selectors must be kept apart from literals. > The only place i see where they handled differently is at > interpretation stage to use different opcode 'Load Selector'. > If there's no extra processing except that, we can merge selectors > with literals slot and remove an opcode 'Load Selector'. > > This is the code that shows that they handled in same manner.. > > static INLINE void PSInterpreter_loadLiteral_(struct Interpreter * i, > unsigned long int n) > { > PSInterpreter_stackPush_(i, (((i -> method) -> literals) -> > elements)[n]); > } > > static INLINE void PSInterpreter_loadSelector_(struct Interpreter * i, > unsigned long int n) > { > PSInterpreter_stackPush_(i, (((i -> method) -> selectors) -> > elements)[n]); > } > > Maybe i don't see potential optimizing patterns which may utilise such > distinction? Please let me know. Well, I can say that for introspection, it is easier to tell what selectors a method sends if they are in a separate array, so easier to track down a call graph without referring to source code. Granted, in the general case, plenty of those literal Symbols are also used by sendTo: and variants, or Symbols can be generated at run-time by the method, but this makes the base case easier. Also granted, just doing "method literals select: [| :each | each is: Symbol]" isn't that much harder (although is: is kind of expensive in aggregate). So I wouldn't fight much for this one. Perhaps the only other defense is how well the method format scales up, that is, how does method size grow with source length. The truly anomalous case is the bootstrap method... in an image bootstrap, what happens is that the Slate system compiles all the core source files' contents into a *single* bytecode method which is then marked as the starting point for kernel image execution, at which point it is run in order to build the system and then discarded. > Same about 'Push Integer' opcode. Encoding integer constant in > bytecode may be somewhat faster than extra lookup in literals array, > but only in cases when your constant between 0 and 14, otherwise it > will require extra processing: loading next byte from opcode array and > adding its value with 15, also checking that its higher bit is 0... > this extra processing may be even more costly than simply loading > literal from literals array. It's also a question of what the common case of integer literal is. Perhaps FF could indicate that the next number is an index into the literals array, so that tiny integers can be read directly and the rest is just an array lookup. > You may say that merging all these opcodes to use single literal array > will lead to higher number of literals in methods, which in turn leads > to higher probability having a literal index greater than 14 in > opcodes, and thus, will increase overall time of instruction > processing as well as bytecode size. Yes, and it would be true, but there are alternative approaches such as what I suggest above. > But we have 2 unused bytecodes, yes? Then why we can't reuse them to > load literals with index higher than 14 , which are already do not fit > in single byte opcode. > So, for high literal indexes we can introduce Load Literal2 , and even > Load Literal3 > - first takes index from next byte of instruction stream, second takes > next two bytes from instruction stream for insanely big indexes higher > than 255. > Both of these instructions can be placed under single opcode index but > with different Immediate value, which will instruct interpreter how > many bytes in instruction stream is reserved for index. > Honestly i'm not very care about compact bytecode representation, i'm > care about speed and uniform data organization. > Using immediate value in opcode which if > 14 must be added to next > byte requires additional processing - adding 15 to next byte instead > of just loading next byte(s) and use them directly as final index > value. > I'd rather use 4 bits of 'Immediate value' for indicating exact number > of bytes in bytecode stream reserved for index value instead of using > immediate value directly and also checking following bytes for higher > bit set (0x80) and so on. > > Why interpreter must spend its time to calculate index value (see the > PSInterpreter_decodeImmediate function) , when its value already known > at compilation time and we already can say if it fits in 1 byte code > or 2 bytes or 3? These are good points, and worth considering, but I think it'd be worth some profiling time to know what the real costs/benefits are. > The next things is concerning stack organization. > > For those of you who knows the needs of Slate better than me, may find > this unsiutable or less effective or more complex. But first, read to > the end. :) > > Lets review the Frame Format (in Table 2) > The need in stack frame is the need of tracking current context, > return point, current instruction pointer and previous stack frame > where we must return from local return. > Other values on stack, like arguments and locals is pushed by opcode > instructions and they number depends from current execution point and > context e.t.c. > But frames always have fixed number of variables. > At some moment i asked myself: what if organize separate stack for > frames? Can it improve an execution speed as well as reduce complexity > of some sort? > > So, lets divide our stack on two parts: in first part we store all > variables used by methods/blocks while in second we store only stack > frames, which have fixed sizes. > First that is evident: we have much simplified methods for traversing > between stack frames - since they having fixed size we can traverse > the frames stack by simply adding/subtracting contstant value from > frame stack pointer. > Second: we don't need a Pop instruction in bytecode. Pop is used to > remove values on stack when we returning from method/block closure, > but our stack frame already knows its return point, so by returning > from method we discarding all values up to point which stored in > frame's return point. Simple. > Third: accessing lexical scope for nested blocks and/or checking > expired context requires much less processing. > Fourth: most variables which responsible for current interpretation > state can be moved from Interpreter struct into Frame struct leaving > only pointer to current Frame and then, > we don't need to reload these variables each time we enter/return from > method. We simply initialize them for the new Frame on entering and > decrementing Frame pointer on exiting method. No extra processing. Yes, these things are all true and your analysis seems sound. I think the balancing factor which Lee had in mind was the ability to share the stack format between the bytecode system and the more complex binary compiler run-time. I don't recall his reasoning on this, or how well he documented it, however. Perhaps simply by knowing this, you might see where his mind was going when he made these decisions. > --- there are so many things left to discuss, but i must go sleep :) > > I hope my ideas was worth reading them :) Yep, thanks for sharing and for being interested in the first place. -- -Brian http://briantrice.com |
|
|
Re: Some notes concening VM :)> Well, I can say that for introspection, it is easier to tell what
> selectors a method sends if they are in a separate array, so easier > to track down a call graph without referring to source code. Granted, > in the general case, plenty of those literal Symbols are also used by > sendTo: and variants, or Symbols can be generated at run-time by the > method, but this makes the base case easier. Also granted, just doing > "method literals select: [| :each | each is: Symbol]" isn't that much > harder (although is: is kind of expensive in aggregate). So I > wouldn't fight much for this one. as i see the CompiledMethod structure holds a sourceTree object, which i suppose holds exact answers on what literals is selectors and what is symbols or something else. Im just against sacrificing the runtime model simplicity in favor of analysis, which can be simply accessible through other means. > > Why interpreter must spend its time to calculate index value (see the > > PSInterpreter_decodeImmediate function) , when its value already known > > at compilation time and we already can say if it fits in 1 byte code > > or 2 bytes or 3? > > These are good points, and worth considering, but I think it'd be > worth some profiling time to know what the real costs/benefits are. > My approach is simple: minimalize number of operations required to evaluate single opcode instruction.The less we have - the more we gain. > > The next things is concerning stack organization. > Yes, these things are all true and your analysis seems sound. > > I think the balancing factor which Lee had in mind was the ability to > share the stack format between the bytecode system and the more > complex binary compiler run-time. I don't recall his reasoning on > this, or how well he documented it, however. Perhaps simply by > knowing this, you might see where his mind was going when he made > these decisions. > Using stack to run both CPU and Slate bytecode is dangerous. In case of exception it may be too hard to unwind stack to some safe point. If you know how this can be safely done, tell me. I don't think that its impossible, it just demands some research. Also GC at mark phase must know the set of its root objects , this means that it must scan stack to determine roots.. And it can be too hard and expensive to scan stack where real objects mixed with foreign data. But! If we use dual stack this can be quite possible. Because each stack frame have information of its starting position in data stack and ending position (point where it's locals and pushed values ended, and thus anything beyond this point is not of GC's interest). Then we can use data stack to operate on anything we want, we just need to ensure that we don't exceed its size when executing system/vm subroutines. And one more thing: we dont need to check the stack boundaries each time we pushing something on stack. The maximum stack depth can be easily calculated by compiler for each compiled method. So by entering method we just need to ensure that stack is capable of holding this maximum. This value can be encoded in first bytecode instruction in method, or just by adding field to the CompiledMethod struct. |
| Free Forum Powered by Nabble | Forum Help |