Some notes concening VM :)

View: New views
3 Messages — Rating Filter:   Alert me  

Some notes concening VM :)

by Igor Stasenko :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello , i'm new here and have particular interests in developping a Slate VM.
I didn't decided yet if i will participate in this project in nearest
future, so i just want to discuss some of my ideas with you for now.

Concerning OMM (object memory model) and stack organization.
I'm refer to tables which you can see on this page:
http://slate.tunes.org/doc/mobius/

5.1.1 stack format.

Table 3. Block format.

I see little reason why selectors must be kept apart from literals.
The only place i see where they handled differently is at
interpretation stage to use different opcode 'Load Selector'.
If there's no extra processing except that, we can merge selectors
with literals slot and remove an opcode 'Load Selector'.

This is the code that shows that they handled in same manner..

static INLINE void PSInterpreter_loadLiteral_(struct Interpreter * i,
unsigned long int n)
{
  PSInterpreter_stackPush_(i, (((i -> method) -> literals) -> elements)[n]);
}

static INLINE void PSInterpreter_loadSelector_(struct Interpreter * i,
unsigned long int n)
{
  PSInterpreter_stackPush_(i, (((i -> method) -> selectors) -> elements)[n]);
}

Maybe i don't see potential optimizing patterns which may utilise such
distinction? Please let me know.

Same about 'Push Integer' opcode. Encoding integer constant in
bytecode may be somewhat faster than extra lookup in literals array,
but only in cases when your constant between 0 and 14, otherwise it
will require extra processing: loading next byte from opcode array and
adding its value with 15, also checking that its higher bit is 0...
this extra processing may be even more costly than simply loading
literal from literals array.

You may say that merging all these opcodes to use single literal array
will lead to higher number of literals in methods, which in turn leads
to higher probability having a literal index greater than 14 in
opcodes, and thus, will increase overall time of instruction
processing as well as bytecode size.
But we have 2 unused bytecodes, yes? Then why we can't reuse them to
load literals with index higher than 14 , which are already do not fit
in single byte opcode.
So, for high literal indexes we can introduce Load Literal2 , and even
Load Literal3
- first takes index from next byte of instruction stream, second takes
next two bytes from instruction stream for insanely big indexes higher
than 255.
Both of these instructions can be placed under single opcode index but
with different Immediate value, which will instruct interpreter how
many bytes in instruction stream is reserved for index.
Honestly i'm not very care about compact bytecode representation, i'm
care about speed and uniform data organization.
Using immediate value in opcode which if > 14 must be added to next
byte requires additional processing - adding 15 to next byte instead
of just loading next byte(s) and use them directly as final index
value.
I'd rather use 4 bits of 'Immediate value' for indicating exact number
of bytes in bytecode stream reserved for index value instead of using
immediate value directly and also checking following bytes for higher
bit set (0x80) and so on.

Why interpreter must spend its time to calculate index value (see the
PSInterpreter_decodeImmediate function) , when its value already known
at compilation time and we already can say if it fits in 1 byte code
or 2 bytes or 3?


The next things is concerning stack organization.

For those of you who knows the needs of Slate better than me, may find
this unsiutable or less effective or more complex. But first, read to
the end. :)

Lets review the Frame Format (in Table 2)
The need in stack frame is the need of tracking current context,
return point, current instruction pointer and previous stack frame
where we must return from local return.
Other values on stack, like arguments and locals is pushed by opcode
instructions and they number depends from current execution point and
context e.t.c.
But frames always have fixed number of variables.
At some moment i asked myself: what if organize separate stack for
frames? Can it improve an execution speed as well as reduce complexity
of some sort?

So, lets divide our stack on two parts: in first part we store all
variables used by methods/blocks while in second we store only stack
frames, which have fixed sizes.
First that is evident: we have much simplified methods for traversing
between stack frames - since they having fixed size we can traverse
the frames stack by simply adding/subtracting contstant value from
frame stack pointer.
Second: we don't need a Pop instruction in bytecode. Pop is used to
remove values on stack when we returning from method/block closure,
but our stack frame already knows its return point, so by returning
from method we discarding all values up to point which stored in
frame's return point. Simple.
Third: accessing lexical scope for nested blocks and/or checking
expired context requires much less processing.
Fourth: most variables which responsible for current interpretation
state can be moved from Interpreter struct into Frame struct leaving
only pointer to current Frame and then,
we don't need to reload these variables each time we enter/return from
method. We simply initialize them for the new Frame on entering and
decrementing Frame pointer on exiting method. No extra processing.

--- there are so many things left to discuss, but i must go sleep :)

I hope my ideas was worth reading them :)

regards,
sig ( Stasenko IGor )


Re: Some notes concening VM :)

by Brian Rice :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Apr 28, 2007, at 7:25 PM, sig wrote:

> Hello , i'm new here and have particular interests in developping a  
> Slate VM.
> I didn't decided yet if i will participate in this project in nearest
> future, so i just want to discuss some of my ideas with you for now.
>
> Concerning OMM (object memory model) and stack organization.
> I'm refer to tables which you can see on this page:
> http://slate.tunes.org/doc/mobius/
>
> 5.1.1 stack format.
>
> Table 3. Block format.
>
> I see little reason why selectors must be kept apart from literals.
> The only place i see where they handled differently is at
> interpretation stage to use different opcode 'Load Selector'.
> If there's no extra processing except that, we can merge selectors
> with literals slot and remove an opcode 'Load Selector'.
>
> This is the code that shows that they handled in same manner..
>
> static INLINE void PSInterpreter_loadLiteral_(struct Interpreter * i,
> unsigned long int n)
> {
>  PSInterpreter_stackPush_(i, (((i -> method) -> literals) ->  
> elements)[n]);
> }
>
> static INLINE void PSInterpreter_loadSelector_(struct Interpreter * i,
> unsigned long int n)
> {
>  PSInterpreter_stackPush_(i, (((i -> method) -> selectors) ->  
> elements)[n]);
> }
>
> Maybe i don't see potential optimizing patterns which may utilise such
> distinction? Please let me know.

Well, I can say that for introspection, it is easier to tell what  
selectors a method sends if they are in a separate array, so easier  
to track down a call graph without referring to source code. Granted,  
in the general case, plenty of those literal Symbols are also used by  
sendTo: and variants, or Symbols can be generated at run-time by the  
method, but this makes the base case easier. Also granted, just doing  
"method literals select: [| :each | each is: Symbol]" isn't that much  
harder (although is: is kind of expensive in aggregate). So I  
wouldn't fight much for this one.

Perhaps the only other defense is how well the method format scales  
up, that is, how does method size grow with source length. The truly  
anomalous case is the bootstrap method... in an image bootstrap, what  
happens is that the Slate system compiles all the core source files'  
contents into a *single* bytecode method which is then marked as the  
starting point for kernel image execution, at which point it is run  
in order to build the system and then discarded.

> Same about 'Push Integer' opcode. Encoding integer constant in
> bytecode may be somewhat faster than extra lookup in literals array,
> but only in cases when your constant between 0 and 14, otherwise it
> will require extra processing: loading next byte from opcode array and
> adding its value with 15, also checking that its higher bit is 0...
> this extra processing may be even more costly than simply loading
> literal from literals array.

It's also a question of what the common case of integer literal is.  
Perhaps FF could indicate that the next number is an index into the  
literals array, so that tiny integers can be read directly and the  
rest is just an array lookup.

> You may say that merging all these opcodes to use single literal array
> will lead to higher number of literals in methods, which in turn leads
> to higher probability having a literal index greater than 14 in
> opcodes, and thus, will increase overall time of instruction
> processing as well as bytecode size.

Yes, and it would be true, but there are alternative approaches such  
as what I suggest above.

> But we have 2 unused bytecodes, yes? Then why we can't reuse them to
> load literals with index higher than 14 , which are already do not fit
> in single byte opcode.
> So, for high literal indexes we can introduce Load Literal2 , and even
> Load Literal3
> - first takes index from next byte of instruction stream, second takes
> next two bytes from instruction stream for insanely big indexes higher
> than 255.
> Both of these instructions can be placed under single opcode index but
> with different Immediate value, which will instruct interpreter how
> many bytes in instruction stream is reserved for index.
> Honestly i'm not very care about compact bytecode representation, i'm
> care about speed and uniform data organization.
> Using immediate value in opcode which if > 14 must be added to next
> byte requires additional processing - adding 15 to next byte instead
> of just loading next byte(s) and use them directly as final index
> value.
> I'd rather use 4 bits of 'Immediate value' for indicating exact number
> of bytes in bytecode stream reserved for index value instead of using
> immediate value directly and also checking following bytes for higher
> bit set (0x80) and so on.
>
> Why interpreter must spend its time to calculate index value (see the
> PSInterpreter_decodeImmediate function) , when its value already known
> at compilation time and we already can say if it fits in 1 byte code
> or 2 bytes or 3?

These are good points, and worth considering, but I think it'd be  
worth some profiling time to know what the real costs/benefits are.

> The next things is concerning stack organization.
>
> For those of you who knows the needs of Slate better than me, may find
> this unsiutable or less effective or more complex. But first, read to
> the end. :)
>
> Lets review the Frame Format (in Table 2)
> The need in stack frame is the need of tracking current context,
> return point, current instruction pointer and previous stack frame
> where we must return from local return.
> Other values on stack, like arguments and locals is pushed by opcode
> instructions and they number depends from current execution point and
> context e.t.c.
> But frames always have fixed number of variables.
> At some moment i asked myself: what if organize separate stack for
> frames? Can it improve an execution speed as well as reduce complexity
> of some sort?
>
> So, lets divide our stack on two parts: in first part we store all
> variables used by methods/blocks while in second we store only stack
> frames, which have fixed sizes.
> First that is evident: we have much simplified methods for traversing
> between stack frames - since they having fixed size we can traverse
> the frames stack by simply adding/subtracting contstant value from
> frame stack pointer.
> Second: we don't need a Pop instruction in bytecode. Pop is used to
> remove values on stack when we returning from method/block closure,
> but our stack frame already knows its return point, so by returning
> from method we discarding all values up to point which stored in
> frame's return point. Simple.
> Third: accessing lexical scope for nested blocks and/or checking
> expired context requires much less processing.
> Fourth: most variables which responsible for current interpretation
> state can be moved from Interpreter struct into Frame struct leaving
> only pointer to current Frame and then,
> we don't need to reload these variables each time we enter/return from
> method. We simply initialize them for the new Frame on entering and
> decrementing Frame pointer on exiting method. No extra processing.

Yes, these things are all true and your analysis seems sound.

I think the balancing factor which Lee had in mind was the ability to  
share the stack format between the bytecode system and the more  
complex binary compiler run-time. I don't recall his reasoning on  
this, or how well he documented it, however. Perhaps simply by  
knowing this, you might see where his mind was going when he made  
these decisions.

> --- there are so many things left to discuss, but i must go sleep :)
>
> I hope my ideas was worth reading them :)

Yep, thanks for sharing and for being interested in the first place.

--
-Brian
http://briantrice.com


Re: Some notes concening VM :)

by Igor Stasenko :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> Well, I can say that for introspection, it is easier to tell what
> selectors a method sends if they are in a separate array, so easier
> to track down a call graph without referring to source code. Granted,
> in the general case, plenty of those literal Symbols are also used by
> sendTo: and variants, or Symbols can be generated at run-time by the
> method, but this makes the base case easier. Also granted, just doing
> "method literals select: [| :each | each is: Symbol]" isn't that much
> harder (although is: is kind of expensive in aggregate). So I
> wouldn't fight much for this one.

as i see the CompiledMethod structure holds a sourceTree object, which
i suppose holds exact answers on what literals is selectors and what
is symbols or something else.
Im just against sacrificing the runtime model simplicity in favor of
analysis, which can be simply accessible through other means.

> > Why interpreter must spend its time to calculate index value (see the
> > PSInterpreter_decodeImmediate function) , when its value already known
> > at compilation time and we already can say if it fits in 1 byte code
> > or 2 bytes or 3?
>
> These are good points, and worth considering, but I think it'd be
> worth some profiling time to know what the real costs/benefits are.
>

My approach is simple: minimalize number of operations required to
evaluate single opcode instruction.The less we have - the more we
gain.

> > The next things is concerning stack organization.

> Yes, these things are all true and your analysis seems sound.
>
> I think the balancing factor which Lee had in mind was the ability to
> share the stack format between the bytecode system and the more
> complex binary compiler run-time. I don't recall his reasoning on
> this, or how well he documented it, however. Perhaps simply by
> knowing this, you might see where his mind was going when he made
> these decisions.
>

Using stack to run both CPU and Slate bytecode is dangerous. In case
of exception it may be too hard to unwind stack to some safe point. If
you know how this can be safely done, tell me. I don't think that its
impossible, it just demands some research.

Also GC at mark phase must know the set of its root objects , this
means that it  must scan stack to determine roots.. And it can be too
hard and expensive to scan stack where real objects mixed with foreign
data.
But! If we use dual stack this can be quite possible. Because each
stack frame have information of its starting position in data stack
and ending position (point where it's locals and pushed values ended,
and thus anything beyond this point is not of GC's interest).
Then we can use data stack to operate on anything we want, we just
need to ensure that we don't exceed its size when executing system/vm
subroutines.

And one more thing: we dont need to check the stack boundaries each
time we pushing something on stack. The maximum stack depth can be
easily calculated by compiler for each compiled method. So by entering
method we just need to ensure that stack is capable of holding this
maximum. This value can be encoded in first bytecode instruction in
method, or just by adding field to the CompiledMethod struct.

LightInTheBox - Buy quality products at wholesale price