Negative position error


Negative position error

by objectweb-2

Hi,

we are encountering the following error when initializing HOWL. Do you have any idea what can be causing this?

Thank you

Miro Halas

Caused by: java.lang.IllegalArgumentException: Negative position
        at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:613)
        at org.objectweb.howl.log.BlockLogBuffer.read(BlockLogBuffer.java:412)
        at org.objectweb.howl.log.LogFileManager.read(LogFileManager.java:641)
        at org.objectweb.howl.log.LogBufferManager.replay(LogBufferManager.java:792)
        at org.objectweb.howl.log.Logger.replay(Logger.java:372)
        at ...



--
You receive this message as a subscriber of the howl@... mailing list.
To unsubscribe: mailto:howl-unsubscribe@...
For general help: mailto:sympa@...?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws

Re: Negative position error

by MLGiroux

Could you provide a bit more of the stack trace showing how replay was
invoked?

Thanks
Michael Giroux


                                                                           
In reply to objectweb@..., sent 11/20/2006 02:56 PM (subject: [howl] Negative position error)





Re: Re: Negative position error

by objectweb-2

Hi,

I sent you more of the trace and also the code yesterday. Did you get it? For some reason it doesn't appear in the mailing list archive.

Thank you,

Miro




Re: Re: Negative position error

by MLGiroux

Yes, I did receive it, but it really doesn't help much.

The only thing that comes to mind is a problem we had with an old version
of Linux.  In that case we were getting some incorrect file positioning.
Is it possible this applies?  What system are you running on?

Michael


                                                                           
In reply to objectweb@..., sent 11/21/2006 12:27 PM (subject: Re: Re: [howl] Negative position error)





Re: Re: Re: Negative position error

by objectweb-2

Hi Michael,

The system is Windows Server 2003 Standard Edition. I have just noticed something that may be the cause of the problem. Could you please review my logic and see if I am on the right track?

The size of the cache file causing the problem is 2.82 GB. If the file position is an int, this may be the cause of the problem, since a position larger than 2 GB would overflow and become negative. I think I caused this myself: a long time ago I was caching quite a bit of data, and I configured HOWL with something like this (as you can see from the code I sent you):

cacheConfig.setMaxBlocksPerFile(s_iPersistorCount * s_iBundleSize * 100);

s_iPersistorCount ~= 20
s_iBundleSize ~= 100

If I understand correctly, the block size is configured using
cacheConfig.setBufferSize
and it is limited to 32, which represents a 32K block, which is 32768 bytes.

Therefore the above config would result in a maximum file size of 6553600000 bytes, or ~6GB.
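
A minimal sketch of this arithmetic, assuming the buffer size is expressed in units of 1024 bytes (so 32 corresponds to 32768 bytes). `JournalSizeCheck` and its methods are hypothetical names used only for illustration and are not part of HOWL; the comparison with `Integer.MAX_VALUE` simply reflects the 2 GB concern raised in this message.

```java
// Hypothetical helper, not part of HOWL: estimates the largest journal file
// a configuration can produce and flags sizes above the signed 32-bit range.
class JournalSizeCheck {
    static final long BYTES_PER_KB = 1024L;

    // bufferSizeKb: block size in KB (assumed unit); maxBlocksPerFile: block count.
    static long maxFileBytes(int bufferSizeKb, int maxBlocksPerFile) {
        // Widen to long before multiplying so the product cannot overflow int.
        return (long) maxBlocksPerFile * bufferSizeKb * BYTES_PER_KB;
    }

    static boolean exceedsIntRange(long bytes) {
        return bytes > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int persistorCount = 20;   // approximate values from the message
        int bundleSize = 100;
        int maxBlocksPerFile = persistorCount * bundleSize * 100;
        long bytes = maxFileBytes(32, maxBlocksPerFile);
        System.out.println("max file size in bytes: " + bytes);
        System.out.println("exceeds 2^31 - 1: " + exceedsIntRange(bytes));
    }
}
```

A configuration-time check along these lines is one way the validation suggested here could look; whether HOWL should reject or merely warn about such a size is a separate design choice.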

If this is the issue, I would recommend that HOWL check for such an illegal maximum size and perhaps throw an exception during configuration.

Please let me know what you think.

Thank you,

Miro Halas




Re: Re: Re: Negative position error

by MLGiroux

Thanks for the additional information.

File positions are always long, but you have uncovered a problem in HOWL.
The record key assigned to each record is composed of an int block
sequence, and an offset within the block.  The problem you discovered could
occur with relatively small files as well.  In HOWL, a block sequence
number is never reused.  As files roll over, the bsn continues to
increment.  This technique protects the application from requesting a block
that has been overwritten. If the sequence numbers were related to physical
addresses, then the application could request block 1 and would get some
data from block 1.  However, if the block had been overwritten, that would
not be the actual data the application wanted.
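
The protection described here can be sketched with a small ring of block slots. This is a hypothetical illustration and not HOWL's implementation: `BsnRing`, the slot layout and the modulo mapping are assumptions chosen only to show why a never-reused sequence number lets a reader detect an overwritten block.

```java
// Hypothetical illustration, not HOWL's implementation: a fixed ring of block
// slots where every block carries a block sequence number (BSN) that is never reused.
class BsnRing {
    private final int[] slotBsn;   // BSN currently stored in each physical slot
    private int nextBsn = 1;       // BSNs start at 1 and only ever increase

    BsnRing(int slots) {
        slotBsn = new int[slots];  // 0 means "slot never written"
    }

    // Writes the next block, overwriting the oldest slot once the ring is full.
    int write() {
        int bsn = nextBsn++;
        slotBsn[(bsn - 1) % slotBsn.length] = bsn;
        return bsn;
    }

    // True only if the slot still holds the block the caller asked for.
    boolean isAvailable(int bsn) {
        if (bsn < 1) {
            return false;
        }
        return slotBsn[(bsn - 1) % slotBsn.length] == bsn;
    }

    public static void main(String[] args) {
        BsnRing ring = new BsnRing(4);
        for (int i = 0; i < 5; i++) {
            ring.write();          // the fifth write reuses the slot of BSN 1
        }
        System.out.println("BSN 1 available: " + ring.isAvailable(1));
        System.out.println("BSN 5 available: " + ring.isAvailable(5));
    }
}
```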

To solve this, the bsn is constantly incremented, so position 0 of a file
would be block 1 initially, then when the file wraps around, it would be
overwritten with block 100 for example.  So if the files are around long
enough, we will get into a situation where the bsn approaches 32 bits and
the computation for seek address will result in a negative number.
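
How a seek address can turn negative may be easier to see in a small sketch. It illustrates 32-bit overflow in general, not HOWL's actual calculation, which is not shown in this thread; `SeekOverflow`, the block index of 70,000 and the 32 KB block size are assumptions. It also shows `FileChannel.read(ByteBuffer, long)` rejecting a negative position with an `IllegalArgumentException`, the exception type in the reported stack trace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration, not HOWL's code: a seek address computed in
// 32-bit arithmetic wraps negative once blockIndex * blockSize passes 2^31 - 1.
class SeekOverflow {
    static long positionIntMath(int blockIndex, int blockSize) {
        return blockIndex * blockSize;          // int multiply, may overflow
    }

    static long positionLongMath(int blockIndex, int blockSize) {
        return (long) blockIndex * blockSize;   // widened first, no overflow
    }

    // True if FileChannel refuses to read at the given position.
    static boolean rejectsPosition(long position) {
        try {
            Path tmp = Files.createTempFile("seek-demo", ".bin");
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                ch.read(ByteBuffer.allocate(16), position);
                return false;
            } catch (IllegalArgumentException e) {
                return true;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int blockSize = 32 * 1024;
        int blockIndex = 70_000;                 // a little over 2 GB into the file
        long bad = positionIntMath(blockIndex, blockSize);
        System.out.println("int math:  " + bad);
        System.out.println("long math: " + positionLongMath(blockIndex, blockSize));
        System.out.println("rejected by FileChannel: " + rejectsPosition(bad));
    }
}
```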

I'll have to develop a test case for this, then figure out how to resolve
it.
Michael


                                                                           
In reply to objectweb@..., sent 11/22/2006 11:11 AM (subject: Re: Re: Re: [howl] Negative position error)





Re: Re: Re: Re: Negative position error

by objectweb-2

Hello Michael,

thank you for your assistance with this issue. Do you have an ETA for when this bug could be resolved? Is there anything I can do to help? We have an upcoming release of our application and I would like to include the updated HOWL if possible.

Thank you,

Miro Halas




Re: Re: Re: Re: Negative position error

by MLGiroux

I was hoping there was no pressure on this.  I am only in the office for 5
more days, then I'm out till January.

Solving the problem is going to require a little thought.  The primary
issue is that the block sequence numbers increment continuously.  The value
is carried in an int, so the easiest solution to avoid it going negative is
to increment then mask 31 bits.  But that gets to a problem that I have not
considered previously -- ultimately, the sequence number wraps around to
zero.  This creates an issue with restart because I use BSN to locate the
logical end of the log by scanning until I find a block that has a BSN
lower than the previous block.  Essentially, once the journal space is
reused, the previous data will have very old BSNs with values lower than
newer blocks, so the last good block is the one with the largest BSN.
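
The scan described above, and the way a wrap to zero upsets it, can be sketched as follows. This is a hypothetical illustration, not HOWL's recovery code: `EndOfLogScan`, the six-slot journal and the sample BSN values are assumptions.

```java
// Hypothetical sketch, not HOWL's recovery code: find the logical end of a
// reused journal by scanning until a block's BSN is lower than its predecessor's.
class EndOfLogScan {
    // Increment, then mask to 31 bits so the value never goes negative.
    static int nextBsn(int bsn) {
        return (bsn + 1) & 0x7FFFFFFF;
    }

    // Returns the index of the last block before the first drop in BSN.
    static int lastGoodIndex(int[] bsnByPosition) {
        for (int i = 1; i < bsnByPosition.length; i++) {
            if (bsnByPosition[i] < bsnByPosition[i - 1]) {
                return i - 1;
            }
        }
        return bsnByPosition.length - 1;
    }

    public static void main(String[] args) {
        // Positions 0..2 were rewritten with BSNs 7..9; 3..5 still hold 4..6.
        int[] normal = {7, 8, 9, 4, 5, 6};
        System.out.println("end of log at index " + lastGoodIndex(normal));

        // The newest block (position 2) wrapped to BSN 0, so the scan sees a
        // drop one block too early and reports position 1 instead of 2.
        int max = Integer.MAX_VALUE;
        int[] wrapped = {max - 1, max, nextBsn(max), max - 4, max - 3, max - 2};
        System.out.println("after wrap, scan stops at index " + lastGoodIndex(wrapped));
    }
}
```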

That strategy needs to be augmented a bit once BSNs wrap around to zero.
The easy solution here is to include the Time field as part of the check.
This might be an easy change, but I also need to look into the seek address
calculations used by the methods that read the journal.  Currently the
calculations assume that BSN is ever increasing.  I will have to look into
this area as well.

Since the main purpose of HOWL is to support recovery, these areas need to
be very stable, so I need to develop some test cases to recreate the
situation, and verify that any changes do not break recovery once the
journal gets into this situation.

The bottom line is that I'm not sure I can get this done in the next week.
Until an update is available, I think the only avoidance is to delete and
recreate the journal files periodically.  If you ever get to a clean point
where there is no data in the journal that needs recovery, you could delete
and recreate the files.  Not a friendly solution, sorry.

Michael


                                                                           
In reply to objectweb@..., sent 12/01/2006 10:48 AM (subject: Re: Re: Re: Re: [howl] Negative position error)





Re: Re: Re: Re: Re: Negative position error

by objectweb-2

Michael,

thank you for paying attention to this problem. The issue we (and probably other users who may have come across this problem) encounter is that once it occurs, the log becomes unusable: HOWL cannot even start within the application because the replay listener throws an exception. At this point we have no choice but to delete the log file and start from scratch. This may actually be happening quite often. I have been getting reports that our app (handling millions of transactions every day) has problems restarting (thankfully when individual servers are taken offline for maintenance, not due to failure) more frequently than I would expect, and the only solution was to remove the old logs.

Regarding your solution, one thing that concerns me about the timestamp approach is that time can change (e.g. DST, synchronization, etc.), and this may cause unexpected situations. Without pretending to know much about the internals of HOWL: have you considered recording, in the log file header during a log file switch, the first and last recorded BSN for the log file and the first BSN for the current use of the log file? During recovery you could use this information to distinguish old records from new ones. The old records lie between the first and last BSN of the previous use of the log file (here you have to account for the wraparound, since first > last), and the new ones are those larger than the first BSN written in the current use of the log that are not in the previously mentioned range. The last log record satisfying these two conditions would therefore be the end of the log.

Hope you have a good vacation.

Miro




Re: Re: Re: Re: Re: Negative position error

by MLGiroux



objectweb@... wrote on 12/01/2006 01:20:38 PM:

> Michael,
>
> thank you for paying attention to this problem. The issue we (and
> probably other users who may have come across this problem) encounter is
> that once this problem occurs, the log becomes unusable since HOWL within
> the application cannot even start because the replay listener throws
> an exception. Basically at this time we do not have a choice and we have
> to delete/remove the log file and start from scratch. This may be
> happening actually quite often,

I would be surprised if the frequency was very high.  You have to write
2.1 billion blocks before the situation occurs.  If each block required
1 millisecond to write, and I doubt there is any hardware that can
achieve that, it would take 24 days for the problem to occur.

I agree that once it occurs, you are forced to start with new
log files, so this is fairly serious.  I'll figure something out.

> since I was getting reports about our app
> (handling millions of transactions every day) having problems to restart
> (thankfully when individual servers are taken offline for maintenance,
> not due to failure) more frequently than I would expect and the only
> solution was to remove the old logs.

Once the situation occurs, this is the only solution.

>
> Regarding your solution, one thing which concerns me with the timestamp
> solution is that time can change (e.g. DST, synchronization, etc.) and
> this may cause unexpected situations.

The time stamp is System.currentTimeMillis(), so it is not affected by DST.
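
A small illustration of the DST point, using `java.time` (Java 8 or later, which postdates this thread). The class name and the chosen instants are assumptions. Two instants one hour apart straddle the US fall-back of 29 October 2006: the local wall-clock time repeats, while the epoch millisecond values keep increasing. Clock synchronization is a separate matter, since `System.currentTimeMillis()` follows the system clock and can jump if that clock is adjusted.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;

// Illustration only: epoch milliseconds keep increasing across a DST change
// even though the local wall-clock time repeats.
class EpochVsWallClock {
    static long epochMillis(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    static LocalTime localTime(long epochMillis, String zone) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zone)).toLocalTime();
    }

    public static void main(String[] args) {
        // One hour apart, on either side of the switch from EDT to EST.
        long before = epochMillis("2006-10-29T05:30:00Z");
        long after = epochMillis("2006-10-29T06:30:00Z");
        String zone = "America/New_York";
        System.out.println(localTime(before, zone) + " -> " + localTime(after, zone));
        System.out.println("epoch difference in ms: " + (after - before));
    }
}
```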

> Not pretending I know much about
> internals of HOWL, have you considered recording during log file switch in
> the log file header the first and last recorded BSN for the log file and
> the first BSN for the current use of the log file? During recovery you
> could use this information to distinguish the old from the new ones since
> the old records are in between the first and last BSN for the previous
> use of the log file (here you have to account for the wrap around since
> first > last) and the new ones are the ones larger than the first written
> BSN for the current use of the log that are not in the previously
> mentioned range. Therefore the last log record satisfying these two
> conditions would be the end of the log.

The solution will certainly need to take these factors into account.
Since there have been no bugs reported in a while, I have not had to
look at the code for a while.  I'll have to get my head back into this
before I feel comfortable saying yes or no to any ideas.

First requirement is to write some test case to reproduce this.

It sounds as if you can wait for the fix.  Good.
>
> Hope you have a good vacation.
>
> Miro
>
>

Re: Re: Re: Re: Re: Negative position error

by MLGiroux

Miro,
I'm looking into a modification that should prevent this issue from
occurring.  I would like to check some info with you before I proceed.

The basic problem as I have described is that the integer BSN has rolled
over on you.  I was suggesting that if you write a new journal block every
millisecond, then the rollover occurs every 24 days or so.

My thought is to increase the size of the BSN to 40 bits.  (I cannot go to
a full 64 bits because the keys returned by Logger.put include both a BSN
and an offset within the block.)  I'm currently reserving 24 bits for the
offset; this could be reduced to 16 or 20, but let's look at what 40
bits does for us.
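
One way to picture a key that carries a 40-bit BSN and a 24-bit offset is to pack both into a single 64-bit value. This is a hypothetical layout for illustration only: the thread does not show HOWL's actual key format, and `LogKey`, `pack`, `bsn` and `offset` are assumed names.

```java
// Hypothetical layout, not HOWL's actual key format: a 64-bit key with a
// 40-bit block sequence number in the high bits and a 24-bit offset below it.
class LogKey {
    static final int OFFSET_BITS = 24;
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;
    static final long BSN_MAX = (1L << 40) - 1;

    static long pack(long bsn, int offset) {
        if (bsn < 0 || bsn > BSN_MAX) {
            throw new IllegalArgumentException("bsn out of range");
        }
        if (offset < 0 || offset > OFFSET_MASK) {
            throw new IllegalArgumentException("offset out of range");
        }
        return (bsn << OFFSET_BITS) | offset;
    }

    // Unsigned shift, so keys whose top bit is set still decode correctly.
    static long bsn(long key) {
        return key >>> OFFSET_BITS;
    }

    static int offset(long key) {
        return (int) (key & OFFSET_MASK);
    }

    public static void main(String[] args) {
        long key = pack(5_000_000_000L, 1234);   // BSN beyond the 32-bit range
        System.out.println("bsn=" + bsn(key) + " offset=" + offset(key));
    }
}
```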

If we assume you are writing a journal block every millisecond continuously
24/7, then a 40 bit BSN would roll over once every 34 years.
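
The rollover estimates can be checked with a short calculation, assuming one block per millisecond as stated. `BsnRollover` is a hypothetical name, and the year length of 365.25 days is an assumption made here. A 31-bit counter comes out at roughly 24.9 days and a 40-bit counter at roughly 34.8 years, in line with the figures quoted in this thread.

```java
// Back-of-the-envelope check of the rollover estimates, assuming one block
// is written every millisecond, around the clock.
class BsnRollover {
    static final double MS_PER_DAY = 24.0 * 60 * 60 * 1000;
    static final double DAYS_PER_YEAR = 365.25;   // assumed average year length

    // Days until a counter of the given bit width wraps.
    static double daysToRollover(int bits) {
        return Math.pow(2, bits) / MS_PER_DAY;
    }

    static double yearsToRollover(int bits) {
        return daysToRollover(bits) / DAYS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf("31-bit BSN: %.1f days%n", daysToRollover(31));
        System.out.printf("40-bit BSN: %.1f years%n", yearsToRollover(40));
    }
}
```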

If you think this solution works for you, then I will start investigating
the changes that need to be made.

Anyone else on the list who might be watching this is welcome to offer
opinions.

Thanks
Michael


                                                                           
In reply to objectweb@..., sent 12/01/2006 01:22 PM (subject: Re: Re: Re: Re: Re: [howl] Negative position error)





Re: Re: Re: Re: Re: Negative position error

by MLGiroux

Miro,
I am on vacation till Jan 2, but I decided to look at this a little more.
I managed to reproduce your negative seek issue.  This part of the problem
is strictly related to journal files larger than 2 GB.

You can avoid this problem with the current version of HOWL by changing the
configuration to use files that are each smaller than 2 GB.

The issue with block sequence numbers is always detected by HOWL and
results in an InvalidLogKeyException.  This needs to be fixed as well, but
it is not as severe as the large file issue.

I would suggest reducing the size of your files until I'm able to generate
a fix.

Not sure how much time I'll be able to put into this while on vacation
cause the honeydo list is pretty long :)

Thanks for reporting this problem.  I should be able to fix it quickly now
that I have managed to generate a test case.

Michael


                                                                           
In reply to objectweb@..., sent 12/01/2006 01:22 PM (subject: Re: Re: Re: Re: Re: [howl] Negative position error)





Re: Re: Re: Re: Negative position error

by MLGiroux

Miro,
I'm cleaning up my inbox and noticed this message.  Just in case you did
not notice, I did issue an update that resolves this problem.

Michael




                                                                           
In reply to objectweb@..., sent 12/01/2006 10:48 AM