Grant wrote:
>>> Even so, if the FLAC CRC matches with AR's, the editor can be
>>> confident in adding it to the MBz DB right? Or maybe the AR CRC is
>>> against whole discs as opposed to individual tracks?
>>> - Grant
>>>
>> My bad for calling everything CRC I guess, but no they don't use the same
>> hashing algorithm and will never match.
>>
>
> How does AR do its checksumming? Is it calculated based on a WAV or
> ISO of the entire disc?
>
> If it can be determined that a FLAC rip matches with AR, the embedded
> FLAC checksum, although different from whatever AR uses, could be
> added to the MBz DB with certainty right?
>
> - Grant
>
I reverse engineered the AR system a year or so ago; there's a Perl
script that performs AR checking available from
http://www.srcf.ucam.org/~cjk32/ARCue/
The checksums are the (mod 2^32) sum of each 32-bit LR sample multiplied
by its offset within the track. The first five frames of the first track
and the last five frames of the last track are ignored to prevent
problems with drives that cannot overread into the lead-in or lead-out.
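As a rough illustration of the checksum described above, here is a minimal
Python sketch. It is not the Perl script and not AR's actual code. The
function name, the 1-based multiplier, and the exact boundary of the skipped
region are assumptions made for the example; the only fixed fact used is
that a CD frame holds 588 stereo samples.

```python
SAMPLES_PER_FRAME = 588  # one CD frame (1/75 s) holds 588 stereo samples
SKIP_FRAMES = 5          # frames ignored at the very start and end of the disc


def ar_style_checksum(samples, is_first=False, is_last=False):
    """Sum of (offset * sample) mod 2^32 over one track.

    samples: sequence of unsigned 32-bit ints, each one packed L/R sample pair.
    Assumption: offsets are counted from 1, and skipped samples still
    advance the offset.
    """
    skip = SKIP_FRAMES * SAMPLES_PER_FRAME
    start = skip if is_first else 0
    end = len(samples) - skip if is_last else len(samples)
    total = 0
    for i, sample in enumerate(samples):
        if start <= i < end:
            total = (total + (i + 1) * (sample & 0xFFFFFFFF)) & 0xFFFFFFFF
    return total
```

Because each sample is weighted by its position, shifting the data by even
one sample changes the result, which is why the drive offset matters so much.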
I do like the way accurate rip works, but there are some limitations,
and I've been wondering about how an improved system might operate.
AR seems to work around the following principle. There are two kinds of
errors one can suffer from: systematic errors and random noise.
The only realistic systematic error that will be encountered is a
constant offset of the samples read (e.g. when asked for sample 0, the
drive actually returns sample 15), and EAC+AR deals with this by
establishing the drive's offset, correcting by this amount, and making
it difficult for the user to change it.
The second kind of error is random noise, caused by a damaged disc,
failing drive laser, etc. These errors manifest as random changes
in the data read, and will not be consistent across multiple reads
(ignoring any caching performed by the drive). Because these errors are
random and infrequent, if two independent reads of a disc give the same
data (or almost equivalently, the same checksum), then it is
overwhelmingly likely that both reads returned the correct
data. AR collects all checksum submissions for a given disc id, and when
it gets 2 or more identical checksums for a given track / disc id, it
considers them correct. As it is possible for multiple pressings to have
different audio data but the same disc id, it is quite possible to have
multiple valid checksums for each track on that disc.
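The matching rule above can be sketched in a few lines of Python. This is
only an illustration of the principle, not AR's implementation: the class
and method names are made up, and it assumes every submission is an
independent read (it does not detect the same user submitting twice).

```python
from collections import Counter, defaultdict


class ChecksumPool:
    """Collects submitted checksums per (disc id, track)."""

    def __init__(self, threshold=2):
        self.threshold = threshold            # matches needed to trust a value
        self._counts = defaultdict(Counter)   # (disc_id, track) -> checksum counts

    def submit(self, disc_id, track, checksum):
        self._counts[(disc_id, track)][checksum] += 1

    def verified(self, disc_id, track):
        """Return {checksum: count} for every checksum seen often enough.

        More than one entry can come back, e.g. for different pressings
        that share a disc id.
        """
        counts = self._counts.get((disc_id, track), {})
        return {c: n for c, n in counts.items() if n >= self.threshold}
```

The count returned alongside each checksum plays the role of a confidence
value: the more independent reads agree, the higher it is.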
There are a few problems with the current system.
Firstly, the measured drive read offsets used by the whole AR+EAC system
seem incorrect. The offset for one drive was established using an
ingenious but flawed mechanism that gave an incorrect value. As this
drive offset was then used as a reference to determine all others, they
all share the same error. More recent tests using a different and arguably
better method have given a different drive offset, which is much more
likely to be correct.
Secondly, AR doesn't allow any validation of the leading and trailing
five frames of audio; some drives cannot read this data, and it is hence
not included in the checksums.
Thirdly, it cannot deal (I believe) with audio hidden in the pregap.
My personal preference would be to use an AR-like system, but with MD5
hashes based upon all the data in the track (i.e. not cutting off leading
and trailing frames), and using the newly measured 'correct' offset.
Such hashes would be collected for each track of each discid, and where
2 or more match, they would be published as a correct hash for that
track. The MD5 calculated for any track would be the same as the FLAC
MD5 checksum.
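For illustration, a per-track MD5 of this kind could be computed roughly as
below. The function name is made up for the example. It assumes 16-bit
stereo CD audio hashed as interleaved little-endian signed samples, which
is how FLAC's stored MD5 is understood to be computed, and it assumes the
samples have already been offset-corrected. The result would only equal a
FLAC file's MD5 if that file holds exactly this one track.

```python
import hashlib
import struct


def track_md5(samples):
    """MD5 over every sample of a track, with nothing trimmed.

    samples: iterable of (left, right) signed 16-bit integer pairs.
    """
    digest = hashlib.md5()
    for left, right in samples:
        # Interleave L then R, each as little-endian signed 16-bit.
        digest.update(struct.pack("<hh", left, right))
    return digest.hexdigest()
```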
This system isn't ideal though, given the effort and infrastructure
already invested into the existing system. One way to take advantage of
the existing data might be to also calculate AR checksums using the
current method, and accept submissions of both as a set. The confidence
level for the AR checksums could then be applied to the MD5 hashes that
they span. For example, if the AR checksums indicated that tracks 1-3
were correct with a confidence of 50, you could then be sure that the
MD5 hash for track 2 was also correct (because the range over which the
AR checksums for tracks 1-3 are calculated wholly covers the range over
which the MD5 hash for track 2 is calculated).
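The "wholly covers" test above amounts to an interval check, sketched here
in Python. The function name and the sample positions in the usage note
are hypothetical. Ranges are half-open (start, end) positions on the disc,
and the sketch assumes both sets of ranges have already been expressed in
the same coordinates.

```python
def covers(verified_ranges, target):
    """True if the union of verified_ranges contains the whole target range.

    verified_ranges: iterable of (start, end) ranges already confirmed by
    AR checksums. target: the (start, end) range hashed by one MD5.
    """
    target_start, target_end = target
    reached = target_start
    for start, end in sorted(verified_ranges):
        if start > reached:
            break  # gap before the next verified range
        reached = max(reached, end)
        if reached >= target_end:
            return True
    return reached >= target_end
```

For example, with AR-verified tracks at (0, 1000), (1000, 2000) and
(2000, 3000), an MD5 computed over (1030, 2030) under a different offset
is covered. For the first and last tracks the ignored five frames leave a
gap, so an MD5 reaching into them would not be.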
Any thoughts?
Chris
_______________________________________________
MusicBrainz-users mailing list
MusicBrainz-users@...
http://lists.musicbrainz.org/mailman/listinfo/musicbrainz-users