major_clanger | Blog Alert: Digital Curation Blog

Does your work (or hobby) involve archiving data, or looking at information that other people have archived? If so, you may be interested in the Digital Curation Blog, run by some people I know at Edinburgh University. Their latest post has some interesting snippets from a workshop on digital archiving, including this thought-provoking comment on storage formats:

...a half percent error rate in a BMP file shows a smattering of black pixels, whereas in a GIF file there were serious artefacts and visible damage introduced. Same error rate on a WAV file produces a barely audible rustle effect, while on an MP3 file the sound is seriously distorted. Same error rate on a DOC or PDF file, and you get “File damaged, cannot open”. Be very afraid!
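The effect described in that quote is easy to reproduce in miniature. The sketch below is not the workshop's experiment: it assumes zlib as a stand-in for a compressed format, uses made-up sample text, and flips half a percent of the bits in both the raw bytes and the compressed stream. The helper names `flip_bits` and `is_readable` are invented for illustration.

```python
import random
import zlib


def flip_bits(data: bytes, rate: float, seed: int = 0) -> bytes:
    """Return a copy of data with round(rate * total_bits) distinct bits flipped."""
    rng = random.Random(seed)
    total_bits = len(data) * 8
    n_flips = max(1, round(total_bits * rate))
    out = bytearray(data)
    for pos in rng.sample(range(total_bits), n_flips):
        out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)


def is_readable(compressed: bytes) -> bool:
    """True if zlib can still decompress the (possibly damaged) stream."""
    try:
        zlib.decompress(compressed)
        return True
    except zlib.error:
        return False


# Hypothetical stand-in for an uncompressed file: words drawn from a small
# vocabulary, so the text compresses well but not trivially.
words = ["archive", "data", "format", "error", "bit", "file", "audio", "image"]
rng = random.Random(42)
raw = " ".join(rng.choice(words) for _ in range(20000)).encode()
packed = zlib.compress(raw)

damaged_raw = flip_bits(raw, 0.005)
damaged_packed = flip_bits(packed, 0.005)

# In the raw text, each flipped bit damages at most one character.
changed = sum(a != b for a, b in zip(raw, damaged_raw))
print(f"raw: {changed} of {len(raw)} bytes changed, the rest are intact")
print(f"compressed stream still readable: {is_readable(damaged_packed)}")
```

The raw copy comes back with a scattering of wrong characters, while the compressed copy is the "File damaged, cannot open" case: the decoder either hits an invalid stream or fails its checksum.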

Having done error-correction theory in my MSc (which included some painfully weird binary mathematics, some of which I understood for just long enough to pass the exam) it occurs to me that the overhead of adding error-correction coding to highly-compressed storage formats would be far less than the storage space saved by moving to them. An MP3 file is around ten times smaller than the WAV file it is created from, so even going from 8 bits to 8+3 bits would still give a compression of about seven times, with much improved resistance to bit errors. Anyone know anything about the use of ECC in highly-compressed lossy file formats?
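The arithmetic behind that estimate, as a quick sketch. The tenfold ratio and the 8+3 bit layout are the figures from the paragraph above; the function name `net_compression` is made up for illustration.

```python
from fractions import Fraction


def net_compression(ratio, data_bits, parity_bits):
    """Compression ratio left over once every data_bits bits carry parity_bits extra."""
    expansion = Fraction(data_bits + parity_bits, data_bits)
    return Fraction(ratio) / expansion


# MP3 roughly ten times smaller than WAV, with 3 check bits per 8 data bits.
net = net_compression(10, 8, 3)
print(f"net compression: {net} = {float(net):.2f} times")
```

So the protected file is 11/8 the size of the bare MP3, which leaves 80/11, a little over seven times smaller than the WAV.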

pjc50.livejournal.com

Well, yes: pretty much by definition in a compressed format each bit "means" more and has a greater effect on the output than in the uncompressed format.

Usually there's lots of error correction and detection going on anyway in the storage layer. The pattern on disk won't be the same as the bit pattern of the thing you're storing. The trick with adding useful error correction is to put it in a place where it's protecting against different sorts of problems from the layer below. So you have a hierarchy: for example, disks that do their own error-correction, arranged into a RAID5 array, with an offsite backup.

A side effect of the hardware doing its own error correction might well be that it's hard to get a minimally damaged block out of it - you might have an entire 1k of zeroes instead of the data you stored with one bit flipped. Protecting against that would involve distributing the redundancy all through the file. You might add 3 bits for every 8 bits of data, but store all those 11 bits in different blocks. This ups the read and write time but might be more useful for archival purposes.
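A minimal sketch of that idea of spreading each 11-bit group across different blocks, so that losing a whole block costs every codeword exactly one bit. The codeword bit values here are placeholders rather than the output of a real ECC encoder, and `interleave` and `deinterleave` are names invented for the example.

```python
def interleave(codewords):
    """Transpose the codewords: block i holds bit i of every codeword."""
    width = len(codewords[0])
    if any(len(cw) != width for cw in codewords):
        raise ValueError("all codewords must have the same length")
    return [[cw[i] for cw in codewords] for i in range(width)]


def deinterleave(blocks):
    """Inverse transpose: rebuild the codewords from the blocks."""
    return [list(bits) for bits in zip(*blocks)]


# Four hypothetical 11-bit codewords (8 data bits + 3 check bits each).
codewords = [
    [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
]
blocks = interleave(codewords)  # 11 blocks, one bit per codeword in each

# Simulate the storage layer losing one entire block (marked with None).
lost = 4
blocks[lost] = [None] * len(codewords)

recovered = deinterleave(blocks)
erasures_per_codeword = [cw.count(None) for cw in recovered]
print(erasures_per_codeword)
```

Because the position of the missing bit is known, this is an erasure rather than an unknown error, which is the easier case: even a single parity bit per codeword is enough to fill in one known-missing bit. The cost is the one mentioned above, since reading any codeword now touches eleven blocks.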

bellinghman.livejournal.com

Tangentially to this issue, I've been known to take textual data, compress it using zip, and then UUEncode it to get it across a slow link faster than the original could be carried.
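A rough sketch of why that works, under some assumptions: zlib's DEFLATE stands in for zip here, the sample text is made up, and the uuencoding is done line by line with `binascii.b2a_uu` without the usual begin/end header lines. The helper names are invented for the example.

```python
import binascii
import zlib


def uuencode_body(data: bytes) -> bytes:
    """uuencode data as lines of up to 45 input bytes (no begin/end lines)."""
    lines = [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]
    return b"".join(lines)


def uudecode_body(text: bytes) -> bytes:
    """Reverse uuencode_body, decoding one line at a time."""
    return b"".join(binascii.a2b_uu(line) for line in text.splitlines(keepends=True))


# Hypothetical textual payload; real text will compress less dramatically.
text = b"Does your work (or hobby) involve archiving data?\n" * 200

packed = zlib.compress(text)
encoded = uuencode_body(packed)

print(f"original text:      {len(text)} bytes")
print(f"compressed:         {len(packed)} bytes")
print(f"compressed + uuenc: {len(encoded)} bytes")
```

Uuencoding turns every 45 bytes into a 62-character line, an expansion of a bit under 1.4 times, so the trick pays off whenever the text compresses better than that, which ordinary prose usually does.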

autopope.livejournal.com

Purely by coincidence, this showed up on slashdot yesterday! Reed-Solomon ECC codes have a long history (they were the data integrity check in the original CD audio format) and here's an example of how to apply them to files.

blue-condition.livejournal.com

You should tell ayrton_nix about this if she doesn't already know.

marypcb.livejournal.com

This is why there's a move to XML formats for documents, away from binary, and why I'm so frustrated that people complain about Office 2007. The British Library folk are very cogent on this.

http://www.dcc.ac.uk/resource/interviews/richard-wright/ has more detail

Simon Bradshaw
