Checksums

A checksum or hash is a small piece of data used by the EIDC to verify data integrity and to ensure that no errors have been introduced during a dataset's transmission or storage.

What are checksums?

Checksums (also known as hashes) are "fingerprints" created by applying a procedure (called a "checksum algorithm") to a file. The algorithm reads the contents of the file and generates a short hexadecimal string: the checksum, or hash.

checksums_guidance1.png

If the algorithm is applied repeatedly to the same file (or to an identical copy of the file), it will always generate the same checksum. However, if the file is changed, even slightly, the algorithm will generate a completely different checksum.

checksums_guidance2.png
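
The behaviour described above can be illustrated with a short Python sketch. It uses the standard library's hashlib module with the MD5 algorithm (the one the EIDC uses, as noted later on this page). The helper name checksum and the sample data are made up purely for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the MD5 checksum of some bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Made-up sample content, standing in for the contents of a data file.
original = b"site,date,temperature\nA1,2020-01-01,4.2\n"
identical_copy = bytes(original)              # byte-for-byte copy
altered = original.replace(b"4.2", b"4.3")    # a single character changed

print(checksum(original))
print(checksum(identical_copy))   # same string as the line above
print(checksum(altered))          # a different string
```

The first two printed values match because the content is identical. The third differs even though only one character was changed.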

How does the EIDC use checksums?

Checksums are used to verify that a file or group of files has not changed. This is crucial in the EIDC because the integrity of the resources we safeguard is essential.

We create a checksum report when we receive data. This helps us to ensure that a resource has not been changed or corrupted when it moves from one location to another during secure long-term storage.

Checksums also help us to detect accidental or deliberate tampering, virus infection or corruption of resources.

What does an EIDC checksum report look like?

The EIDC uses the MD5 algorithm. When a data file is passed through the algorithm, it generates a 32-digit hexadecimal number, for example c1e75ab8269045b0b25473fcb275932d.
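
As a rough sketch of how an MD5 checksum can be calculated for a file, the Python example below reads the file in chunks and returns the 32-character hexadecimal result. The function name md5_of_file, the chunk size and the demo file are assumptions made for illustration; this is not necessarily how the EIDC generates its checksums.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical demo file, created in a temporary folder so the sketch runs as written.
with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, "example_data.csv")
    with open(example_path, "wb") as handle:
        handle.write(b"site,value\nA1,4.2\n")
    example_checksum = md5_of_file(example_path)

print(example_checksum)        # 32 hexadecimal characters
print(len(example_checksum))
```

To check one of your own files, call md5_of_file with the path to that file.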

When we accept a data deposit, we provide the depositor with a checksum report. This is a simple text file which contains a list of files in the deposit along with the hash for each file:

checksum report

You can use this report to verify that the data we received has not been corrupted during delivery and is the same as the data you submitted for deposit.
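
One way to carry out this check is sketched below in Python. The layout of the report is an assumption: the sketch expects one line per file, with the MD5 hash followed by whitespace and then the file name. An actual EIDC report may be laid out differently, in which case parse_report would need adjusting. All function names, file names and file contents here are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_report(report_text):
    """Parse lines in the ASSUMED form '<md5 hash> <file name>'."""
    expected = {}
    for line in report_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        file_hash, file_name = line.split(maxsplit=1)
        expected[file_name] = file_hash.lower()
    return expected


def verify_deposit(report_text, data_folder):
    """Return {file name: True or False}; True means the file matches the report."""
    results = {}
    for file_name, expected_hash in parse_report(report_text).items():
        file_path = os.path.join(data_folder, file_name)
        results[file_name] = (
            os.path.isfile(file_path) and md5_of_file(file_path) == expected_hash
        )
    return results


# Demo with made-up files and a made-up report.
report = (
    "5eb63bbbe01eeed093cb22bb8f5acdc3  rainfall.csv\n"
    "d41d8cd98f00b204e9800998ecf8427e  readme.txt\n"
)
with tempfile.TemporaryDirectory() as folder:
    for name, content in [("rainfall.csv", b"hello world"), ("readme.txt", b"")]:
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(content)
    results = verify_deposit(report, folder)

    # Simulate corruption by appending one byte to a file.
    with open(os.path.join(folder, "rainfall.csv"), "ab") as handle:
        handle.write(b"!")
    results_after_change = verify_deposit(report, folder)

print(results)                # every file matches the report
print(results_after_change)   # rainfall.csv no longer matches
```

A value of False for any file means that your local copy differs from the file described in the report, or that the file is missing from the folder.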