Learn Ethereum in 2024. #10. Hash functions and Ethereum.

7 min read · Mar 26, 2024

Many still have the idea that cryptography is limited to encrypting texts so that they can only be deciphered with a secret key. Historically, this was indeed the original application of cryptography, and it remains the best-known one. However, cryptography has evolved significantly and now encompasses a wide array of concepts and applications.

Cryptographic primitives

This ‘cryptography as we know it,’ involving the encryption of texts with a secret key, represents just one type of cryptographic primitive. Nowadays, there exist several cryptographic primitives, each serving as fundamental components that enable the achievement of various goals in cryptography. For instance, in the case of encrypting texts, the primary objective of the cryptographic primitive is confidentiality, aiming to conceal the original text from potential interceptors of the message. However, in establishing secure communication, confidentiality alone is not sufficient; authenticity and integrity of messages must also be ensured, necessitating the existence of other cryptographic primitives.

The focus of this article is hash functions, a cryptographic primitive designed to generate a short, practically unique identifier for any type of data, essentially providing a digital fingerprint for a text or document, regardless of its length. While encrypting messages aims to ensure their confidentiality, the primary purpose of hash functions is to guarantee data integrity. Let’s explore an example of how hash functions can be used to ensure the integrity of a file.

Data integrity

In the figure below, we observe the download screen for the Ethereum geth client, featuring its various versions. Each version includes information about the file, including an MD5 checksum. MD5, short for Message Digest Algorithm 5, is a hash function designed to identify any file (set of bytes) with a 128-bit identifier, or 16 bytes. The purpose of this identifier is crucial: it is displayed on the application’s official website to prevent third parties from offering a modified file as if it were the original. Allow me to elaborate further.

MD5 generates a digital identification of the file, sometimes called a checksum.

It is common for such files to be hosted on multiple content servers worldwide. Suppose a malicious actor alters the original file, such as replacing it with a malicious application. Using hash functions, it becomes straightforward to determine whether the file is original or tampered with. By simply running the hash function on the downloaded file and comparing it with the identifier displayed on the website, we can detect any discrepancies. If the hashes do not match, it is an indication that the file has been tampered with. In the image below, I utilized an online service, https://emn178.github.io/online-tools/md5_checksum.html, to generate the MD5 hash of the downloaded file. Subsequently, I verified that it matches the hash indicated on the official Geth page.

The hash generated by the file is the same as that indicated on the official website.
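The same check can be done locally instead of through an online service. The following is a minimal sketch using Python's standard `hashlib` module; the function names `file_digest_hex` and `matches_published_checksum` are made up for this illustration, and the file path and published checksum are whatever you downloaded and copied from the official page.

```python
import hashlib

def file_digest_hex(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large downloads need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def matches_published_checksum(path: str, published_hex: str, algorithm: str = "md5") -> bool:
    """Compare the locally computed digest with the one copied from the website."""
    # Normalize whitespace and letter case, since hex digests are often pasted
    # with trailing newlines or in upper case.
    return file_digest_hex(path, algorithm) == published_hex.strip().lower()
```

Calling `matches_published_checksum("path/to/download", "<checksum from the website>")` returns `True` only if every byte of the file is what the publisher hashed. Passing `algorithm="sha256"` performs the same check with a stronger hash function when the publisher provides one.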

Naturally, if the webpage hosting the hash value is compromised and the hash value is tampered with, the entire integrity verification scheme is compromised. This underscores the importance of employing multiple cryptographic primitives together to form a comprehensive security framework. In the scenario mentioned, webpages are typically served over an encrypted connection (HTTPS) and authenticated through certificates issued by certificate authorities.

MD5 is a hash function that is no longer considered secure due to the discovery of collisions. To comprehend this, it’s essential to grasp some indispensable properties of hash functions.

Hash function properties

Deterministic and fixed-size output. Hash functions must exhibit determinism, ensuring that the same input consistently generates the same output. This property is essential for the output to effectively serve as an identifier of the input. The input can vary widely in size, ranging from a single word, or even an empty string, to a huge file. The output of the hash functions we will consider has a fixed size. Generally, the larger the output size, the more secure the hash function. As previously mentioned, MD5 produces a 128-bit output and is no longer considered safe, in part because of its relatively small output size, as we will discuss further. Currently, the most widely used hash functions have a 256-bit output, but there are functions with larger output sizes, such as 512 bits or even more.
Pre-image resistance. It is a critical property of hash functions, ensuring that it’s computationally infeasible to find an input that generates a specific output. In simpler terms, hash functions should not be reversible. The only viable approach to attacking a hash function should involve attempting all possible inputs until finding one that produces the desired output. For instance, when hashing a name, it should be impossible to retrieve the original name from the hash value. However, a potential attack might involve hashing all conceivable names and comparing them to the generated hash.
Second pre-image resistance. It is another crucial property of hash functions, ensuring that given an input that generates a specific output, it’s computationally infeasible to find a second input that produces the same output. This property holds particular importance in blockchains. For instance, the header of each block includes a hash value that corresponds to the root of a data structure named a Merkle tree. This hash value can be used as proof that a specific transaction occurred, as only that transaction will lead to the same root in the Merkle tree. If it were possible to find a second transaction that leads to the same root hash, it could falsely “prove” the occurrence of this second transaction, even if it never happened.
Collision resistance. It is a vital property of hash functions, similar to second pre-image resistance but subtly different. It asserts that it’s computationally infeasible to find any two distinct inputs that produce the same output. The distinction from second pre-image resistance lies in the freedom given to the attacker: there, one input is fixed in advance, whereas here both inputs can be chosen arbitrarily. If any collision is discovered, indicating two inputs resulting in the same output, the hash function should be deemed insecure and discontinued. To grasp when collisions become likely, we can briefly discuss the birthday paradox in probability. Let’s consider a scenario where a group of people gathers in a room. The question is: how many people should you gather so that the probability of two people sharing the same birthday is greater than 50%? Despite there being 365 days in a year, it turns out that only 23 people are needed to achieve this probability. This might seem surprising, but collisions become likely once the number of trials is on the order of the square root of the number of possible values. Similarly, in the context of hash functions, for a hash function with a 128-bit output, finding collisions would require trying approximately 2⁶⁴ inputs, a task that is computationally feasible today. However, for functions with a 256-bit output, about 2¹²⁸ attempts are necessary, which is currently considered infeasible.
The avalanche effect. It is a crucial property of hash functions, ensuring that small changes in the input result in significant changes in the output. This means that the hash of the word “hello” differs as much from the hash of the word “Hello” (capital H) as it does from the hash of a music file. Without the avalanche effect, two similar inputs would produce similar hashes, potentially leaking information.
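Several of the properties above can be observed directly. The sketch below uses SHA-256 from Python's standard `hashlib` module (any modern hash function would do) and helper names invented for this illustration. It counts how many output bits differ between two inputs (avalanche effect), computes the birthday probability for a room of people, and searches for a collision on a hash deliberately truncated to 24 bits, an artificially weak setting chosen only so the search finishes in a fraction of a second.

```python
import hashlib

def sha256_int(data: bytes) -> int:
    """Return the SHA-256 digest of data as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which the digests of a and b differ."""
    return bin(sha256_int(a) ^ sha256_int(b)).count("1")

def birthday_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

def find_truncated_collision(bits: int = 24):
    """Find two inputs whose digests agree on their first `bits` bits.

    Returns (first_input, second_input, number_of_inputs_tried).
    """
    seen = {}
    counter = 0
    while True:
        message = str(counter).encode()
        prefix = sha256_int(message) >> (256 - bits)
        if prefix in seen:
            return seen[prefix], message, counter + 1
        seen[prefix] = message
        counter += 1

# Deterministic, fixed-size output: 32 bytes for any input length.
print(len(hashlib.sha256(b"").digest()), len(hashlib.sha256(b"x" * 10**6).digest()))

# Avalanche effect: one changed letter flips roughly half of the 256 bits.
print(differing_bits(b"hello", b"Hello"))

# Birthday paradox: the probability crosses 50% at 23 people.
print(round(birthday_probability(22), 3), round(birthday_probability(23), 3))

# Birthday bound: a 24-bit collision appears after a few thousand inputs,
# on the order of 2**12, far fewer than the 2**24 possible prefixes.
print(find_truncated_collision(24))
```

The truncated search scales exactly as the text describes: each extra bit of output multiplies the expected work by about the square root of two, which is why 256-bit outputs put collisions out of reach.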

Hash functions on Blockchain

Hash functions are extensively utilized in blockchain technology. One of their earliest applications was in proof-of-work, which involves finding a specific number that satisfies a particular condition. In Bitcoin, for instance, miners aim to discover a number (called a nonce) such that, when it is included in the block header, the resulting hash of the header is less than a predetermined target value.

The output of a hash function consists of a collection of bits, which can be interpreted as an integer. For example, with a 256-bit hash function, the output can be seen as a number ranging from 0 to 2²⁵⁶ − 1. While it’s infeasible to find an input whose hash is less than a relatively small value, such as 1 billion, it’s feasible for a standard computer to discover an input whose hash is less than 2²³⁰, for instance. However, it’s essential to note that, according to the definition of hash functions, this process can only be achieved through trial and error.
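The trial-and-error search described above can be sketched in a few lines of Python. This is a toy model, not Bitcoin's actual procedure: the header bytes, the 8-byte nonce encoding, the single SHA-256 pass, and the function name `mine` are all assumptions made for illustration (Bitcoin hashes a structured 80-byte header with SHA-256 applied twice). The target of 2²⁴⁰ is chosen so that roughly one hash in 2¹⁶ qualifies and the loop ends quickly.

```python
import hashlib

def mine(header: bytes, target: int, max_tries: int = 10_000_000):
    """Try nonces 0, 1, 2, ... until SHA-256(header + nonce) is below target.

    Returns (nonce, hash_as_integer).
    """
    for nonce in range(max_tries):
        candidate = header + nonce.to_bytes(8, "big")
        value = int.from_bytes(hashlib.sha256(candidate).digest(), "big")
        if value < target:
            return nonce, value
    raise RuntimeError("no nonce found within max_tries")

header = b"toy block header"   # hypothetical stand-in for real header fields
target = 2**240                # about 1 in 2**16 hashes falls below this

nonce, value = mine(header, target)
print("nonce:", nonce)
print("hash :", format(value, "064x"))  # starts with at least four zero hex digits
```

Finding the nonce takes many hash evaluations, but anyone can verify the result with a single one. Lowering the target by one bit doubles the expected number of attempts, which is how a network can tune the difficulty.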

Hash functions in Ethereum

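The most commonly used hash function on Ethereum is Keccak256. However, it’s crucial to exercise caution, as some libraries and older tools refer to Keccak256 as SHA-3 (older versions of Solidity, for example, exposed it as sha3), even though the standardized SHA-3 is a slightly different function. Keccak256 should also not be confused with SHA-256, which is a distinct hash function: SHA-256 is a variant of SHA-2 (Secure Hash Algorithm 2) with a 256-bit output and is extensively employed by Bitcoin.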

On the other hand, the Keccak256 function is closely related to SHA3-256, the 256-bit variant of the SHA-3 standard, although the two are not precisely the same hash function: Ethereum uses the original Keccak design, while the final SHA-3 standard changed the padding applied to the input. Despite their differences, all these hash functions are considered secure. Nonetheless, they produce different results when applied to the same input. We will explore the usage of Keccak-256 on Ethereum when discussing accounts within the network.
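The difference is easy to see by hashing the same input with each function. Python's standard `hashlib` module provides SHA-256 and SHA3-256 but not the original Keccak-256, so in the sketch below the Keccak-256 digest of the empty input is written as a constant (it is the widely published value that Ethereum uses as the hash of empty data) rather than computed. Computing Keccak-256 yourself requires a third-party library.

```python
import hashlib

message = b""  # the empty input, chosen because its digests are well known

sha2_256 = hashlib.sha256(message).hexdigest()
sha3_256 = hashlib.sha3_256(message).hexdigest()

# Not computed here: hashlib has no Keccak-256. This is the published
# Keccak-256 digest of the empty input, included for comparison only.
KECCAK256_EMPTY = "c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470"

print("SHA-256   :", sha2_256)
print("SHA3-256  :", sha3_256)
print("Keccak-256:", KECCAK256_EMPTY)
```

All three digests are 256 bits long, yet they share nothing recognizable. A contract or tool that expects Keccak-256 will therefore reject a value computed with `hashlib.sha3_256`, which is the practical reason to check which function a library actually implements.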

Hash functions can be challenging to understand, but the key takeaway from this article is their role in generating a universal identifier for any type of data. Given a hash, it is computationally impractical to determine its original input, just as it is impractical to find another input that generates the same hash. Hash functions are integral to blockchain security, a topic we will delve into further in subsequent articles of this series.