Description: C:\Users\Chris\Desktop\Manning\Images\cover images\Byrne-PPS-720x900.jpg

 

By Dennis Byrne

This article explores securing data using keyed hashing in Python.


Take 40% off Practical Python Security by entering fccbyrne into the discount box at checkout at manning.com.


Data authentication

Suppose Alice, a computer programmer, wants to secure the data of a document management system. The system currently stores each new document with a hash value. To verify the integrity of a document the system rehashes it and compares the new hash value to the old hash value. If the hash values don’t match, the document is considered corrupt. If the hash values do match, the document is considered intact.

Alice’s system effectively detects accidental data corruption but it is less than perfect. Mallory, a malicious attacker, can potentially take advantage of Alice. Suppose Mallory gains write access to Alice’s file system. From this position she can not only alter a document, she can also replace its hash value with the hash value of the altered document. By replacing the hash value, Mallory prevents Alice from detecting that the document has been tampered with. Alice’s solution can therefore only detect accidental data corruption, it cannot detect intentional data modification.

If Alice wants to resist Mallory she’ll need to change the system to verify the integrity and the origin of each document. The system can’t just answer the question, “have the data changed?” The system must also answer the question, “who authored these data?” In other words, the system will need to ensure data integrity and data authentication.

Data authentication, sometimes referred to as message authentication, ensures a data reader can verify the identity of the data writer. This functionality requires two things: a key and a keyed hash function. In the next sections I cover key generation and keyed hashing; Alice combines these tools to resist Mallory.

Key generation

Every key should be hard to guess if it is to remain a secret. In this section I compare and contrast two types of keys: random numbers and passphrases. You’ll learn how to generate both, and when to use one or the other.

Random Numbers

There is no need to use a third party library when generating a random number; there are plenty of ways to do this from within Python itself. Only some of these methods, however, are suitable for security purposes. Python programmers traditionally use the os.urandom function as a cryptographically secure random number source. This function accepts an integer num and returns num random bytes. These bytes are sourced from the operating system. On a UNIX-like system this is /dev/urandom; on a Windows system this is CryptGenRandom.

 
 >>> import os
 >>>
 >>> os.urandom(16)
 b'\x07;`\xa3\xd1=wI\x95\xf2\x08\xde\x19\xd9\x94^'
  

An explicit high level API for generating cryptographically secure random numbers, called the secrets module, was introduced in Python 3.6. There is nothing wrong with os.urandom, but in this book I use the secrets module for all random number generation. The secrets module features three convenient functions for random number generation. All three functions accept an integer and return a random number. Random numbers can be represented as a byte array, hexadecimal text, and URL safe text. The prefix for all three function names, shown by the following code, is token_.

 
 >>> from secrets import token_bytes, token_hex, token_urlsafe
 >>>
 >>> token_bytes(16)    #A
 b'\x1d\x7f\x12\xadsu\x8a\x95[\xe6\x1b|\xc0\xaeM\x91'    #A
 >>>
 >>> token_hex(16)    #B
 '87983b1f3dcc18080f21dc0fd97a65b3'    #B
 >>>
 >>> token_urlsafe(16)    #C
 'Z_HIRhlJBMPh0GYRcbICIg'    #C
  

#A Generating 16 random bytes

#B Generating 16 random bytes of hexadecimal text

#C Generating 16 random bytes of URL safe text

Type the following command to generate 16 random bytes on your computer. I’m willing to bet you got a different number than I did.

 
 $ python -c 'import secrets; print(secrets.token_hex(16))'
 3d2486d1073fa1dcfde4b3df7989da55
  

A third way to obtain random numbers is the random module. Most of the functions in this module do not use a secure random number source. The documentation for this module clearly states it “should not be used for security purposes” (https://docs.python.org/3/library/random.html). The documentation for the secrets module asserts it “should be used in preference to the default pseudo-random number generator in the random module” (https://docs.python.org/3/library/secrets.html).

WARNING  Never use the random module for security or cryptographic purposes. This module is great for statistics but unsuitable for security or cryptography.

Passphrases

A passphrase is a sequence of random words rather than a sequence of random numbers. The following script uses the secrets module to generate a passphrase composed of four words randomly chosen from a dictionary file. The script begins by loading a dictionary file into memory. This file ships with standard Unix-like systems. Users of other operating systems will have no problem downloading similar files from the web (https://www.karamasoft.com/UltimateSpell/Dictionary.aspx). The script randomly selects words from the dictionary using the secrets.choice function. This function returns a random item from a given sequence.

Listing 1. Generating a four-word passphrase

 
 from pathlib import Path
 import secrets
  
 words = Path('/usr/share/dict/words').read_text().splitlines()    #A
  
 passphrase = ' '.join(secrets.choice(words) for i in range(4))    #B
  
 print(passphrase)

#A Loading a dictionary file into memory

#B Randomly selecting four words

Dictionary files like this are one of the tools attackers use when executing brute force attacks. Constructing a secret from the same source is therefore non-intuitive. The power of a passphrase is size. For example the passphrase “whereat isostatic custom insupportableness” is 42 bytes long. According to www.useapassphrase.com the approximate crack time of this passphrase is 163,274,072,817,384 centuries. A brute force attack against a key this long is not feasible. Key size matters.

A random number and a passphrase naturally satisfy the most basic requirement of a secret—both key types are difficult to guess. The difference between a random number and a passphrase boils down to the limitations of long term human memory.

NOTE: Random number or passphrase?  

Random numbers are hard to remember and passphrases are easy to remember. This difference determines which scenarios each key type is useful for.

Random numbers are useful when a human does not or should not remember a secret for more than a few minutes. A multi-factor authentication token and a temporary reset password value are both good applications of random numbers. Remember how secrets.token_bytes, secrets.token_hex, and secrets.token_urlsafe are all prefixed with token_? This prefix is a hint for what these functions should be used for.

Passphrases are useful when a human needs to remember a secret for a long time. Login credentials for a web site or an SSH session are both good applications of passphrases. Unfortunately most Internet users are not using passphrases. Most public web sites do not encourage passphrase usage.

It is important to understand that random numbers and passphrases don’t just solve problems when applied correctly; they also create new problems when they are applied incorrectly. Imagine the following two scenarios where a person must remember a random number. First, the random number is forgotten and the information it is used to protect becomes inaccessible. Second, the random number is hand written to a piece of paper on a system administrator’s desk where it is unlikely to remain a secret.

Imagine the following two scenarios where a passphrase is used for a short term secret. First, you receive a password reset link containing a passphrase. Second, you receive a multi-factor authentication code as a passphrase. Wouldn’t a casual observer be more likely to remember one of these keys if they see it on your screen? As passphrases, both of these keys are less likely to remain a secret.

You now know how to safely generate a key. You know when to use a random number and when to use a passphrase. Both of these skills are relevant to many parts of this book, starting with the next section.

Keyed hashing

Some hash functions accept an optional key. The key, as shown in figure 1, is an input to the hash function just like the data. As with an ordinary hash function, the output of a keyed hash function is a hash value.


Figure 1. Keyed hash functions accept a key in addition to data


NOTE  In this book, I depict each hash function with a funnel. A hash function and a funnel both accept variable-sized inputs and produce fixed-size outputs. I depict hash values as fingerprints. A hash value and a fingerprint are both small pieces of information used to identify a message or a person, respectively.

Hash values are sensitive to key values. Hash functions using different keys produce different hash values of the same data. Hash functions using the same key produce matching hash values of the same data. The following code demonstrates keyed hashing with BLAKE2, a hash function that accepts an optional key.

 
 >>> from hashlib import blake2b
 >>>
 >>> m = b'same message'
 >>> x = b'key x'    #A
 >>> y = b'key y'    #B
 >>>
 >>> blake2b(m, key=x).digest() == blake2b(m, key=x).digest()    #C
 True    #C
 >>> blake2b(m, key=x).digest() == blake2b(m, key=y).digest()    #D
 False    #D
  

#A First key

#B Second key

#C Same key, same hash value

#D Different key, different hash value

Alice, working on her document management system, can add a layer of defense against Mallory with keyed hashing. Keyed hashing allows Alice to store each document with a hash value that only she can produce. Mallory can no longer get away with altering a document and rehashing it. Without the key, Mallory has no way of producing the same hash value as Alice when validating the altered document. Alice’s code, shown in listing 2, can therefore resist accidental data corruption and malicious data modification.

Listing 2. Alice resists accidental and malicious data modification

 
 import hashlib
 from pathlib import Path
  
 def store(path, data, key):
     data_path = Path(path)
     hash_path = data_path.with_suffix('.hash')
  
     hash_value = hashlib.blake2b(data, key=key).hexdigest()    #A
  
     with data_path.open(mode='x'), hash_path.open(mode='x'):    #B
         data_path.write_bytes(data)    #B
         hash_path.write_text(hash_value)    #B
  
  
 def is_modified(path, key):
     data_path = Path(path)
     hash_path = data_path.with_suffix('.hash')
  
     data = data_path.read_bytes()    #C
     original_hash_value = hash_path.read_text()    #C
  
     hash_value = hashlib.blake2b(data, key=key).hexdigest()    #D
  
     return original_hash_value != hash_value    #E
  

#A Hashes document with the given key

#B Writes document and hash value to separate files

#C Reads document and hash value from storage

#D Recomputes new hash value with the given key

#E Compares recomputed hash value with hash value read from disk

Most hash functions are not keyed hash functions. Ordinary hash functions, like SHA-256, do not natively support a key like BLAKE2. This inspired a group of really smart people to develop Hash-based Message Authentication Code (HMAC) functions. The next section explores HMAC functions.

HMAC functions

HMAC functions are a generic way to use any ordinary hash function as though it were a keyed hash function. As figure 2 illustrates, an HMAC function requires three inputs: data, a key, and an ordinary cryptographic hash function. Yes, you read that correctly; the third input to an HMAC function is another function. The HMAC function will wrap and delegate all of the heavy lifting to the function passed to it. The output of an HMAC function is, you guessed it, a hash-based message authentication code (MAC). A MAC is really just a special kind of hash value. In this book, for sake of simplicity, I use the term hash value instead of MAC.


Figure 2. HMAC functions accept three inputs: data, a key, and a hash function


Note: HMAC functions are underrated  

Do yourself a favor and commit HMAC functions to memory. HMAC functions are the solution to many of the challenges I cover throughout my book, for topics such as encryption, session management, user registration, and password reset workflows.

Python’s answer to HMAC is the hmac module. An HMAC function is initialized by passing a key and hash function constructor reference to the hmac.new function. The following code initializes an HMAC function with data, a key, and SHA-256. The digestmod kwarg designates the underlying hash function. Any reference to a hash function constructor in the hashlib module is an acceptable argument for digestmod.

 
 >>> import hashlib
 >>> import hmac
 >>>
 >>> hmac_sha256 = hmac.new(
 ...     b'key', msg=b'message', digestmod=hashlib.sha256)
  

WARNING   The digestmod kwarg went from optional to required with the release of Python 3.8. You should always explicitly specify the digestmod kwarg to ensure your code runs smoothly on different versions of Python.

The new HMAC function instance mirrors the behavior of the hash function instance it wraps. The digest and hexdigest methods, as well as the digest_size property, shown below, all behave the same.

 
 >>> hmac_sha256.digest()    #A
 b"n\x9e\xf2\x9bu\xff\xfc[z\xba\xe5'\xd5\x8f\xda\xdb/\xe4.r\x19\x01\x19v\x91sC\x06_X\xedJ"
 >>> hmac_sha256.hexdigest()    #B
 '6e9ef29b75fffc5b7abae527d58fdadb2fe42e7219011976917343065f58ed4a'
 >>> hmac_sha256.digest_size    #C
 32
  

#A Returns the hash value in bytes

#B Returns the hash value in hexadecimal text

#C Returns the hash value size

The name of an HMAC function is a derivative of the underlying hash function. For example, you might refer to an HMAC function wrapping SHA-256 as HMAC-SHA256.

 
 >>> hmac_sha256.name
 'hmac-sha256'
  

By design, HMAC functions are commonly used for data authentication. Sometimes, as with Alice’s document management system, the data reader and the data writer are the same entity. Other times, the reader and the writer are different entities. The next section covers this use case.

Data authentication between parties

Imagine that Alice’s document management system must now receive documents from Bob. Alice has to be certain each document has not been modified in transit by Mallory. Alice and Bob agree on a protocol:

  1. Alice and Bob share a secret key.
  2. Bob hashes a document with his copy of the key and an HMAC function.
  3. Bob sends the document and the hash value to Alice.
  4. Alice hashes the document with her copy of the key and an HMAC function.
  5. Alice compares her hash value to Bob’s hash value.

This protocol is illustrated by figure 3. If the received hash value matches the recomputed hash value then Alice can conclude two things:

  • The document was sent by someone with the same key, presumably Bob.
  • Mallory couldn’t have modified the document in transit.

Figure 3. Alice verifies Bob’s identity with a shared key and an HMAC function


Bob’s implementation of his side of the protocol, shown in listing 3, uses HMAC-SHA256 to hash his document before sending it to Alice.

Listing 3. Bob uses an HMAC function before sending his document

 
 import hashlib
 import hmac
 import json
  
 hmac_sha256 = hmac.new(b'shared_key', digestmod=hashlib.sha256)    #A
 message = b'from Bob to Alice'    #A
 hmac_sha256.update(message)    #A
 hash_value = hmac_sha256.hexdigest()    #A
  
 authenticated_msg = {    #B
    'message': list(message),    #B
    'hash_value': hash_value, }    #B
 outbound_to_alice = json.dumps(authenticated_msg)    #B
  

#A Bob hashes the document

#B Hash value accompanies document in transit

Alice’s implementation of her side of the protocol, shown in listing 4, uses HMAC-SHA256 to hash the received document. If both MACs are the same value then the document is said to be authenticated.

Listing 4. Alice uses an HMAC function after receiving Bob’s document

 
 import hashlib
 import hmac
 import json
  
 authenticated_msg = json.loads(inbound_from_bob)
 message = bytes(authenticated_msg['message'])
  
 sha256 = hmac.new(b'shared_key', digestmod=hashlib.sha256)    #A
 sha256.update(message)    #A
 hash_value = sha256.hexdigest()    #A
  
 if hash_value == authenticated_msg['hash_value']:    #B
    # trust message
    ...
  

#A Alice computes her own hash value

#B Alice compares both hash values

Mallory, an intermediary, has no way to trick Alice into accepting an altered document. With no access to the key shared by Alice and Bob, Mallory cannot produce the same hash value as them for a given document. If Mallory modifies the document or the hash value in transit then the hash value Alice receives will be different than the hash value Alice computes.

Take a look at the last few lines of code in listing 4. Notice how Alice uses the == operator to compare hash values. This operator, believe it or not, leaves Alice vulnerable to Mallory in a whole new way. That, however, is a topic that you’re going to have to read the book to learn more about.

If you want to see more, you can check out the book’s contents on our browser-based liveBook reader here.