To more easily complete queries, large databases use indexes. However, some databases are too large even for regular indexes to be performant. In those cases, hashing or hash indexing can be a more powerful, convenient alternative.
Here, we’ll cover the basics of hashing, its classifications, as well as situations where it’s the most useful.
DBMS Hash Indexing Explained
In a nutshell, hashing in DBMS is a way to quickly find data on disk without using a traditional index. Instead, the database uses a hash function to calculate the location of a given record.
It usually takes the primary key, finds the hash based on a certain mathematical function, and points to the appropriate bucket where the queried data is.
What Are Hash Buckets?
In hashing, buckets are essentially used to apportion data for lookup or sorting, similar to an array index. It’s where the hashes derived from the hashing function are stored, providing fast access to the required data.
Each bucket is identified via an address; so a bucket with one address holds all index entries for a certain search key, provided that function(search key) = address.
When Is Hashing Preferable to Standard Indexing?
In specific cases, hashing is superior to indexes. More precisely, hashing works better for point lookups and primary key-based lookups.
Hashing is only usable in equality/inequality checks, unlike indexes. But hashing performs better in that regard than B-tree indexes: its cost is equal to the number of pages in the bucket, so it’s much cheaper if there’s no overflow (more on overflow later).
Hash Functions Explained
Hash functions are the means through which a key becomes a hash of length-restricted value. These functions should have the following properties:
- They remain the same for every hash being made
- They’re easy to calculate quickly
- No two hashes should be alike—collision happens when two hashes are the same
- Every unique input causes a unique output
Common Hash Functions
- Division
- Mid-Square
- Folding
- Multiplication
Types of Hashing
Broadly speaking, we divide hashing techniques in two ways: by extendability and collision resolution.
Sorted by extendability, we use the following categorization:
- Static
- Dynamic
By collision resolution, hashing can be:
- Open (separate chaining)
- Closed (open addressing)
Static Vs Dynamic Hashing
In static hashing, hash tables have a fixed size. When a table is filled up, an overflow page is allocated to take in the brimming data. But this overflow page can also reach maximum capacity, after which we add another one. This chain is called the overflow chain.
Static hashing is a fast way to complete insert and delete operations, seeing that it only needs two I/Os, those being read and write. However, you do need to know the size of the hash table in advance. If you fail to optimize the hash table, the overflow chain can get quite long, causing a dip in performance and making scaling more difficult.
To compensate for the shortcomings of static hashing, dynamic hashing takes a different approach. Dynamic hash tables aren’t fixed; instead, they expand to accommodate more data. Conversely, if a dynamic hash table is too empty, it can be resized to save memory usage.
The advantage of dynamic hashing lies in its flexibility: you don’t have to plan its size in advance, and it still maintains an O(1) lookup/insertion performance. On the other hand, dynamic tables change data location as they grow or shrink. This can make performance inconsistent as data jumps around the table.
Open Vs Closed Hashing (Separate Chain Vs. Open Address)
For hashing to be effective, we have to reduce collisions (i.e., two hashes for different inputs being the same) as much as possible.
In closed hashing, we store records within the hash table array itself, and available fields for hashes are found via probing. There should be no more than one key per bucket. In contrast, open hashing entails using separate lists to store hashes (hence the term “separate” chains) with an arbitrary possible number of keys per table.
Why Is Closed Hashing Also Called “Open Address?”
The fact that “closed” hashing and “open” addresses are synonyms sounds confusing at first. But here’s why that’s the case: if the location of a record is independent of its hash value—as is the case in closed hashing—then its address is open by definition; it isn’t fixed.
Closed Hashing and Probing
In closed hashing, to find a place for a new hash entry or look for an existing record within a bucket array, a process called probing is necessary. Through probing, we examine the buckets in a given probe sequence (mainly linear, double-hashing, or quadratic) and look for an unoccupied slot.
Probing is an integral part of specific types of collision resolution. For example, Robin Hood hashing displaces duplicate hashes based on PSL (probe sequence length), displacing those farther from the bucket where the item was originally hashed.
Best Use Cases for Hashing
In the following cases, hashing gives you the most bang for your buck:
- Graphics
- Game boards
- Message digests
- Compiler operation
- Password verification
- Linking file names and paths
Hashing algorithms excel in various scenarios where fast data retrieval, security, and efficiency are essential. Graphics processing, where quick access to pixel data or image properties is crucial, benefits from hashing's speed. Similarly, game boards, particularly in complex video games, rely on hashing to quickly determine the state of the game and make efficient moves.
In cybersecurity, hashing plays a pivotal role in generating message digests, ensuring data integrity, and detecting tampering. Moreover, during compiler operation, hashing helps expedite symbol table lookups and optimize code generation. Password verification is another prime application, enhancing security by storing and comparing password hashes instead of the actual passwords.
Lastly, when linking file names and paths in file systems, hashing enables rapid access and reduces the need for exhaustive searches. In these use cases, hashing offers a powerful solution to streamline operations and enhance both performance and security.
Conclusion
In conclusion, understanding the power of hashing in the world of database management is crucial. It provides an efficient means to access data without the traditional index, making it particularly valuable in cases of extensive databases. Hashing technology's ability to distribute and locate data quickly offers substantial advantages for point lookups and primary key-based searches.
As you explore the possibilities of implementing hashing in your database system, consider innovative solutions like IneryDB.
IneryDB leverages the efficiency of hashing to enhance data retrieval and system performance, contributing to the ever-evolving landscape of decentralized and efficient data management. So, as you navigate the world of database management, remember that hashing plays a pivotal role, and solutions like IneryDB can take your data management to new heights.
Inery•
1 year ago
Inery's Binance Use Case: Understanding the Technical Aspect
Last week, we talked about Binance’s database issue and how Inery can solve it. In this week’s blog, we are going to take a closer look at the technical aspect of Inery’s Binance use case. ...READ MORE
Share
Inery•
1 month ago
IneryDB – Where Edge Computing Meets Blockchain
Explore the benefits of edge computing and learn how Inery’s blockchain-powered DBMS enhances security, scalability, and efficiency across various applications. ...READ MORE
Share
Inery•
2 years ago
How Inery Uses Proof of Stake in its Blockchain
Everything you need to know about the decentralized consensus algorithm. ...READ MORE
Share
Inery•
1 year ago
What Life With Decentralized Data Would Look Like
You’d be surprised how much your life would change in a decentralized world. Take a peek at the future by clicking here. ...READ MORE
Share
Most popular today