

In this article, you will find a few tips to optimize HBase read/write operations, using the best practices available for designing an HBase structure. It contains a brief description of Apache HBase, a summary of the main issues that can drastically reduce its performance, and some conclusions based on our experience.
HBase: a brief description
What is Apache HBase?
HBase was born in 2008 as an open-source implementation of Google's paper "Bigtable: A Distributed Storage System for Structured Data", written by Chang et al.
Today, Apache HBase is one of the most important and widely used column-oriented NoSQL databases. It allows users and developers to:
- Manage huge datasets as key-value tuples.
- Manage real-time value retrieval by key.
- Access strongly consistent, resilient versions of their datasets.
How does it work?
Architecture
HBase has a distributed architecture based on the master/slave paradigm. Its architecture consists mainly of the following entities:
- HFile: the low-level storage unit, storing rows ordered lexicographically by key. These files can be stored on different distributed file systems, but Apache HDFS is the most commonly used.
- Region: an ordered partition of the keys contained in a table, hosted on a single machine and backed by a number of HFiles.
- Region Server: a single host machine serving a number of regions. It handles read, write, update and delete requests from clients.
- Master: a host machine that assigns regions to region servers, performs administrative (DDL) duties, and monitors the region servers as part of the failover mechanism.
- MemStore: an in-memory write buffer. It stores incoming data that hasn't been flushed to disk yet but can already be queried.
Data Model
The data structure of an HBase row can be broken down into the following parts:
- Table: a logical collection of rows (similar to a table in relational DBs).
- Namespace: a logical grouping of HBase tables (similar to a database in relational DBs).
- Key: a unique row identifier.
- Column Family: a logical grouping of similar columns, e.g. having the same type or format, stored under the same row key identifier.
- Column: a field containing a value, stored under the same row key identifier and column family.
- Timestamp: time of last insertion/update of the value, as recorded by HBase.
It is important to keep in mind that data in HBase is both stored and returned to clients in binary format.
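To make the model concrete, here is a minimal sketch using the HBase 2.x Java client (the ns:users table, the d column family and the email column are hypothetical names): every cell is addressed by row key, column family, column qualifier and timestamp, and both keys and values travel as raw bytes.

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("ns:users"))) {   // namespace:table

            Get get = new Get(Bytes.toBytes("user-0001"));                 // row key
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("email"));     // column family + column

            Result result = table.get(get);
            Cell cell = result.getColumnLatestCell(Bytes.toBytes("d"), Bytes.toBytes("email"));
            if (cell != null) {
                // Values come back as bytes; the client decides how to decode them.
                String email = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("email")));
                long timestamp = cell.getTimestamp();                      // time of last insert/update
                System.out.println(email + " @ " + timestamp);
            }
        }
    }
}
```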
Mutations
Mutations are atomic operations that can be performed on an HBase table. There are two main mutations that can be applied to a given HBase row:
- PUT: inserts or updates a single specific field of a single row with a value. The operation therefore works on a (key, field, value) triplet and creates the row if it doesn't exist. If the table is versioned, it adds a new version of the field (or, once the maximum number of versions is reached, replaces the oldest one).
- DELETE: deletes a specific row identified by its rowkey. HBase does not delete the data immediately: it writes a deletion marker (tombstone marker) to the HFile. Tombstone markers are processed later, during major compactions, when HFiles are physically rewritten without the marked data.
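A minimal sketch of both mutations with the HBase 2.x Java client (table and column names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MutationsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("ns:users"))) {

            // PUT: a (key, field, value) triplet; creates the row if it does not exist,
            // otherwise adds a new version of the field.
            Put put = new Put(Bytes.toBytes("user-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("email"), Bytes.toBytes("jane@example.com"));
            table.put(put);

            // DELETE: only writes a tombstone marker; the data is physically removed
            // later, during a major compaction.
            Delete delete = new Delete(Bytes.toBytes("user-0001"));
            table.delete(delete);
        }
    }
}
```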
To summarize, HBase is a database offering extremely fast key/value retrieval (on the order of milliseconds) over a distributed dataset, while providing strong consistency, atomicity and fault tolerance.
This sounds great on paper, but what could go wrong?
HBase common mistakes
Despite its excellent features on paper, there are a few common mistakes we can make when designing an HBase table that drastically decrease its performance. To cite a few, performance may degrade in case of:
- Bad rowkey design
- Choosing the wrong number of regions
- Using too many column families
In particular, the first two mistakes can both introduce a symptom called hotspotting:
Hotspotting happens when a large amount of read/write traffic from various clients is constantly directed to a single region server, or to a very small number of them.
It sounds just like a denial-of-service (DoS) attack, doesn't it?
So, the question is: how can we avoid this behaviour and maintain good performances?
Designing a proper HBase rowkey
HBase row keys must always be designed properly. Keys are stored as bytes and sorted lexicographically within tables, which makes single-row lookups and scans over adjacent rows fast.
When designing a rowkey, keep in mind the following points, at all times:
- Keep its size LOW → for example, prefer integers over strings: the string representation of a number occupies far more bytes than its binary integer encoding (see the sketch after this list).
- Make it EASY to split → avoid monotonically increasing keys (e.g. auto-incremented IDs, timestamps, …), because their data domain isn't bounded in advance and is therefore hard to split into balanced partitions. Prefer, when possible, strings or numbers with a fixed number of characters/digits.
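As a quick illustration of the size argument, here is a minimal sketch using HBase's Bytes utility (the key value is arbitrary):

```java
import org.apache.hadoop.hbase.util.Bytes;

public class KeySizeSketch {
    public static void main(String[] args) {
        // The same numeric key encoded as a fixed-width integer...
        byte[] asInt = Bytes.toBytes(12345678);        // 4 bytes
        // ...versus its string representation (one byte per character).
        byte[] asString = Bytes.toBytes("12345678");   // 8 bytes

        System.out.println(asInt.length + " bytes vs " + asString.length + " bytes");
    }
}
```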
If you can't avoid monotonically increasing keys, you can try one of the following approaches (a Java sketch of all three follows below):
- Salt the key → add a prefix to the key. Usually, the number of different prefixes matches the number of regions you want to divide your data into. For instance, with 3 regions we may salt:
0001 → a-0001
0002 → b-0002
0003 → c-0003
- Hash the key → apply a one-way hash function to the key. The hash function fixes the key data domain and helps spread the data across region servers, e.g.:
0001 → 25bbdcd06c32d477f7fa1c3e4a91b032
0002 → fcd04e26e900e94b9ed6dd604fed2b64
0003 → 7cd86ecb09aa48c6e620b340f6a74592
- Reverse the key → reverse the key characters, moving the most frequently changing part (the least significant digit) to the beginning, e.g.:
0001 → 1000
0002 → 2000
0003 → 3000
More details on reversing the key
An easy way to avoid hotspotting is to reverse the rowkey: this projects monotonically increasing keys into a new space with a different value distribution, which can easily be split into partitions.
PROS: clients can easily apply the reverse function themselves (whereas sharing the function used to salt/hash the keys can be inconvenient).
CONS: the original row ordering is not preserved.
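Below is a minimal Java sketch of the three techniques, assuming 3 target regions; the salting and hashing functions shown here are just one possible choice, not a prescribed implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeySketch {

    // Salting: prepend one of N prefixes, with N typically equal to the number of regions.
    // Deriving the prefix from the key itself keeps the mapping reproducible on reads.
    static String salt(String key, int regions) {
        char prefix = (char) ('a' + Math.floorMod(key.hashCode(), regions));
        return prefix + "-" + key;                          // e.g. "0001" -> "<prefix>-0001"
    }

    // Hashing: a one-way hash fixes the key domain and spreads keys across region servers.
    static String hash(String key) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(key.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();                              // e.g. "0001" -> 32-character hex digest
    }

    // Reversing: move the most frequently changing digits to the front of the key.
    static String reverse(String key) {
        return new StringBuilder(key).reverse().toString(); // e.g. "0001" -> "1000"
    }

    public static void main(String[] args) throws Exception {
        System.out.println(salt("0001", 3));
        System.out.println(hash("0001"));
        System.out.println(reverse("0001"));
    }
}
```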
By following these tips, your keys should now be evenly distributed. This brings us to the next step: how do we choose the correct number of regions?
Designing and sizing HBase regions
Unless instructed otherwise, an HBase table starts with a single region containing all the data. An automatic split occurs when a region grows above the maximum size set in the HBase properties (10 GB by default, via hbase.hregion.max.filesize).
This approach is good in most common scenarios, but, when you have to load a massive dataset, this may not be quick or proportionate enough, and can cause hotspotting.
To avoid this behaviour, we can create a table with predefined split points; this lets us decide the region count and sizing before inserting any data. This way we can optimize insert operations, parallelize write requests across regions living on different region servers, and rebalance read requests over every available region server (see the sketch below).
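As a minimal sketch (assuming a hypothetical ns:events table with a single d column family, and the a/b/c salt prefixes from the previous section), the table can be pre-split through the Java Admin API:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableDescriptor descriptor = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("ns:events"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();

            // Two split points -> three regions: [start, b), [b, c), [c, end).
            byte[][] splitPoints = {Bytes.toBytes("b"), Bytes.toBytes("c")};
            admin.createTable(descriptor, splitPoints);
        }
    }
}
```

If you prefer the HBase shell, the same table can be created with something like create 'ns:events', 'd', SPLITS => ['b', 'c'].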
Suggestion: if you have to rewrite your HBase table entirely without losing split points and other metadata, you can use the following command in the HBase shell (or the equivalent API call):
truncate_preserve 'namespace:tablename'
This command removes all data from the table while preserving the metadata about the previously defined split points: when massively updating a big HBase dataset, your job will distribute the PUT requests more evenly across the region servers, improving processing time and response latency.
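The Java Admin API exposes a similar operation; a minimal sketch (the table name is hypothetical, and note that the table has to be disabled before truncating):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TruncatePreserveSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableName table = TableName.valueOf("ns:events");
            admin.disableTable(table);          // the table must be disabled before truncating
            admin.truncateTable(table, true);   // preserveSplits = true keeps the split points
        }
    }
}
```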
On the other hand, you should not define too many split points either: an excessive number of regions generates many small files and triggers frequent HBase compactions. A compaction is an internal process in which several small files are rewritten into bigger ones; if it is triggered too often, it can result in excessive resource usage (e.g. heavy I/O on the filesystem).
Designing and sizing Column Families
Another very common HBase sizing dilemma is:
How many Column Families (CFs) should my table have?
A Column Family is a logical grouping of columns (think of SQL tables) that is stored separately and independently in its own files.
Should we make every column a column family? The answer is NO: typically this is a symptom of improper schema design, because:
- Compactions become very frequent, since each column family produces its own, smaller files
- Every data retrieval implies opening several files, with the associated overhead
To maintain good performance, keep the number of column families at three or fewer. In particular, try to group your columns based on:
- Access pattern: group together the columns that are usually required in the same request
- Datatype: separate columns holding large values (e.g. binary columns) from the others (e.g. string columns)
Furthermore, keep the names of column families as SHORT as possible: the column family name is physically stored by HBase with every single cell of a row, so keeping it short helps reduce the space occupied on storage.
Typically, our HBase tables should contain a single column family, and its name should consist of a single letter (d for data, for instance). For tables containing hundreds or thousands of fields, at most two extra column families may be added, after verifying that response times actually benefit from the change (see the sketch below).
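A minimal sketch of these guidelines (the ns:documents table and its columns are hypothetical): two short-named families grouped by datatype, and a read that only touches the family it actually needs.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnFamilySketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            // Two short-named families: "d" for small textual metadata,
            // "b" for large binary payloads.
            TableDescriptor descriptor = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("ns:documents"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("b"))
                    .build();
            admin.createTable(descriptor);

            // A read that only needs the metadata touches only the files backing family "d".
            try (Table table = connection.getTable(TableName.valueOf("ns:documents"))) {
                Get get = new Get(Bytes.toBytes("doc-0001"));
                get.addFamily(Bytes.toBytes("d"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("title"))));
            }
        }
    }
}
```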
Conclusions
In this article, we analyzed a few common mistakes and bad practices that can cause an HBase table to underperform and miss the SLAs we would expect from the tool.
In our experience, Apache HBase is an extremely valid and reliable tool that lives up to expectations, provided it is deployed with adequate resources and employed in the right use cases.
Hence, we suggest taking care of the details described in this article, because what is currently working for you may require a rework in the future, when the amount of data scales up in terms of rows, columns and value size.