In this post I cover how to choose the correct BUCKET_COUNT for a hash index, and how to monitor and change it once the index is in use.
The general approach is to set the BUCKET_COUNT to the number of unique values there will be across just the columns in your hash index (see the SQL below), rounded up to the next power of 2. To make that easier I’ve listed the available BUCKET_COUNT sizes from 1024 upwards, together with the memory each one consumes.
select count(*) as count_unique_values
from ( select distinct system_type_id, object_id from sys.columns ) as unq
Available list of BUCKET_COUNT actual sizes starting from 1024 and up to the maximum…
1024 – 8 kilobytes
2048 – 16 kilobytes
4096 – 32 kilobytes
8192 – 64 kilobytes
16384 – 128 kilobytes
32768 – 256 kilobytes
65536 – 512 kilobytes
131072 – 1 megabyte
262144 – 2 megabytes
524288 – 4 megabytes
1048576 – 8 megabytes
2097152 – 16 megabytes
4194304 – 32 megabytes
8388608 – 64 megabytes
16777216 – 128 megabytes
33554432 – 256 megabytes
67108864 – 512 megabytes
134217728 – 1,024 megabytes
268435456 – 2,048 megabytes
536870912 – 4,096 megabytes
1073741824 – 8,192 megabytes
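If you would rather compute the rounding than read it off the table above, here is a minimal T-SQL sketch; the @unique_count value of 500,000 is just an assumed example standing in for the output of the distinct-count query.

-- Round a distinct-value count up to the next power of 2, starting from 1024 (the first size in the list above)
declare @unique_count bigint = 500000;   -- assumed example: the result of the distinct-count query
declare @bucket_count bigint = 1024;
while @bucket_count < @unique_count
    set @bucket_count = @bucket_count * 2;
select @bucket_count as suggested_bucket_count;   -- 524288 for 500,000 unique values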
Why we need to use the number of unique values as the BUCKET_COUNT
Say we have a hash index on {Firstname, Surname}. The binary concatenated value of {Firstname, Surname}, complete with collation, is passed through a hash function, and that hash function returns a value in the range zero through to one less than your defined BUCKET_COUNT. For all rows having exactly the same {Firstname, Surname} binary combination (yes, case and collation sensitive) the hash function yields the same result – that value is the position in the mapping table that forms the hash index.
That means that if your hash index columns only ever hold 2 unique values then you can only ever hash to two buckets! You may have a table with a million rows in it, but there is little point setting the BUCKET_COUNT to the next power of 2 above 1 million, i.e. 1,048,576, because the index will only ever use two of those buckets!
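To make that concrete, here is a minimal sketch of a memory-optimized table with a hash index over {Firstname, Surname}; the table name, column sizes and BUCKET_COUNT of 131,072 are illustrative assumptions only, and the BIN2 collation is there because, as far as I’m aware, SQL Server 2014 requires it for character columns used as index keys on in-memory tables.

-- Illustrative sketch only: table, index and column definitions are made up
create table dbo.person_demo
(
    person_id int not null
        constraint pk_person_demo primary key nonclustered hash with ( bucket_count = 131072 ),
    -- BIN2 collation: index key character columns on in-memory tables are case and accent sensitive
    Firstname nvarchar( 50 ) collate Latin1_General_100_BIN2 not null,
    Surname   nvarchar( 50 ) collate Latin1_General_100_BIN2 not null,
    -- the hash function maps each unique {Firstname, Surname} combination to one of the 131,072 buckets
    index ix_person_name nonclustered hash ( Firstname, Surname ) with ( bucket_count = 131072 )
) WITH ( MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA );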
I’ll not go into hash collisions in this post because I cover them in my post on “What is hashing; using modulus to partition data”.
I don’t know how many unique values I’m going to have
You must specify BUCKET_COUNT when you define a hash index, and once set it cannot be changed. You can’t drop and recreate indexes on in-memory tables, so you are a bit stuck if you get it wrong. If you make the BUCKET_COUNT too small you’ll end up with longer row chains hanging off the hash buckets, and that will degrade performance.
A BUCKET_COUNT of 8,388,608, i.e. the recommended value for 8 million unique values, is only 64 MiB – 8 bytes per hash bucket. There is no B-Tree or Bw-Tree here; the hash index is simply a one-dimensional array of 8-byte memory pointers.
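As a quick sanity check on that arithmetic (nothing here but the numbers from the paragraph above):

-- 8,388,608 buckets x 8 bytes per bucket = 64 MiB for the mapping table
select cast( 8388608 as bigint ) * 8 / 1024 / 1024 as hash_index_size_mib;   -- returns 64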
How to modify BUCKET_COUNT on your Hash Index
You need to drop and recreate the table to modify BUCKET_COUNT.
Personally I’d not copy the data out to a temporary table (# or ##), because if you are unlucky and your SQL Server instance restarts after you’ve dropped the in-memory table then you’ll have lost your data! Copy it to a permanent table in another database instead, as the example below does.
Example of changing BUCKET_COUNT
--
-- Test table to change hash bucket_count
--
create table trn_bc
(
    id int not null
        constraint pk_trn_bc primary key nonclustered hash with( bucket_count = 512 ),
    entry_date datetime2 not null default( current_timestamp )
) WITH ( MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA );

-- Populate with 5,000 rows
insert trn_bc ( id, entry_date )
select isnull( max( id ), 0 ) + 1, getdate()
from trn_bc;
go 5000

--
-- Make a copy in a separate database (not tempdb)
--
select *
into xtp_demo.dbo.copy_trn_bc
from trn_bc;

--
-- Drop and recreate table with correct bucket_count
--
drop table trn_bc;

create table trn_bc
(
    id int not null
        constraint pk_trn_bc primary key nonclustered hash with( bucket_count = 8192 ),
    entry_date datetime2 not null default( current_timestamp )
) WITH ( MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_AND_DATA );

--
-- Copy data back in
--
insert trn_bc ( id, entry_date )
select id, entry_date
from xtp_demo.dbo.copy_trn_bc;
Monitoring bucket usage
select total_bucket_count,
       empty_bucket_count,
       total_bucket_count - empty_bucket_count as used_buckets,
       avg_chain_length,
       max_chain_length
from sys.dm_db_xtp_hash_index_stats a
where object_name( a.object_id ) = 'trn_bc'
If you have a large maximum or average row chain length then either you have a lot of update activity going on and therefore a lot of row versions (I’ll cover multi-version concurrency control in a separate post), or your BUCKET_COUNT is too small and many distinct index key values are colliding into the same buckets.
If you run the script further above that creates trn_bc with 5,000 rows and a bucket_count of 8192, sys.dm_db_xtp_hash_index_stats will show that 4,949 buckets are used for the 5,000 rows – there must have been 51 hash collisions.
If over time you have a small percentage of used_buckets compared to total_bucket_count then it’s probably worth dropping your bucket_count down to save some memory; conversely, if your empty_bucket_count is close to 0 then you need to think about increasing the bucket_count.
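To turn that rule of thumb into a query, the sketch below extends the DMV query above with an empty-bucket percentage per hash index; the join to sys.indexes is only there to pick up the index name.

select object_name( s.object_id ) as table_name,
       i.name as index_name,
       s.total_bucket_count,
       s.empty_bucket_count,
       cast( 100.0 * s.empty_bucket_count / s.total_bucket_count as decimal( 5, 2 ) ) as pct_empty_buckets,
       s.avg_chain_length,
       s.max_chain_length
from sys.dm_db_xtp_hash_index_stats as s
join sys.indexes as i
    on i.object_id = s.object_id
   and i.index_id  = s.index_id;
-- close to 100% empty: bucket_count is probably oversized, consider dropping it down
-- close to 0% empty: buckets are overloaded, consider increasing the bucket_count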
Summary
That concludes this post on how to choose the correct bucket_count. The takeaway is that you need a bucket_count that covers at least the number of unique values in your hash index – not the number of rows in your table, but the number of unique values across the columns in your hash index. Monitor to make sure you have the right number of empty buckets – not too many, which may point to a hash key with low cardinality (few unique values), and not too few.