Diagnosing a bug in a SQL Server stored procedure recently I unearthed an interesting ‘feature’ of the CHECKSUM function which I couldn't find documented elsewhere.
The stored procedure received some data as Unicode string arguments, and after some validation and processing stored it in tables defined with VARCHAR columns. (Yes, I know this sounds like a less than optimal design, but this was an application integration task and both the source data and the target database schema were defined by other applications). The problem was arising in another stored procedure which used the data from the tables, involving a join on a hash column which the first stored procedure had calculated using CHECKSUM. This join had the expected behaviour in most environments, but failed to generate any of the expected rows when executed on a particular SQL Server instance.
The cause of the problem was eventually tracked down to the fact that the stored procedure code was not being completely fastidious about conversions between VARCHAR and NVARCHAR in the parts of the code where the CHECKSUM hash was being generated. On most SQL instances, this was not a problem since CHECKSUM(<VARCHAR>) yielded exactly the same value as CHECKSUM(<NVARCHAR>) for the string values concerned. It transpired, however, that for certain collations this relationship does not hold true, and the SQL Server instance where the problem emerged happened to have a default collation which exposed this difference.
The following query evolved during diagnosis of the problem, and illustrates where the potential pitfalls lie when using CHECKSUM in T-SQL code which may need to be portable between SQL Server instances with potentially differing collations:
set nocount on
create table #collations (name sysname, sql nvarchar(1000))
create table #compare (collationName sysname, vcharHash int, nvcharHash int)
insert #collations
SELECT
name,
'SELECT ''' + name + ''', CHECKSUM(''1234'' COLLATE ' + name + '), CHECKSUM(N''1234'' COLLATE ' + name + ')'
FROM ::fn_helpcollations()
declare c cursor for select sql from #collations
declare @sql nvarchar(1000)
open
c fetch next from c into @sql
while @@FETCH_STATUS = 0
begin
-- exclude Hindi collations from consideration as they can only apply to NVARCHAR values
if NOT @sql like 'SELECT ''Hindi_%'
begin
insert #compare
exec sp_executesql @sql
end
fetch next from c into @sql
end
close c
deallocate c
print 'Collations where CHECKSUM(VARCHAR) differs from CHECKSUM(NVARCHAR)'
select * from #compare where vcharHash != nvcharHash
print 'Distinct CHECKSUM values where CHECKSUM(VARCHAR) is the same as CHECKSUM(NVARCHAR)'
select case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END AS [Sort type], MAX(vcharHash) as [Hash Value], COUNT(*) as [Number of collations] from #compare where vcharHash = nvcharHash
group by vcharHash, case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END
drop table #collations
drop table #compare
If you try running this on any SQL Server instance, you will note that:
- The relationship CHECKSUM(VARCHAR) == CHECKSUM(NVARCHAR) holds (at least for the test string "1234" and all others that I have tried) for all collations other than the SQL Server sort order collations provided for backward compatibility with older SQL Server versions
- CHECKSUM appears to always generate a different result with binary sort order collations than it does with dictionary sort order collations
So, the moral of the story is: if you use CHECKSUM to generate hash values for use in indices or comparisons, be very careful and consistent about the collation of the expressions you pass to CHECKSUM, or you may end up with obscure latent bugs.