Charteris Community Server

Welcome to the Charteris Community
Welcome to Charteris Community Server Sign in | Join | Help
in Search

Chris Dickson's Blog

SQL Server: The effect of collation on CHECKSUM

Diagnosing a bug in a SQL Server stored procedure recently I unearthed an interesting ‘feature’ of the CHECKSUM function which I couldn't find documented elsewhere.

The stored procedure received some data as Unicode string arguments, and after some validation and processing stored it in tables defined with VARCHAR columns. (Yes, I know this sounds like a less than optimal design, but this was an application integration task and both the source data and the target database schema were defined by other applications). The problem was arising in another stored procedure which used the data from the tables, involving a join on a hash column which the first stored procedure had calculated using CHECKSUM. This join had the expected behaviour in most environments, but failed to generate any of the expected rows when executed on a particular SQL Server instance.

The cause of the problem was eventually tracked down to the fact that the stored procedure code was not being completely fastidious about conversions between VARCHAR and NVARCHAR in the parts of the code where the CHECKSUM hash was being generated. On most SQL instances, this was not a problem since  CHECKSUM(<VARCHAR>) yielded exactly the same value as CHECKSUM(<NVARCHAR>) for the string values concerned. It transpired, however, that for certain collations this relationship does not hold true, and the SQL Server instance where the problem emerged happened to have a default collation which exposed this difference.

The following query evolved during diagnosis of the problem, and illustrates where the potential pitfalls lie when using CHECKSUM in T-SQL code which may need to be portable between SQL Server instances with potentially differing collations:

set nocount on

create table #collations (name sysname, sql nvarchar(1000))

create table #compare (collationName sysname, vcharHash int, nvcharHash int)

insert #collations

SELECT

name,

'SELECT ''' + name + ''', CHECKSUM(''1234'' COLLATE ' + name + '), CHECKSUM(N''1234'' COLLATE ' + name + ')'

FROM ::fn_helpcollations()

 

declare c cursor for select sql from #collations

declare @sql nvarchar(1000)

 

open c

fetch next from c into @sql

while @@FETCH_STATUS = 0

begin

      -- exclude Hindi collations from consideration as they can only apply to NVARCHAR values

      if NOT @sql like 'SELECT ''Hindi_%'

      begin

            insert #compare

            exec sp_executesql @sql

      end

      fetch next from c into @sql

end

 

close c

deallocate c

 

print 'Collations where CHECKSUM(VARCHAR) differs from CHECKSUM(NVARCHAR)'

select * from #compare where vcharHash != nvcharHash

 

print 'Distinct CHECKSUM values where CHECKSUM(VARCHAR) is the same as CHECKSUM(NVARCHAR)'

select case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END AS [Sort type],  MAX(vcharHash) as [Hash Value], COUNT(*) as [Number of collations] from #compare where vcharHash = nvcharHash

group by vcharHash, case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END

 

drop table #collations

drop table #compare

If you try running this on any SQL Server instance, you will note that:

  • The relationship CHECKSUM(VARCHAR) == CHECKSUM(NVARCHAR) holds (at least for the test string "1234" and all others that I have tried) for all collations other than the SQL Server sort order collations provided for backward compatibility with older SQL Server versions
  • CHECKSUM appears to always generate a different result with binary sort order collations than it does with dictionary sort order collations

So, the moral of the story is: if you use CHECKSUM to generate hash values for use in indices or comparisons, be very careful and consistent about the collation of the expressions you pass to CHECKSUM, or you may end up with obscure latent bugs.  

Published Oct 31 2006, 11:45 AM by chrisdi
Filed under:

Comments

 

valerym said:

Does this problem exist in SQL 2005?
November 3, 2006 6:09 PM
 

chrisdi said:

@valerym: Yes, I have run the query on both Sql Server 2005 and the previous version and the results are similar.

November 3, 2006 10:31 PM
 

valerym said:

I don't think that CHECKSUM function can be safely used for comparisons. Have a look at my old post dated Nov 2002: http://www.dbforums.com/showthread.php?t=402662
November 4, 2006 8:27 AM
 

chrisdi said:

@valerym:

You're quite right up to a point: as with any hash function, CHECKSUM(x) == CHECKSUM(y) does not imply x == y.

But CHECKSUM(x) != CHECKSUM(y) does imply x != y, so it can be useful in comparisons e.g. in WHERE clauses to offer the optimiser the possibility of filtering based on a computed INT value rather than comparing long character strings.  

November 6, 2006 3:06 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit