Charteris Community Server

Welcome to the Charteris Community
Welcome to Charteris Community Server Sign in | Join | Help
in Search

Chris Dickson's Blog

October 2006 - Posts

  • SQL Server: The effect of collation on CHECKSUM

    Diagnosing a bug in a SQL Server stored procedure recently I unearthed an interesting ‘feature’ of the CHECKSUM function which I couldn't find documented elsewhere.

    The stored procedure received some data as Unicode string arguments, and after some validation and processing stored it in tables defined with VARCHAR columns. (Yes, I know this sounds like a less than optimal design, but this was an application integration task and both the source data and the target database schema were defined by other applications). The problem was arising in another stored procedure which used the data from the tables, involving a join on a hash column which the first stored procedure had calculated using CHECKSUM. This join had the expected behaviour in most environments, but failed to generate any of the expected rows when executed on a particular SQL Server instance.

    The cause of the problem was eventually tracked down to the fact that the stored procedure code was not being completely fastidious about conversions between VARCHAR and NVARCHAR in the parts of the code where the CHECKSUM hash was being generated. On most SQL instances, this was not a problem since  CHECKSUM(<VARCHAR>) yielded exactly the same value as CHECKSUM(<NVARCHAR>) for the string values concerned. It transpired, however, that for certain collations this relationship does not hold true, and the SQL Server instance where the problem emerged happened to have a default collation which exposed this difference.

    The following query evolved during diagnosis of the problem, and illustrates where the potential pitfalls lie when using CHECKSUM in T-SQL code which may need to be portable between SQL Server instances with potentially differing collations:

    set nocount on

    create table #collations (name sysname, sql nvarchar(1000))

    create table #compare (collationName sysname, vcharHash int, nvcharHash int)

    insert #collations

    SELECT

    name,

    'SELECT ''' + name + ''', CHECKSUM(''1234'' COLLATE ' + name + '), CHECKSUM(N''1234'' COLLATE ' + name + ')'

    FROM ::fn_helpcollations()

     

    declare c cursor for select sql from #collations

    declare @sql nvarchar(1000)

     

    open c

    fetch next from c into @sql

    while @@FETCH_STATUS = 0

    begin

          -- exclude Hindi collations from consideration as they can only apply to NVARCHAR values

          if NOT @sql like 'SELECT ''Hindi_%'

          begin

                insert #compare

                exec sp_executesql @sql

          end

          fetch next from c into @sql

    end

     

    close c

    deallocate c

     

    print 'Collations where CHECKSUM(VARCHAR) differs from CHECKSUM(NVARCHAR)'

    select * from #compare where vcharHash != nvcharHash

     

    print 'Distinct CHECKSUM values where CHECKSUM(VARCHAR) is the same as CHECKSUM(NVARCHAR)'

    select case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END AS [Sort type],  MAX(vcharHash) as [Hash Value], COUNT(*) as [Number of collations] from #compare where vcharHash = nvcharHash

    group by vcharHash, case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END

     

    drop table #collations

    drop table #compare

    If you try running this on any SQL Server instance, you will note that:

    • The relationship CHECKSUM(VARCHAR) == CHECKSUM(NVARCHAR) holds (at least for the test string "1234" and all others that I have tried) for all collations other than the SQL Server sort order collations provided for backward compatibility with older SQL Server versions
    • CHECKSUM appears to always generate a different result with binary sort order collations than it does with dictionary sort order collations

    So, the moral of the story is: if you use CHECKSUM to generate hash values for use in indices or comparisons, be very careful and consistent about the collation of the expressions you pass to CHECKSUM, or you may end up with obscure latent bugs.  

  • Blogging

    Well, I've talked about doing it for some time... now some of my colleagues have taken away my remaining excuses (no time to set it up, etc etc) I can't avoid blogging any longer.

    Most of my work as a developer and architect is in the enterprise application integration arena, using Microsoft technologies, so expect to see posts about BizTalk, Windows Communication Foundation and other topics relevant to integration 'plumbing'.

    Posted Oct 31 2006, 11:41 AM by chrisdi with 1 comment(s)
    Filed under: