snowflake-cloud-data-platform - Comparing string values in SQL Server with Snowflake

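I am trying to compare string values between Snowflake and SQL Server. I run into a problem when the strings contain Unicode characters: the MD5 hash computed by SQL Server differs from the one Snowflake produces.

What is the best way to resolve this discrepancy for comparison purposes?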

Sample code

SQL Server

/*  SQL SERVER  
        LOWER and CONVERT are used to produce the same hash format as Snowflake
*/
SELECT 
LOWER(
    CONVERT(VARCHAR(1000), 
        HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)))
    , 2)
) AS mismatch;
SELECT 
LOWER(
    CONVERT(VARCHAR(1000), 
        HASHBYTES('MD5', CAST('md5_algtest' AS VARCHAR(50)))
    , 2)
) AS matches;

Snowflake

/*  SNOWFLAKE   */
SELECT md5('md5_alg“test”') AS mismatch;
SELECT md5('md5_algtest') AS match;

Answer 1

Microsoft SQL Server uses UTF-16 encoding to store Unicode characters (nchar/nvarchar). Snowflake stores all data in the UTF-8 character set.

So you need to convert 'md5_alg“test”' to UTF-8 and then compute the hash.
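To see why the encoding matters, here is a minimal Python sketch (Python is used only so the bytes can be inspected outside either database). It assumes an nvarchar value is laid out as UTF-16 little-endian and that Snowflake hashes the UTF-8 bytes; the helper name `md5_hex` is made up for this illustration.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as lowercase hex."""
    return hashlib.md5(data).hexdigest()

# The string from the question, with curly quotes U+201C and U+201D.
text = "md5_alg\u201ctest\u201d"

utf16 = text.encode("utf-16-le")  # assumed byte layout of an nvarchar value
utf8 = text.encode("utf-8")       # assumed byte layout hashed by Snowflake

# Same characters, different bytes, therefore different MD5 values.
print(len(utf16), len(utf8))
print(md5_hex(utf16) == md5_hex(utf8))
```

MD5 works on bytes, not characters, so the two systems only agree once they hash the same byte sequence.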

I found a function that does this: https://gist.github.com/sevaa/f084a0a5a994c3bc28e518d5c708d5f6

-- Encodes an nvarchar (UTF-16) string as UTF-8 bytes, one code point at a time.
create function [dbo].[ToUTF8](@s nvarchar(max))
returns varbinary(max)
as
begin
    declare @i int = 1, @n int = datalength(@s)/2, @r varbinary(max) = 0x, @c int, @c2 int, @d varbinary(4)
    while @i <= @n
    begin
        set @c = unicode(substring(@s, @i, 1))
        if (@c & 0xFC00) = 0xD800
        begin
            set @i += 1
            if @i > @n
                return cast(cast('Malformed UTF-16 - two nchar sequence cut short' as int) as varbinary)
            set @c2 = unicode(substring(@s, @i, 1))
            if (@c2 & 0xFC00) <> 0xDC00
                return cast(cast('Malformed UTF-16 - continuation missing in a two nchar sequence' as int) as varbinary)
            set @c = (((@c & 0x3FF) * 0x400) | (@c2 & 0x3FF)) + 0x10000
        end

        if @c < 0x80
            set @d = cast(@c as binary(1))
        if @c >= 0x80 and @c < 0x800 
            set @d = cast(((@c * 4) & 0xFF00) | (@c & 0x3F) | 0xC080 as binary(2))
        if @c >= 0x800 and @c < 0x10000
            set @d = cast(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xe08080 as binary(3))
        if @c >= 0x10000
            set @d = cast(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0x3F0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xf0808080 as binary(4))
            
        set @r += @d
        set @i += 1
    end
    return @r
end
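If you want to sanity-check the bit arithmetic in that function without a SQL Server instance, the same steps can be sketched in Python and compared against the built-in UTF-8 encoder. This is a rough port for illustration, not the gist author's code; `to_utf8` is a hypothetical name, and it raises an exception where the T-SQL version forces a conversion error.

```python
import struct

def to_utf8(s: str) -> bytes:
    """Walk UTF-16 code units and emit UTF-8 bytes, mirroring the T-SQL loop."""
    raw = s.encode("utf-16-le", "surrogatepass")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    out = bytearray()
    i = 0
    while i < len(units):
        c = units[i]
        if (c & 0xFC00) == 0xD800:  # high surrogate: combine with the next unit
            i += 1
            if i >= len(units):
                raise ValueError("Malformed UTF-16: surrogate pair cut short")
            c2 = units[i]
            if (c2 & 0xFC00) != 0xDC00:
                raise ValueError("Malformed UTF-16: low surrogate missing")
            c = (((c & 0x3FF) << 10) | (c2 & 0x3FF)) + 0x10000
        if c < 0x80:        # 1 byte
            out.append(c)
        elif c < 0x800:     # 2 bytes
            out += bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
        elif c < 0x10000:   # 3 bytes
            out += bytes([0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F),
                          0x80 | (c & 0x3F)])
        else:               # 4 bytes
            out += bytes([0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F),
                          0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)])
        i += 1
    return bytes(out)

# Compare against Python's own encoder on a few sample strings.
for sample in ["md5_algtest", "md5_alg\u201ctest\u201d", "\u00e9", "\U0001F600"]:
    print(to_utf8(sample) == sample.encode("utf-8"))
```

The curly quotes in the question fall in the three-byte range (0x800 to 0xFFFF), which is the branch that matters for this particular string.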

Once this function is created, you can compute the MD5, and it produces the same value as Snowflake:

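-- The N prefix makes the literals nvarchar, so the curly quotes do not
-- depend on the database's default code page.
SELECT 
LOWER(
    CONVERT(VARCHAR(32), 
        HASHBYTES('MD5', [dbo].[ToUTF8](N'md5_alg“test”')  )
    , 2)
) AS mismatch,
LOWER(
    CONVERT(VARCHAR(32), 
        HASHBYTES('MD5',  [dbo].[ToUTF8](N'md5_algtest')  )
    , 2)
) AS matches;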
mismatch                          matches
80381678898496aba31245b01f40dd25  cb95937a11e610f6aaf0d06666bde771
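As for the LOWER(CONVERT(..., 2)) wrapping used in these queries: HASHBYTES returns 16 raw bytes, CONVERT with style 2 renders them as uppercase hex without a 0x prefix, and LOWER brings that in line with the lowercase 32-character hex string Snowflake's md5 returns. A small Python sketch of the same formatting steps follows; the variable names are made up for illustration.

```python
import hashlib

# 16 raw bytes, comparable to what HASHBYTES('MD5', ...) returns.
digest = hashlib.md5("md5_algtest".encode("utf-8")).digest()

# Analogous to CONVERT(VARCHAR(32), <binary>, 2): uppercase hex, no 0x prefix.
style2 = digest.hex().upper()

# Analogous to wrapping the result in LOWER(...).
snowflake_like = style2.lower()

print(len(digest), len(style2))
print(snowflake_like == hashlib.md5(b"md5_algtest").hexdigest())
```

This also shows why VARCHAR(32) is wide enough for the converted value: an MD5 digest is always 16 bytes, which is 32 hex characters.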

Answer 2

For SQL Server 2019 and later, the following solution works for me.

SELECT 
LOWER(
    CONVERT(VARCHAR(1000), 
        HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)))
    , 2)
) AS mismatch,
LOWER(
    CONVERT(VARCHAR(1000), 
        HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8)
    , 2)
) AS match

https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-utf-8-support-for-sql-server/ba-p/734928
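A rough way to picture what the UTF-8 collation changes: without it, a varchar value is stored in the code page of its collation, while with a _UTF8 collation the same characters are stored as UTF-8 bytes. The Python sketch below assumes the default collation uses code page 1252 (Windows-1252), which may not hold for your database.

```python
import hashlib

text = "md5_alg\u201ctest\u201d"
ascii_text = "md5_algtest"

# Assumption: the non-UTF-8 varchar is stored using code page 1252.
legacy = text.encode("cp1252")
utf8 = text.encode("utf-8")

# The curly quotes are one byte each in cp1252 but three bytes each in UTF-8.
print(legacy.hex())
print(utf8.hex())
print(hashlib.md5(legacy).hexdigest() == hashlib.md5(utf8).hexdigest())

# ASCII-only text is byte-identical in both encodings, so its hashes agree.
print(ascii_text.encode("cp1252") == ascii_text.encode("utf-8"))
```

That would explain why the plain 'md5_algtest' comparison in the question matched from the start while the string with curly quotes did not.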