Editorials

Fuzzy Comparison

For decades Microsoft SQL Server supports one of the coolest comparison operators available; SOUNDEX. SOUNDEX is used to convert strings into a representative code, allowing words with different spellings to be considered the same. This is very useful in imprecise languages, such as English, having multiple ways to spell the same thing.

The way the SOUNDEX function works is that it strips out all the vowels from a word, and creates a code based on the remaining consonants. For example, the following words all have the same SOUNDEX generated code:

SELECT SOUNDEX('TAYLOR'), SOUNDEX('TAILOR'), SOUNDEX('TYLER')

The result for each is T460. This makes sense because they all have the same consonants in the same order. Stripping out the vowel characters A, Y, O, I and E where they are found results in TLR. In my collation which is not case specific, the results of SOUNDEX are identical, regardless of case. I don’t know how it works when you use a collation that considers capital and lower case letters to be different.

If you are going to use SOUNDEX as a comparison method, you will most likely want to convert your terms into a column having the SOUNDEX converted value, so that your comparisons may use an index created on the column. If you want to convert your search term using SOUNDEX, and then convert words from your database using SOUNDEX at the time of the query, an index will not be used, as demonstrated in this example

SELECT *

FROM Persons

WHERE SOUNDEX(LastName) = SOUNDEX(‘TAYLOR’)

Using this technique, SOUNDEX is not really the comparison operator. We are using Equals as the comparison. SOUNDEX is simply massaging the string to another form for comparison. There is, however, another comparison operation designed specifically to be used in conjunction with SOUNDEX. The comparison operation is DIFFERENCE.

According to Technet, the DIFFERENCE comparison takes two SOUNDEX values as input, and returns a probability of sameness on a scale from 0 to 4. The higher the number, the more likely to two different strings sound alike.

I ran a number of different comparisons of the DIFFERENCE function and was only able to produce a result of 3 or 4. Regardless of how different two words were for compairison, they all came out 3.

Here are some of my samples with a result of 3 or 4, but nothing lower. Some of these words have a very different sound, and would expect a lower score using the DIFFERENCE function.

Difference Comparison Results
Test Value SOUNDEX SOUNDEX DIFFERENCE
DIFFERENCE(SOUNDEX(‘NO’), SOUNDEX(‘SMITH’)) N000 S530 3
DIFFERENCE(SOUNDEX(‘TAYLOR’), SOUNDEX(‘TATE’)) T460 T300 4
DIFFERENCE(SOUNDEX(‘HE’), SOUNDEX(‘UNTIL’)) H000 U534 3
DIFFERENCE(SOUNDEX(‘Horse’), SOUNDEX(‘House’)) H620 H200 4
DIFFERENCE(SOUNDEX(‘Fly’), SOUNDEX(‘Flee’)) F400 F400 4
DIFFERENCE(SOUNDEX(‘House’), SOUNDEX(‘Fly’)) H200 F400 3

From my limited comparison, it appears that anything under 4 is not relevant. I made these tests using SQL Server 2012. I’ll try using some different versions and see if I get any different results and post those in comments later.

Based on these comparisons, it appears you could simply compare the SOUNDEX calculated values using the EQUALS comparison, and get a large majority of similar words. If there is more differentiation in DIFFERENCE than what I am seeing, then it would be worthwhile to use that method instead. However, if you are using it against a set of data, once again, an index would not be take into consideration. It would have to evaluate each row individually. So, if you wanted all the rows having a difference of 3 or greater, then it would have to calculate each row, and then do the comparison.

Do you have an application needing to have a less strict comparison to find words that have a similar spelling. Perhaps SOUNDEX is for you. Are you using SOUNDEX or DIFFERENCE? Share your experience in a comment or by Email to btaylor@sswug.org.

Cheers,

Ben