Comparing Two Lists – VBA UDF shootout between Linear Search, Binary Search, Collection and Dictionary

I often run into the problem of having to compare two lists in Excel, to see what items are in the list to look for that can’t be found in the list to look in.

So when I saw Dick Kusleika’s post on Daily Dose of Excel I thought it was time I got my act together and worked out the best way of tackling this problem.

I decided to write a VBA UDF called IsInList2 that called 6 methods:

  • Sorting the the LookIn list and using Binary Search to compare the items in the LookFor list
  • Using linear search of the LookIn list for each item in the LookFor list
  • Create a Collection containing the LookIn list and check it for each item in the LookFor list
  • Create a Dictionary containing the LookIn list and check it for each item in the LookIn list
  • Use an already sorted LookIn List and Binary Search
  • Look for partial matches using InStr

Overview of the IsInList 2 UDF

The IsInList2 UDF is written as an array function returning an array of True/False. Its designed to be called as a multi-cell array function entered in the column next to the LookFor list, so that you can Filter for False to find all the items in the LookFor list that don’t exits in the LookIn list.

For simplicity the UDF is written assuming that both lists are Ranges with at least 2 items: so the first task is to get the values from the Ranges into variant arrays.
(I have ignored the optimisation of using MATCH directly on the Range).

Then the output array is created as the smaller of the calling cells and the the LookFor list

Then if its an exact match the data is either Sorted, added to a Collection or to a Dictionary

Then the function iterates through the LookFor list using the appropriate method subroutine and stores the result in the output array

The QuickSort Sub

Here is the code for a QuickSort of a variant containing an array. The Quicksort would be substantially faster if it was strongly typed and handled a vector rather than a 1-column matrix.

Sub QSortVar(InputValues As Variant, jStart As Long, jEnd As Long)
Dim jStart2 As Long
Dim jEnd2 As Long
Dim v1 As Variant
Dim v2 As Variant
jStart2 = jStart
jEnd2 = jEnd

‘ choose random pivot

v1 = InputValues((jStart + (jEnd – jStart) * Rnd()), 1)
While jStart2 < jEnd2
While InputValues(jStart2, 1) < v1 And jStart2 < jEnd
jStart2 = jStart2 + 1
Wend
While InputValues(jEnd2, 1) > v1 And jEnd2 > jStart
jEnd2 = jEnd2 – 1
Wend
If jStart2 < jEnd2 Then
v2 = InputValues(jStart2, 1)
InputValues(jStart2, 1) = InputValues(jEnd2, 1)
InputValues(jEnd2, 1) = v2
End If
If jStart2 <= jEnd2 Then
jStart2 = jStart2 + 1
jEnd2 = jEnd2 – 1
End If
Wend
If jEnd2 > jStart Then QSortVar InputValues, jStart, jEnd2
If jStart2 < jEnd Then QSortVar InputValues, jStart2, jEnd
End Sub

The Binary Search Function

For simplicity this function is also written to handle a variant containing an array, and would be faster if more strongly typed.
Its not designed as a UDF: its called from IsInList2.

Function BSearchMatch(LookupValue As Variant, LookupArray As Variant) As Boolean

‘ use binary search to find if lookupvalue exists in lookuparray

Dim low As Long
Dim high As Long
Dim middle As Long
Dim varMiddle As Variant
Dim jRow As Long

jRow = 1
BSearchMatch = False
low = 0
high = UBound(LookupArray)
Do While high – low > 1
middle = (low + high) \ 2
varMiddle = LookupArray(middle, 1)
If varMiddle >= LookupValue Then
high = middle
Else
low = middle
End If
Loop

If (low > 0 And high <= UBound(LookupArray)) Then
If LookupArray(high, 1) > LookupValue Then
jRow = low
Else
jRow = high
End If
End If
If LookupValue = LookupArray(jRow, 1) Then BSearchMatch = True
End Function

The Exact Linear Search Function

LMatchExactV is not designed as a UDF: its called from IsInList2.

Function LMatchExactV(LookupValue As Variant, LookupArray As Variant) As Boolean

‘ use linear search to find if LookupValue exists in LookupArray
‘ LookupArray must be a a 2-dimensional variant array of N rows and 1 column

Dim j As Long

LMatchExactV = False
For j = 1 To UBound(LookupArray)
If LookupValue = LookupArray(j, 1) Then
LMatchExactV = True
Exit For
End If
Next j
End Function

The Partial Match Linear Search Function

This function use InStr to check for partial match. It is not designed as a UDF: its called from IsInList2.

Function LMatchInV(LookupValue As Variant, LookupArray As Variant) As Boolean

‘ use linear search and Instr to find if LookupValue is within any value in LookupArray
‘ LookupArray must be a a 2-dimensional variant array of N rows and 1 column

Dim j As Long
Dim strLook As String

LMatchInV = False
strLook = CStr(LookupValue)
For j = 1 To UBound(LookupArray)
If InStr(LookupArray(j, 1), strLook) > 0 Then
LMatchInV = True
Exit For
End If
Next j
End Function

The IsInList2 Array Function

Because the function uses the Dictionary Object from VBScript you need to add a reference to the Microsoft Scripting Runtime.

The Function takes 2 optional parameters which control the method used:

jSorted – which Sort/Lookup?Match method to use

FindExact – True for an exact match, False for a partial match

Public Function IsInList2(LookFor As Variant, LookIn As Variant, Optional jSorted As Long = 0, Optional FindExact As Boolean = True)

‘ jSorted
‘   =0 data not sorted but sort it
‘   =1 data already sorted – use binary search
‘   =-1 use linear search
‘   =2 use collection
‘   =3 use dictionary

Dim nLookFor As Long
Dim nLookIn As Long
Dim nOut As Long
Dim vOut() As Variant
Dim j As Long
Dim coll As New Collection
Dim dict As New dictionary

‘ coerce ranges to values

LookFor = LookFor.Value2
LookIn = LookIn.Value2

‘ get row counts

nLookFor = UBound(LookFor)
nLookIn = UBound(LookIn)
nOut = Application.Caller.Rows.Count

If nLookFor < nOut Then nOut = nLookFor
ReDim vOut(nOut, 1)  ‘ create output array

If FindExact Then
If jSorted = 0 Then
QSortVar LookIn, LBound(LookIn), UBound(LookIn)    ‘ quicksort
jSorted = 1
ElseIf jSorted = 2 Then
On Error Resume Next
For j = 1 To nLookIn
coll.Add LookIn(j, 1), CStr(LookIn(j, 1))   ‘ collection
Next j
ElseIf jSorted = 3 Then
On Error Resume Next
For j = 1 To nLookIn
dict.Add LookIn(j, 1), LookIn(j, 1)    ‘ dictionary
Next j
End If
On Error GoTo 0
End If

For j = 1 To nOut
If Not FindExact Then
vOut(j, 1) = LMatchInV(LookFor(j, 1), LookIn)    ‘instr linear search
ElseIf jSorted = 1 Then
vOut(j, 1) = BSearchMatch(LookFor(j, 1), LookIn)    ‘ binary search
ElseIf jSorted = 2 Then

‘ use collection

On Error Resume Next
Err.Clear
vOut(j, 1) = coll.Item(CStr(LookFor(j, 1)))
If CLng(Err.Number) = 5 Then
vOut(j, 1) = False
Else
vOut(j, 1) = True
End If
ElseIf jSorted = 3 Then
vOut(j, 1) = dict.Exists(LookFor(j, 1))    ‘ Dictionary
ElseIf jSorted = -1 Then
vOut(j, 1) = LMatchExactV(LookFor(j, 1), LookIn)    ‘ exact linear
Else
vOut(j, 1) = CVErr(xlErrValue)
End If
Next j

IsInList2 = vOut    ‘ return output
End Function

Performance Comparison

The performance tests are all done using ranges containing a character followed by 5 digit random integer numbers.

The timings are all in Milliseconds.
LookFor and LookIn give the number of rows.

Linear Search timings on smaller numbers of rows are slightly pessimistic because of a high % of miss-matches
BSearchOnly gives timings for Binary Search on already sorted data.
Partial gives timings for Linear Search with partial matching using InStr.

Because Excel VBA array formulas cannot return more than 64K rows you cannot use the function directly for more than 64K rows of LookFor without using more than one array formula.

LookFor

LookIn

Linear

QSortBsearch

Collection

Dict

BsearchOnly

Partial

2 2

0.29

0.29

0.29

0.46

0.29

0.29

50 50

4.42

4.14

4.05

4.06

3.91

4.43

200 200

12.2

5.59

5.08

4.40

4.49

14.2

500 500

55.5

8.9

7.3

4.9

5.66

68.0

2000 2000

824

27.0

19.9

8.03

12.2

995

50000 50000

-

800

515

252

279

-

50000 200000

-

2583

1290

1044

350

-

50000 1048756

-

14204

6720

23650

-

-

Overall the sequence from fastest method to slowest method is:

  • Dictionary
  • Binary Search Only (but the data has to already be sorted)
  • Collection
  • Quick Sort plus Binary Search
  • Linear Search
  • Partial Linear Search

Conclusion

The best method is to use the VBScript Dictionary object, unless you have more than 500K rows after which the Collection object starts to win.

The coding for Dictionary is very simple and it has an Exists method which Collection lacks.

Below about 2000 rows you probably won’t notice the difference between any of the methods except for Linear search.

So now I just have to try the C++ XLL version … I reckon that will beat Dictionary! (Yes, looks like XLL wins: 1923 millisecs for 50000 in 1048756)

About these ads

12 Responses to “Comparing Two Lists – VBA UDF shootout between Linear Search, Binary Search, Collection and Dictionary”

  1. Ross Says:

    This is a great bit of work Charles,

    I can now give myself a pat on the back, as I correctly picked the collection class for my 1million rows project! I must have the coding instincts of some sort of binary spiderman, I was biten by a radio active byte…

    or i just lucked out!!!

    Thanks Charles, as always excellent post!

    • fastexcel Says:

      Hmm… If it was me I would have said too lazy to add the VBScripting reference and lookup the Dictionary syntax …

      • Ross Says:

        Well truth be told, I would have liked to use as Dictionary, but I wanted to use as little “none native” as possible… although in the same app I’ll have to install an .OXC, so I’m not sure what sense that makes!

  2. jeff Weir Says:

    Hi Charles. Do you have a worksheet you could post with this set up somewhere?

    I cut and pasted the code into a module, and cleaned the code, but for some reason I can’t get any of the approaches to return any matches, even though I know there are matches.

    Reason I’m interested is i have a couple of alternative approaches that I’d like to try out.

    Hope you can help

    Jeff

  3. jeff Weir Says:

    Ahhh…I didn’t realise that you need an OPTION BASE 1 statement in the module. Now it seems to be working. Still, let me know if you can share a workbook.
    Cheers

    • fastexcel Says:

      I always (well nearly always) use Option Base 1 in VBA because most Excel objects apart from Forms use 1-base. But zero-based everything in C … OK so I lack consistency.

  4. jeff Weir Says:

    Me again. I get a type mismatch on UBound(LookupArray) for BSearchMatch.

  5. jeff Weir Says:

    Okay, I get BSearchMatch to work as a UDF if I make these two changes:

    Function BSearchMatch(LookupValue As Range, LookupRange As Range) As Boolean

    LookupArray = LookupRange

    i.e. explicitly type the input as a range, and then send that range to a variant array.

    Am I right that as you wrote the function above, it WONT work as a UDF?

    Sorry for the questions

    • fastexcel Says:

      Yes the only function written as a UDF is IsInList2 and there is no error checking at all: its all designed as a speed test rather than production code!!! So IsInList2 only works when fed Multi-cell ranges, and there is no checking whether the functions are given ranges or arrays or vectors or scalars etc etc. And the BSearch routine assumes a binary case-sensitive Quicksort rather than an Excel Sort, and it probably won’t work with Unicode etc.

  6. jeff Weir Says:

    Okay, my apologies…I didn’t understand until just now that BSearchMatch is NOT one of the tests, but instead is a function CALLED by one of the tests. Sorry for all the comments. Feel free to delete them.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 34 other followers

%d bloggers like this: