I often run into the problem of having to compare two lists in Excel, to see what items are in the list to look for that can’t be found in the list to look in.
So when I saw Dick Kusleika’s post on Daily Dose of Excel I thought it was time I got my act together and worked out the best way of tackling this problem.
I decided to write a VBA UDF called IsInList2 that called 6 methods:
- Sorting the the LookIn list and using Binary Search to compare the items in the LookFor list
- Using linear search of the LookIn list for each item in the LookFor list
- Create a Collection containing the LookIn list and check it for each item in the LookFor list
- Create a Dictionary containing the LookIn list and check it for each item in the LookIn list
- Use an already sorted LookIn List and Binary Search
- Look for partial matches using InStr
Overview of the IsInList 2 UDF
The IsInList2 UDF is written as an array function returning an array of True/False. Its designed to be called as a multi-cell array function entered in the column next to the LookFor list, so that you can Filter for False to find all the items in the LookFor list that don’t exits in the LookIn list.
For simplicity the UDF is written assuming that both lists are Ranges with at least 2 items: so the first task is to get the values from the Ranges into variant arrays.
(I have ignored the optimisation of using MATCH directly on the Range).
Then the output array is created as the smaller of the calling cells and the the LookFor list
Then if its an exact match the data is either Sorted, added to a Collection or to a Dictionary
Then the function iterates through the LookFor list using the appropriate method subroutine and stores the result in the output array
The QuickSort Sub
Here is the code for a QuickSort of a variant containing an array. The Quicksort would be substantially faster if it was strongly typed and handled a vector rather than a 1-column matrix.
Sub QSortVar(InputValues As Variant, jStart As Long, jEnd As Long)
Dim jStart2 As Long
Dim jEnd2 As Long
Dim v1 As Variant
Dim v2 As Variant
jStart2 = jStart
jEnd2 = jEnd
‘
‘ choose random pivot
‘
v1 = InputValues((jStart + (jEnd – jStart) * Rnd()), 1)
While jStart2 < jEnd2
While InputValues(jStart2, 1) < v1 And jStart2 < jEnd
jStart2 = jStart2 + 1
Wend
While InputValues(jEnd2, 1) > v1 And jEnd2 > jStart
jEnd2 = jEnd2 – 1
Wend
If jStart2 < jEnd2 Then
v2 = InputValues(jStart2, 1)
InputValues(jStart2, 1) = InputValues(jEnd2, 1)
InputValues(jEnd2, 1) = v2
End If
If jStart2 <= jEnd2 Then
jStart2 = jStart2 + 1
jEnd2 = jEnd2 – 1
End If
Wend
If jEnd2 > jStart Then QSortVar InputValues, jStart, jEnd2
If jStart2 < jEnd Then QSortVar InputValues, jStart2, jEnd
End Sub
The Binary Search Function
For simplicity this function is also written to handle a variant containing an array, and would be faster if more strongly typed.
Its not designed as a UDF: its called from IsInList2.
Function BSearchMatch(LookupValue As Variant, LookupArray As Variant) As Boolean
‘
‘ use binary search to find if lookupvalue exists in lookuparray
‘
Dim low As Long
Dim high As Long
Dim middle As Long
Dim varMiddle As Variant
Dim jRow As Long
‘
jRow = 1
BSearchMatch = False
low = 0
high = UBound(LookupArray)
Do While high – low > 1
middle = (low + high) \ 2
varMiddle = LookupArray(middle, 1)
If varMiddle >= LookupValue Then
high = middle
Else
low = middle
End If
Loop
‘
If (low > 0 And high <= UBound(LookupArray)) Then
If LookupArray(high, 1) > LookupValue Then
jRow = low
Else
jRow = high
End If
End If
If LookupValue = LookupArray(jRow, 1) Then BSearchMatch = True
End Function
The Exact Linear Search Function
LMatchExactV is not designed as a UDF: its called from IsInList2.
Function LMatchExactV(LookupValue As Variant, LookupArray As Variant) As Boolean
‘
‘ use linear search to find if LookupValue exists in LookupArray
‘ LookupArray must be a a 2-dimensional variant array of N rows and 1 column
‘
Dim j As Long
‘
LMatchExactV = False
For j = 1 To UBound(LookupArray)
If LookupValue = LookupArray(j, 1) Then
LMatchExactV = True
Exit For
End If
Next j
End Function
The Partial Match Linear Search Function
This function use InStr to check for partial match. It is not designed as a UDF: its called from IsInList2.
Function LMatchInV(LookupValue As Variant, LookupArray As Variant) As Boolean
‘
‘ use linear search and Instr to find if LookupValue is within any value in LookupArray
‘ LookupArray must be a a 2-dimensional variant array of N rows and 1 column
‘
Dim j As Long
Dim strLook As String
‘
LMatchInV = False
strLook = CStr(LookupValue)
For j = 1 To UBound(LookupArray)
If InStr(LookupArray(j, 1), strLook) > 0 Then
LMatchInV = True
Exit For
End If
Next j
End Function
The IsInList2 Array Function
Because the function uses the Dictionary Object from VBScript you need to add a reference to the Microsoft Scripting Runtime.
The Function takes 2 optional parameters which control the method used:
jSorted – which Sort/Lookup?Match method to use
FindExact – True for an exact match, False for a partial match
Public Function IsInList2(LookFor As Variant, LookIn As Variant, Optional jSorted As Long = 0, Optional FindExact As Boolean = True)
‘
‘ jSorted
‘ =0 data not sorted but sort it
‘ =1 data already sorted – use binary search
‘ =-1 use linear search
‘ =2 use collection
‘ =3 use dictionary
‘
Dim nLookFor As Long
Dim nLookIn As Long
Dim nOut As Long
Dim vOut() As Variant
Dim j As Long
Dim coll As New Collection
Dim dict As New dictionary
‘
‘ coerce ranges to values
‘
LookFor = LookFor.Value2
LookIn = LookIn.Value2
‘
‘ get row counts
‘
nLookFor = UBound(LookFor)
nLookIn = UBound(LookIn)
nOut = Application.Caller.Rows.Count
‘
If nLookFor < nOut Then nOut = nLookFor
ReDim vOut(nOut, 1) ‘ create output array
‘
If FindExact Then
If jSorted = 0 Then
QSortVar LookIn, LBound(LookIn), UBound(LookIn) ‘ quicksort
jSorted = 1
ElseIf jSorted = 2 Then
On Error Resume Next
For j = 1 To nLookIn
coll.Add LookIn(j, 1), CStr(LookIn(j, 1)) ‘ collection
Next j
ElseIf jSorted = 3 Then
On Error Resume Next
For j = 1 To nLookIn
dict.Add LookIn(j, 1), LookIn(j, 1) ‘ dictionary
Next j
End If
On Error GoTo 0
End If
‘
For j = 1 To nOut
If Not FindExact Then
vOut(j, 1) = LMatchInV(LookFor(j, 1), LookIn) ‘instr linear search
ElseIf jSorted = 1 Then
vOut(j, 1) = BSearchMatch(LookFor(j, 1), LookIn) ‘ binary search
ElseIf jSorted = 2 Then
‘
‘ use collection
‘
On Error Resume Next
Err.Clear
vOut(j, 1) = coll.Item(CStr(LookFor(j, 1)))
If CLng(Err.Number) = 5 Then
vOut(j, 1) = False
Else
vOut(j, 1) = True
End If
ElseIf jSorted = 3 Then
vOut(j, 1) = dict.Exists(LookFor(j, 1)) ‘ Dictionary
ElseIf jSorted = -1 Then
vOut(j, 1) = LMatchExactV(LookFor(j, 1), LookIn) ‘ exact linear
Else
vOut(j, 1) = CVErr(xlErrValue)
End If
Next j
‘
IsInList2 = vOut ‘ return output
End Function
Performance Comparison
The performance tests are all done using ranges containing a character followed by 5 digit random integer numbers.
The timings are all in Milliseconds.
LookFor and LookIn give the number of rows.
Linear Search timings on smaller numbers of rows are slightly pessimistic because of a high % of miss-matches
BSearchOnly gives timings for Binary Search on already sorted data.
Partial gives timings for Linear Search with partial matching using InStr.
Because Excel VBA array formulas cannot return more than 64K rows you cannot use the function directly for more than 64K rows of LookFor without using more than one array formula.
|
LookFor
|
LookIn
|
Linear
|
QSortBsearch
|
Collection
|
Dict
|
BsearchOnly
|
Partial
|
| 2 |
2 |
0.29
|
0.29
|
0.29
|
0.46
|
0.29
|
0.29
|
| 50 |
50 |
4.42
|
4.14
|
4.05
|
4.06
|
3.91
|
4.43
|
| 200 |
200 |
12.2
|
5.59
|
5.08
|
4.40
|
4.49
|
14.2
|
| 500 |
500 |
55.5
|
8.9
|
7.3
|
4.9
|
5.66
|
68.0
|
| 2000 |
2000 |
824
|
27.0
|
19.9
|
8.03
|
12.2
|
995
|
| 50000 |
50000 |
-
|
800
|
515
|
252
|
279
|
-
|
| 50000 |
200000 |
-
|
2583
|
1290
|
1044
|
350
|
-
|
| 50000 |
1048756 |
-
|
14204
|
6720
|
23650
|
-
|
-
|
Overall the sequence from fastest method to slowest method is:
- Dictionary
- Binary Search Only (but the data has to already be sorted)
- Collection
- Quick Sort plus Binary Search
- Linear Search
- Partial Linear Search
Conclusion
The best method is to use the VBScript Dictionary object, unless you have more than 500K rows after which the Collection object starts to win.
The coding for Dictionary is very simple and it has an Exists method which Collection lacks.
Below about 2000 rows you probably won’t notice the difference between any of the methods except for Linear search.
So now I just have to try the C++ XLL version … I reckon that will beat Dictionary! (Yes, looks like XLL wins: 1923 millisecs for 50000 in 1048756)