When profiling Excel workbook calculation time I often find that Lookups are one of the main causes of performance problems.
So this is the first of 2 posts looking at how to do Lookups faster.
This post concentrates on the difficulties in using Excel’s built-in functions efficiently, and the next post will look at how to develop UDFs that address some of these difficulties and do Lookups faster than the Excel built-in functions.
There are 2 basic search strategies for Lookups:
- Linear Search for unsorted data – Exact Match
- Binary Search for sorted data – Approximate Match
Uusually if your data is unsorted or your data may not contain the value you want to find you will use Exact Match:
Linear Search Lookup Difficulties
VLOOKUP(lookup_value, table_range, col_index, False)
MATCH(lookup_value, column_range, 0)
Using False as the last parameter in Vlookup, and 0 as the last parameter in Match tells these functions to do a Linear Exact Match. The function will start at the first row and look at all the rows in turn until it finds a value that matches the lookup value.
For large data ranges all these comparisons take a lot of calculation time.
Linear Search with missing data: Lookup Value not found
If all the rows have been searched and no value has been found the function returns #N/A.
If the lookup value exists the function will, on average, have to look at half the rows, but if the lookup value does not exist the function will have to look at all the rows.
If you are using Excel 2003 or earlier and you want to avoid the #N/A you can use 2 Lookups and ISERROR(), but that doubles the amount of rows to be looked at if the value exists.
IF(ISERROR(VLOOKUP(lookup_value, table_range, col_index, False),”not found”,VLOOKUP(lookup_value, table_range, col_index, False))
In Excel 2007 and later you can use IFERROR(), which avoids the double lookup.
IFERROR(VLOOKUP(lookup_value, table_range, col_index, False),”not found”)
Whole Column Lookups and the Used Range
If the size of the data you are looking up keeps changing/expanding its convenient to use a whole column as the table range. Excel’s Lookup functions are smart enough not to look in all the extra empty rows, but you have to be careful because they will scan down to the last row in the Used Range. So If you have excess formatting causing a large used range your whole-column Lookups will be super slow with missing data.
Large dependency ranges trigger unnecessary Lookups
Suppose your Lookup range is 10000 rows by 5 columns and you have 500 Vlookups of different values. If nothing changes in the Lookup Range then the 500 Lookups don’t get recalculated. But as soon as 1 single value in the 50000 cells of the Lookup Range changes, every single one of the 500 Lookups will be flagged as needing recalculation, even if 499 or 500 of them don’t need recalculating because they will give exactly the same answer as before.
And by the way this effect will ripple down to all the other formulas that are dependent on the 500 Lookups, because Excel does not stop the calculation dependency chain just because a formula result has not actually changed after a recalc.
Multiple Matching Values
Linear search always returns the first matching value found. There are other ways than LOOKUP of finding the nth or last matching value, but they are all slow.
Binary Search Difficulties: Lookups on Sorted Data
If you tell the Excel functions that your data is sorted (VLOOKUP(,,,TRUE) or -1 or 1 as the last parameter for MATCH) they will use a binary search algorithm.
The good news is that binary search is really fast compared to linear search (LOG2(N) compared to N/2)
The bad news is that the particular implementation of binary search used by Excel does not tell you if the value you are looking for cannot be found, instead it returns the highest value that is less than the lookup value.
Binary Search with missing data: the solution
Fortunately its easy to check for missing data using VLOOKUP or MATCH with sorted data: the trick is to use the Lookup_Value to lookup itself.
IF(VLOOKUP(Lookup_Value, Table_Range,1,TRUE)<>Lookup_Value,”Not Found”, VLOOKUP(Lookup_Value, Table_Range,Col_Index,TRUE))
The first VLOOKUP looks up the Lookup_Value in column 1 and returns the value from column 1.
If this does not equal the Lookup_Value then return “Not Found”, otherwise do another VLOOKUP but this time using the answer Column Index.
Doing 2 Binary Search lookups is faster than doing one Linear Search lookup with anything more than about 20 rows of data.
Whole Column Lookups and the Used Range
In the same way as Linear Search, Binary Search lookup also only searches the used rows when given whole columns. But because binary search is so efficient even searching 1 million rows is fast.
Large Dependency Ranges
Binary search Lookups and their dependents will also be recalculated unnecessarily, but since the actual Lookups are so fast it does not matter so much.
Multiple Matching Values
Excel’s binary search functions return the Last matching value with ascending sort, and the First with descending sort (Match(… -1)).
Using Excel’s Lookup functions efficiently
Here are some additional tips for using Excel’s functions efficiently:
VLOOKUP versus INDEX and MATCH or OFFSET.
I recommend using INDEX and MATCH.
VLOOKUP is slightly faster (approx. 5%), simpler and uses less memory than a combination of MATCH and INDEX or OFFSET.
However the additional flexibility offered by MATCH and INDEX often allows you to make significant timesaving compared to VLOOKUP.
INDEX is very fast and from Excel 97 onwards is a non-volatile function (speeds up recalculation).
OFFSET is also very fast, but it’s a volatile function.
Converting VLOOKUP to INDEX and MATCH.
These statements return the same answer:
Exact Match Lookups returning values from Multiple Columns.
You can often reuse a stored exact MATCH many times.
If you are doing exact lookups on multiple columns you can save a lot of time using one MATCH and many INDEX statements rather than many VLOOKUPs.
Add an extra column for the MATCH to store the result (stored_row).
For each column use:
INDEX(Lookup_Range,stored_row,column_number)Alternatively you can use VLOOKUP in an array formula: this example returns the value from the 2nd and 4th column in the lookup range.
Looking Up a Set of Contiguous Rows or Columns.
You can also return many cells from one Lookup operation.
If you want to lookup a number of contiguous columns then you can use INDEX in an array formula to return multiple columns at once (use 0 as the column number). You can also use INDEX to return multiple rows at once.
This returns columns A to J in the stored row created by a previous MATCH
Looking Up a Rectangular Block of Cells.
You can use MATCH and OFFSET to return a rectangular block of cells as a range.
Multi-dimensional lookup can also be done efficiently.
Two-dimensional table lookup using separate lookup’s on the rows and columns of a table can be efficiently done using an INDEX with two embedded MATCH functions.This example assumes a table in A1:Z1000 with column A containing the row identifier and row 1 containing the column identifier. Both the row and column identifiers are sorted ascending.
In large spreadsheets you often need to lookup using multiple indexes, such as looking up product volumes in a country.
The simple way to do this is to concatenate the indexes and lookup using concatenated lookup values. This is inefficient when the data is sorted for two reasons:
- Concatenating strings is a calculation-intensive operation.
- The lookup will cover a large range.
It is often more efficient to calculate a subset range for the lookup: for example by using COUNTIF to count the number of rows for each country and then calculating the first and last row for each country from the counts, and then looking up the product within that range. See SUMIF Example or the FastExcel sample problem for an example of using this technique.
If you need to lookup the table to use as well as the row and the column here are some techniques you can use, focussing on how to make Excel lookup/choose the table.
If each table you want to lookup (the third dimension) is stored as a set of range names, or as a table of text strings that represent ranges, then you may be able to use INDIRECT or CHOOSE.
Using CHOOSE and range names can be a very efficient method, and it is not volatile, but it is best suited to only a small number of tables:
The example above dynamically uses TableLookup_Value to choose which range name (TableName1, TableName2, …) to use for the lookup table.
INDEX(INDIRECT(“Sheet” & TableLookup_Value & “!$B$2:$Z$1000″),MATCH(RowLookup_Value,$A$2:$A$1000),MATCH(colLookup_value,$B$1:$Z$1))
This example uses INDIRECT and TableLookup_Value to dynamically create the sheet name to use for the lookup table. This method has the advantage of being simple and can handle a large number of tables, but because INDIRECT is a volatile function the lookup will be calculated at every calculation even if none of the data has changed.
You could also use VLOOKUP to find the name of the sheet or the text string to use for the table, and then use INDIRECT to convert the resulting text into a range:
Another technique is to aggregate all your tables into one giant table, but with an additional column which identifies the individual tables. You can then use the techniques for multiple-index lookup above.
AVLOOKUP, AMATCH, MATCH,VLOOKUP and HLOOKUP allow you to use the wildcard characters ? (Any single character) and * (no character or any number of characters) on alphabetic exact matches. Sometimes this can avoid multiple matches.
- Linear Search Lookups are slow.
- Sorting your data and using the Binary Search options on VLOOKUP and MATCH is very fast.
- It’s easy and fast to detect missing values using Binary Search on sorted data.
- Lookups with large Table_Ranges often get recalculated unnecessarily.
- Lookups on whole columns are inefficient with large excess used range, but this does not matter with Binary Search.
- Excel’s Lookup functions do not provide good solutions with data containing multiple matching values.
For more information on speeding up Lookups see my Lookups Page
The next post will look at ways of bypassing and simplifying these difficulties by developing a UDF.