Returning to the subject of finding the missmatches between 2 lists I want to compare using VLOOKUP with using SpeedTools COMPARE.LISTS.
My test data consists of 2 lists of 500000 6-digit numbers. Most of these numbers match, but 5000 of them are different. The lists are not sorted. What I want to do is:
- Filter out the missmatches showing ** for each miss
- Count the missmatches
- Produce a list of the missmatches
The first list is in A2:A500001 and the second list is in D2:D500001
Using SpeedTools COMPARE.LISTS
COMPARE.LISTS allows you to control what kind of output you want from the comparison.
- A count of either the matches or the miss-matches
- Either True/False or Blank/** (** means not found)
- A count of the matches and a list of the matches
- A count of the miss-matches and a list of the missmatches
To get just the count of miss-matches you enter COMPARE.LISTS into a single cell as an ordinary (non-array) formula:
This formula looks for each of the cells in D2:D500001 in the list A2:A5000001 and counts the number of items that can’t be found (5000 in this case).
And it only takes 0.6 seconds on my PC!
Thats fast enough for you to add the formula as a safety check that all items match.
To get a count and a list of the missmatches you enter the same formula as a multi-cell array formula (select a vertical range of cells, type the formula in the formula bar and press Control-Shift-Enter). The count appears in the first row and the following rows conatin the list of missing items. And it only takes 0.7 seconds on my PC!
To filter out the rows containing the missmatches you enter the following formula into the 500000 cells in E2:E500001 as a multi-cell array formula, and then filter for **:
This formula checks each of the cells in D2:D500001 against the range A2:A500001 and returns either blank for a hit or ** for a miss to the 500000 cels in E2:E500001. And it only takes 0.9 seconds on my PC! Thats fast enough for you to make corrections and recalculate until all the errors are fixed.
You can also get counts and lists of matching items as well as missing items.
And there is an option to a case-sensistive text compare if you need to find missmatches caused by upper-lower case differences.
I can use the unsorted (exact match) range lookup option of VLOOKUP for each number I want to check, and it will return #N/A if it can’t be found in the other list: or I can check for the #N/A error and show ** for the missmatches.
and copy down for 500000 rows.
This works, but its very slow: it takes over 500 seconds on my quad-core machine even with multi-threaded calculation, and on a single core machine it would be up to 4 times slower!
But thats not surprising if you think about how many MXIPS (millions of eXcel instructions per second) this is using.
Each VLOOKUP that finds a match is doing a linear search and on average is comparing with 250000 rows (and the ones that don’t have a match are comparing all 50000 rows): so for 500000 VLOOKUPs thats roughly 500000 x 250000 = 125000 million compares in 500 seconds = 250 MXIPS.
Then I can use autofilter to filter on ** to show only the rows with missmatches.
and use COUNTIF to count the missmatches:
Because COUNTIF treats * as a wild-card character I need to add a
~ to stop this happening.
Using VLOOKUP this way works OK for small amounts of data but is just not practical for large numbers of rows.
Its probably possible to create an array formula that just gives the count of miss-matches but I expect it would be too slow to be useful.
And I am sure someone cleverer than me can create an array formula to find case-sensitive missmatches.
My objective for COMPARE.LISTS was to create an easy-to-use and fast function that enabled you to quickly find and fix data missmatches. The performance has exceeded my expectations.