Formula References between Sheets versus within Sheets shootout: Which calculates faster and uses more Memory

I thought I would revisit the differences between formulas that reference other worksheets and formulas that only reference their own worksheet. Referencing other worksheets always used to be a memory hog, but so much has changed between Excel 2003 and Excel 2013 that I wanted to see the current status.

The Test Workbooks

The test workbooks are all generated by simple VBA code contained in thw MakeInterLinkedSheets.xlsb workbook, which you can down load from here.

Generating Many Linked Worksheets

The code to generate the interlinked worksheets is shown below. You can choose how many worksheets to generate, and then each worksheet will contain a column of constants and a column of formulas that refer to each of the other worksheets. So if you choose 1500 worksheets each worksheet will contain 1500 formulas with every formula referring to a different worksheet (you can’t get much more linked than that!). Thats a total of 2.25 million formulas.


Sub MakeManyLinkedSheets()
 '
 ' make a large number of worksheets, each of which links to all of the others
 '
 Dim j As Long
 Dim k As Long
 Dim varSheets As Variant
 Dim nSheets As Long
 Dim nRequest As Long
 Dim nAdd As Long
 Dim var() As Variant
 '
 varSheets = Application.InputBox("Enter the Number of Interlinked Sheets to Generate", "Inter-Linked Sheets", 1500)
 If Not IsNumeric(varSheets) Then
 MsgBox "Input must be a number, MakeManyLinkedSheets cancelled", vbOKOnly + vbCritical
 Exit Sub
 Else
 Application.ScreenUpdating = False
 Application.Calculation = xlManual
 '
 nRequest = CLng(varSheets)
 '
 ' add sheets: cannot add more than 255 in one .Add statement
 '
 nSheets = ActiveWorkbook.Worksheets.Count
 nRequest = nRequest - nSheets
 Application.StatusBar = "Adding Sheets"
 Do While nRequest > 0
 nAdd = nRequest
 If nAdd > 255 Then nAdd = 255
 ActiveWorkbook.Worksheets.Add before:=ActiveWorkbook.Worksheets(nSheets), Count:=(nAdd)
 nSheets = ActiveWorkbook.Worksheets.Count
 nRequest = nRequest - nAdd
 Loop
 '
 ' add constant and linkage formula
 '
 For j = 1 To ActiveWorkbook.Worksheets.Count
 Application.StatusBar = "Generating Linkages on Sheet " & CStr(j)
 ReDim var(ActiveWorkbook.Worksheets.Count, 2)
 For k = 1 To ActiveWorkbook.Worksheets.Count
 var(k, 1) = j * k
 var(k, 2) = "=Sheet" & CStr(k) & "!a" & CStr(k)
 Next k
 Worksheets(j).Range("a1").Resize(ActiveWorkbook.Worksheets.Count, 2).Formula = var
 Next j
 Application.StatusBar = False
 Application.Calculation = xlAutomatic
 End If
 End Sub

Since you still (even in XL 2013) cannot create more than 255 sheets in a single Worksheets.Add command the code creates the worksheets in blocks of 255.

Memory Used & File Size

In old versions of Excel (97/2000) this code hit the memory wall at about 200 worksheets.
In Excel 2013 32-bit you can get up to over 2500 worksheets but 4000 fails at about 1.4 gigabytes.
In Excel 2013 64-bit I got to 5 gigabytes of memory trying for 4000 sheets but it was so slow I gave up.

For 1500 sheets:

  • Excel 2010 32 uses about 430 Megabytes of memory for the workbook
  • Excel 2013 32 uses about 540 Megabytes of memory for the workbook
  • Excel 2013 64 uses about 770 Megabytes of memory for the workbook
  • The workbook takes about 40 Megabytes when saved as an XLSB

Comparing Within Sheet References and Between Sheets References

Generating the Formulas

I used 3 different methods for generating the within-sheet reference formulas (in R1C1 mode):

Method 1 uses a formula that refers to the previous column on this row:
=Sheet1000!RC[-1]

Method 2 uses a formula that always refers to column 1 for this row:
=Sheet1000!RC1

Method 3 uses a formula that randomly refers to rows and columns (nRequest is the number of sheets requested):
"=Sheet1000!R" & Int(Rnd() * nRequest + 1) & "C" & Int(Rnd() * nRequest + 1)

The reason for using 3 different formulas is to see what effect different kinds of references have on memory and calculation speed.

The complete code for method 3 looks like this:

</pre>
Sub MakeManyFormulas3()
 '
 ' make a single worksheet that refers to itself
 ' generate pairs of columns:
 ' numeric constant followed by formula that refers to a random row and column
 '
 Dim j As Long
 Dim k As Long
 Dim varSheets As Variant
 Dim nSheets As Long
 Dim nRequest As Long
 Dim nAdd As Long
 Dim var() As Variant
 '
 varSheets = Application.InputBox("Enter the Number of Formulas to Generate", "Inter-Linked Sheets", 200)
 If Not IsNumeric(varSheets) Then
 MsgBox "Input must be a number, MakeManyLinkedFormulas cancelled", vbOKOnly + vbCritical
 Exit Sub
 Else
 Application.ScreenUpdating = False
 Application.Calculation = xlManual
 Application.ReferenceStyle = xlR1C1
 nRequest = CLng(varSheets)
 '
 ' add constant and linkage formula
 '
 For j = 1 To nRequest
 ReDim var(nRequest, 2)
 For k = 1 To nRequest
 var(k, 1) = j * k
 '
 ' refers to random row and column
 '
 var(k, 2) = "=Sheet1000!R" & Int(Rnd() * nRequest + 1) & "C" & Int(Rnd() * nRequest + 1)
 Next k
 Worksheets("Sheet1000").Range("a1").Resize(nRequest, 2).Offset(0, (j - 1) * 2).FormulaR1C1 = var
 Next j
 Application.StatusBar = False
 Application.Calculation = xlAutomatic
 Application.ReferenceStyle = xlA1
 End If
 End Sub

Timing and Memory Results

The memory used is the difference between before and after memory (Private Working Set) as measured by Windows 7 Task Manager.

The Full Calculate Time is the time taken for the second or third multi-threaded (4 cores – 8 threads) full calculation of all 2.25 million formulas as measured by FastExcel.

Conclusions

I must admit I was surprised about the calculation times: I thought they would be larger for the between sheets references than for the within sheets references. But there is no real noticeable difference: a larger factor is where the within-sheet formulas refer to, or more likely the total number of unique formulas used.
(Random >column to the left> always first column)

XL Version Formula Method Memory (MB)
Full Calc Time
2010 32 Interlink Sheets

426 MB

0.30 Seconds

2010 32 Previous Column

320 MB

0.27 Seconds

2010 32 First Column

214 MB

0.26 Seconds

2010 32 Random

286 MB

0.60 Seconds

 
2013 32 Interlink Sheets

538 MB

0.29 Seconds

2013 32 Previous Column

402 MB

0.29 Seconds

2013 32 First Column

300 MB

0.26 Seconds

2013 32 Random

369 MB

0.66 Seconds

2013 64 Interlink Sheets

835 MB

0.33 Seconds

My conclusions from all this are:

  • 64-bit Excel uses more memory than 32-bit Excel
  • Interlinking sheets uses more memory than within-sheet references.
  • There is no significant calculation time penalty in using inter-sheet references
  • Excel 2013 uses more memory than Excel 2010
  • The more unique formulas there are the more memory and calculation time is needed
Posted in Calculation, Memory, VBA | Tagged | 1 Comment

Exploring Range.Calculate and Range.CalculateRowMajorOrder: fast but quirky formula calculation

The Range.Calculate methods are very useful additions to Excel’s other calculation methods (Application level Calculate, CalculateFull, CalculateFullRebuild and Worksheet.calculate: the missing one is Workbook.Calculate!).

You can use the Range Calculate methods to:

  • Force calculation of a block of formulas or a single formula
  • See how long the variations of a particular formula take to calculate
  • Speed up repeated calculations

Download my RangeCalc Addin

You can download my RangeCalc addin from my website’s downloads page (xla password is dm).

This adds a button to the addins tab which uses Range.Calculate to time calculation of the currently selected cells.

Inspecting the RangeCalc code: different problems with different versions

You can unlock the xla to view the code using a password of dm.
The code in the RangeCalc sub bypasses a number of Range.calculate quirks in various Excel versions:


Sub RngTimer()
 '
 ' COPYRIGHT © DECISION MODELS LIMITED 2000,2001. All rights reserved
 '
 ' timed calculation of selected Range
 '
 ' bypass grouped and interactive problem 17/10/00
 ' remove interactive=false: Excel 97 Hangs when UDF error 14/2/01
 ' fix for application.iteration and array formulae with Excel2002 29/10/2001
 '
 Dim dRangeTime As Double
 Dim iMsg As Integer
 Dim blIter As Boolean
 Dim oCalcRange As Range ''' range to calculate
 Dim dOvhd As Double
 Dim strMessage As String
 '
 ' store iteration property
 '
 blIter = Application.Iteration
 '
 If ActiveWorkbook Is Nothing Or ActiveSheet Is Nothing Or ActiveWindow Is Nothing Or Selection Is Nothing Then
 Exit Sub
 Else
 If TypeName(Selection) = "Range" Then
 '
 ' if Excel2002 or greater handle iteration problem
 '
 If Left(Application.Version, 1) = "1" Then
 '
 ' switch off iteration
 '
 Application.Iteration = False
 End If
 '
 ' expand selected range to include all of any multicell array formula
 ' - makes Excel 2002 behave like earlier versions
 ' - allows notification if range has been expanded
 '
 Call ExpandRange(Selection, oCalcRange)
 '
 On Error GoTo errhandl
 '
 dOvhd = MicroTimer ''' ensure frequency is initialised
 dOvhd = MicroTimer ''' get time
 dOvhd = MicroTimer - dOvhd ''' calc microtimer overhead
 '
 dRangeTime = MicroTimer
 oCalcRange.Calculate
 dRangeTime = MicroTimer - dRangeTime - dOvhd
 '
 On Error GoTo 0
 '
 dRangeTime = Int(dRangeTime * 100000) / 100
 '
 ' 16/11/2009 - bypass multi-cell array formula problem
 '
 If Val(Application.Version) > 9 And Val(Application.Version) < 12 Then
 oCalcRange.Dirty
 End If
 '
 ' change message if array formula caused expansion of selection
 '
 If oCalcRange.Count = Selection.Count Then
 strMessage = CStr(Selection.Count) & " Cell(s) in Selected Range "
 Else
 strMessage = CStr(oCalcRange.Count) & " Cell(s) in Expanded Range "
 End If
 iMsg = MsgBox(strMessage & CStr(dRangeTime) & " Milliseconds", vbOKOnly + vbInformation, "RangeCalc")
 End If
 End If
 Application.Iteration = blIter ''' restore setting
 Set oCalcRange = Nothing
 Exit Sub
 errhandl:
 On Error GoTo 0
 Application.Iteration = blIter ''' restore setting
 Set oCalcRange = Nothing
 iMsg = MsgBox("Unable to Calculate Range", vbOKOnly + vbCritical, "RangeCalc")
 End Sub

Circular References

Using Range.Calculate on ranges that contain circular references within the range fails in Excel versions before Excel 2007.
In Excel 2007 and later Range.calculate only does a single iteration of the circular reference in Manual calculation mode, regardless of the Iteration settings.
So the RangeCalc addin switches iteration off whilst doing the Range.Calculate.

Multiple Sheets Selected

If you have multiple sheets selected Range.Calculate fails with a 1004 error, so the RangeCalc code has an error trap and message for any failure in Range.Calculate.

Multiple Areas selected  on a single Sheet

Range.Calculate will happily calculate a multi-area selection as long as all the areas are on the same sheet.

Multi-Cell Array formulas

If you do not select all the cells in a multi-cell array formula Range.Calculate will fail. My RangeCalc addin solves this problem by:

  • Automatically expanding the range to calculate to include all the cells in any array formula which intersects the selected range
  • Notifying the user that the range has been expanded

The VBA code to exapnd the range looks like this:


Sub ExpandRange(oStartRange As Range, oEndRange As Range)
 '
 ' COPYRIGHT © DECISION MODELS LIMITED 2000,2001. All rights reserved
 '
 ' Input: oStartRange, a range object that may or may not contain array formulae
 ' Output: oEndRange, a range object that has been expanded -
 ' to include all the cells in any array formula that is partly in the range
 '
 Dim oCell As Range
 Dim oArrCell As Range
 '
 ' loop on cells in oStartRange
 ' and expand range to include all the cells in any array formulae
 '
 On Error Resume Next
 '
 Set oEndRange = oStartRange
 For Each oCell In oStartRange
 If oCell.HasArray = True Then
 For Each oArrCell In oCell.CurrentArray
 '
 ' add any extra array cells
 '
 If Intersect(oEndRange, oArrCell) Is Nothing Then
 '
 ' if this cell is not in the expanded range then add it
 '
 Set oEndRange = Union(oEndRange, oArrCell)
 End If
 Next oArrCell
 End If
 Next oCell
 Set oCell = Nothing
 Set oArrCell = Nothing
 End Sub

There is also another problem with multi-cell array formulas and Range.Calculate, but it only exists in Excel 2002 and 2003 (after a Range.Calculate the array formula gets evaluated once for each cell it occupies in all subsequent recalculations). This problem is bypassed by using Range.Dirty on the range!

Note: The bug in Range.Dirty  is still there in Excel 2013. (it always works on the active sheet even when the range refers to another sheet!)

Range.Calculate and Range.CalculateRowMajorOrder – different handling of within-range dependencies

In early Excel versions (Excel 97 and 2000) Range.Calculate used a very simple calculation method: calculate the cells in each row in turn from left to right and ignore any forward references or within range dependencies. This method is fine as long as you know thats what it does and arrange your formulas accordingly (otherwise you may get incorrect results)!

But some people thought this was a bug, so it got fixed in Excel 2002 and 2003 (and later versions): Range.Calculate now starts by doing the left-to right calculation on each row in turn, and then starts recalculating any cells that refer to uncalculated cells within the range. In other words it achieves the same result as the standard Excel recalculation method.

The only problem was that this made Range.Calculate slower than in previous versions: and so some customers refused to upgrade because they could not run their bump runs fast enough!

So in Excel 2007 Microsoft solved the problem by introducing Range.CalculateRowMajorOrder. This method worked exactly the same way as the Excel 97 versions of Range.Calculate and was faster than the new Range.Calculate, and so everyone was happy except the VBA coders who had to work out when to use which method.

Some more Range.Calculate Limitations

Whilst the 2 Range Calculate methods are very useful, they do have some limitations:

  • They are both single-threaded calculation methods (In todays world this a serious limitation)
  • There is no keystroke sequence to initiate them from the UI (FastExcel uses Alt-F9 for this)
  • Re-entrant use of Range.Calculate is not allowed: for instance you can’t call Range.Calculate from inside a UDF
  • Range.Calculate works in US english dates etc.

Summary

  • Range.Calculate and Range.CalculateRowMajorOrder can be fast calculation methods
  • But they are not multi-threaded
  • For me they are essential tools for comparing formula speed
  • They need a bit of wrapping code, as in my RangeCalc addin, to make them generally useful.
Posted in Calculation, VBA | Tagged , | 1 Comment

Parsing Functions from Excel Formulas using VBA: Is MID or a Byte array the best method?

As part of extending the performance profiling abilities of FastExcel, I wanted to develop a Function Profiler Map. A key component of this is to extract the names of the functions embedded in Excel formulas.

So I experimented with some different approaches:

  • using Rob Van Gelder’s AudXL formula parser
  • using the MID function to scan through character by character
  • using byte arrays

The Test Data

For performance testing I am using a large worksheet with 638K used cells, 103K constants and 413K formulas. There are 3 functions on this sheet (710K INDEX, 5 SUM and 51K IF).

For validity testing I am using a small worksheet with some tricky formulas:

ParseFormulas1

Rob Van Gelder’s AudXL formula parser

Rob converted a javascript-based Excel formula parser  written by Eric Bachtal to VBA. You can download AudXL.xla from here. It is a useful tool for breaking formulas apart so that they are more readable.

So I started with that as a performance baseline: it takes 130 seconds to parse the large test worksheet.

But of course a general purpose formula parser should be slower than a purpose-built parser that only finds the functions.

Rules for Finding Functions in Formula strings

  • Function names are always followed by open bracket “(“
  • Function Names are always preceded by an operator
    +-,/*(=><& :!^
  • Formula text enclosed in single or double quotes cannot contain a function

Assumptions

  • Formulas to be parsed will be obtained using Range.FormulaR1C1.
  • This means that the formula string will will use American English settings and native function names.
  • And I only have to parse formulas that have been validated by Excel.
  • The full range of Unicode characters are allowed, so I can’t just use ASCII codes.

Download the VBA Code

You can download the routines containing the VBA from here

The First attempt: using Byte arrays

In a previous post I discussed using Byte arrays as an efficient method for processing all the characters in a string.
This is a good method for handling Unicode strings – each character gets converted into 2 separate bytes (0 to 255) which uniquely define the character and cover all the possible National Language characters in the world.

So “(” for instance is always 40 and 0 and


Dim abFormula() as Byte
Dim str1 as string
str1="=NA()"
abFormula=str1


produces:

ParseFormulas2

Note that the Byte arrays produced by assigning a string to a byte array are always zero-based.

You can also do the reverse: assigning a byte array to a string converts back.

The First Algorithm

So here is an algorithm that follows the rules:

Scan each character in the byte array from left to right:
If the character is a single or double quote ignore all subsequent characters until there is another single or double quote.
if the character is in the list of operators (abStartChars) then set a flag
if the character is NOT in the list of operators check if its a “(” and
if it IS a “(” and there are one or more non-operator characters preceding it we have found a function name.

The code looks like this:


Function GetFunc1(abFormula() As Byte, abStartChars() As Byte, abEndChar() As Byte, _
 abQuotes() As Byte, abSQ() As Byte, jStart As Long, jEnd As Long) As String
 '
 ' search for a function name within a byte array starting at the jStart byte position
 ' byte array is zero based pairs of bytes
 ' abFormula is a byte array version of the formula
 ' abStartChars is a byte array containing characters that can precede the function name
 ' abEndChar is a byte array containing "("
 ' abQuotes is a byte array contianing a double quote
 ' abSQ is a byte array containing a single quote
 '
 ' returns name of function as a string and jEnd as the byte position of "("
 '
 Dim j As Long
 Dim k As Long
 Dim jStartChar As Long
 Dim jFirst As Long
 Dim blStart As Boolean
 Dim blString As Boolean
 Dim abFunc() As Byte
 '
 jFirst = jStart + 2
 blString = False
 For j = jStart + 2 To (UBound(abFormula) - 2) Step 2
 '
 ' skip text strings
 '
 If (abFormula(j) = abQuotes(0) And abFormula(j + 1) = abQuotes(1)) _
 Or (abFormula(j) = abSQ(0) And abFormula(j + 1) = abSQ(1)) Then
 blString = Not blString
 End If
 If Not blString Then
 '
 ' look for non startchar
 '
 blStart = False
 For jStartChar = 0 To UBound(abStartChars) Step 2
 If abFormula(j) = abStartChars(jStartChar) _
 And abFormula(j + 1) = abStartChars(jStartChar + 1) Then
 blStart = True
 jFirst = j + 2
 Exit For
 End If
 Next jStartChar
 If Not blStart Then
 If abFormula(j) = abEndChar(0) And abFormula(j + 1) = abEndChar(1) Then
 '
 ' we have a (
 '
 If j > jFirst Then
 '
 ' we have a function
 '
 jEnd = j
 '
 ' jend points to first byte of the ( character
 ' jfirst points to the first byte of the function name
 ' convert slice of formula to function name string
 '
 ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte
 For k = 0 To UBound(abFunc)
 abFunc(k) = abFormula(jFirst + k)
 Next k
 GetFunc1 = abFunc
 Exit Function
 '
 ElseIf abFormula(jFirst) = abEndChar(0) _
 And abFormula(jFirst + 1) = abEndChar(1) Then
 jFirst = jFirst + 2
 End If
 End If
 End If
 End If
 Next j
 End Function

And the driver routine just loops on all the formulas in the worksheet, calling the function parsing routine and storing/counting the functions found in a dictionary.


Sub testing1()
 Dim strFormula As String
 Dim strFunc As String
 Dim abFormula() As Byte
 Dim abStartChars() As Byte
 Dim abEndChar() As Byte
 Dim abQuotes() As Byte
 Dim abSQ() As Byte
 Dim jStart As Long
 Dim jEnd As Long
 Dim oFormulas As Range
 Dim oCell As Range
 Dim dTime As Double
 Dim dicFuncs As New Dictionary
 '
 ''' characters that can come before the start of a Function name
 Const strStartChars1 As String = "+-,/*=><& :!^" & vbLf
 '
 dTime = MicroTimer
 '
 abStartChars = strStartChars1
 abEndChar = "("
 abQuotes = Chr(34)
 abSQ = Chr(39)
 '
 dicFuncs.CompareMode = TextCompare
 '
 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
 '
 For Each oCell In oFormulas
 strFormula = oCell.FormulaR1C1
 abFormula = strFormula
 jStart = 0
 jEnd = 0
 Do
 strFunc = UCase(GetFunc1(abFormula, abStartChars, abEndChar, abQuotes, abSQ, jStart, jEnd))
 If LenB(strFunc) = 0 Then Exit Do
 jStart = jEnd
 '
 ' add Function to dictionary and count occurrences
 '
 If dicFuncs.Exists(strFunc) Then
 dicFuncs(strFunc) = dicFuncs(strFunc) + 1
 Else
 dicFuncs.Add strFunc, 1
 End If
 Loop
 Next oCell
 '
 MsgBox MicroTimer - dTime
 End Sub

This version takes 48 seconds: better but with room for improvement.

The (nearly) Final Algorithm

  • I don’t need to worry about upper-lower case Function names, since Excel already takes care of that, so I can use the default Binary Compare for the dictionary
  • All the special characters for operators etc always have byte 2 =0 so I only need to test that once per character rather on each comparison.
  • If I first search left to right for the “(” character and then work backwards looking for an operator character the function does many fewer comparison operations, because there is only 1 end character but many start characters.

The resulting VBA code looks like this:


Function GetFunc3(abFormula() As Byte, abStartChars() As Byte, jStart As Long, jEnd As Long) As String
 '
 ' search for a function name within a byte array bstarting at the jStart byte position
 ' byte array is zero based pairs of bytes
 ' abFormula is a byte array version of the formula
 ' abstartchars is a byte array containing characters that can precede the function name
 '
 ' returns a string name of function and jEnd as the position of "("
 '
 Dim j As Long
 Dim k As Long
 Dim jj As Long
 Dim jStartChar As Long
 Dim jFirst As Long
 Dim blStart As Boolean
 Dim blDoubleQ As Boolean
 Dim blSingleQ As Boolean
 Dim abFunc() As Byte
 '
 jFirst = jStart + 2
 blDoubleQ = False
 For j = jStart + 2 To (UBound(abFormula) - 2) Step 2
 '
 ' start and end characters always have byte 2 =0
 '
 If abFormula(j + 1) = 0 Then
 '
 ' skip text strings
 '
 If abFormula(j) = 39 Then blSingleQ = Not blSingleQ
 If Not blSingleQ Then
 If abFormula(j) = 34 Then blDoubleQ = Not blDoubleQ
 If Not blDoubleQ Then
 '
 ' look for (
 '
 If abFormula(j) = 40 Then
 '
 ' we have a (
 ' look backwards for a startchar
 '
 blStart = False
 For jj = j - 2 To jStart Step -2
 For jStartChar = 0 To UBound(abStartChars) Step 2
 If abFormula(jj) = abStartChars(jStartChar) Then
 blStart = True
 jFirst = jj + 2
 Exit For
 End If
 Next jStartChar
 If blStart Then Exit For
 Next jj
 If blStart Then
 If j > jFirst Then
 '
 ' we have a function
 '
 jEnd = j
 Exit For
 ElseIf abFormula(jFirst) = 40 Then
 jFirst = jFirst + 2
 End If
 End If
 End If
 End If
 End If
 End If
 Next j
 If blStart And jEnd > jFirst Then
 '
 ' convert slice of formula to function name string
 ' jend points to first byte of the ( character
 ' jfirst points to the first byte of the function name
 '
 ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte
 For k = 0 To UBound(abFunc)
 abFunc(k) = abFormula(jFirst + k)
 Next k
 GetFunc3 = abFunc
 End If
 End Function

And the corresponding driver routine looks like this:


Sub testing3()
 Dim strFunc As String
 Dim abFormula() As Byte
 Dim abStartChars() As Byte
 Dim jStart As Long
 Dim jEnd As Long
 Dim oFormulas As Range
 Dim oCell As Range
 Dim dTime As Double
 Dim dicFuncs As New Dictionary
 '
 ''' characters that can come before the start of a Function name
 Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf
 '
 dTime = MicroTimer
 '
 abStartChars = strStartChars2
 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
 '
 For Each oCell In oFormulas
 abFormula = oCell.FormulaR1C1
 jStart = 0
 jEnd = 0
 Do
 strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd)
 If LenB(strFunc) = 0 Then Exit Do
 jStart = jEnd
 If dicFuncs.Exists(strFunc) Then
 dicFuncs(strFunc) = dicFuncs(strFunc) + 1
 Else
 dicFuncs.Add strFunc, 1
 End If
 Loop
 Next oCell
 MsgBox MicroTimer - dTime
 End Sub

This version takes 11.6 seconds: but maybe I can make it faster?

Using MID$() instead of a Byte array

What happens if I forget about all this Byte array stuff and just implement the same algorithm using MID$() to extract each character from the formula and INSTR to check against the list of start operators (start characters)?

The GetFunc VBA code looks like this:


Function GetFunc4(strFormula As String, strStartChars As String, strEndChar As String, strQuotes As String, strSQ As String, jStart As Long, jEnd As Long) As String
 '
 ' search for a function name within a formula string starting at the jStart character position
 '
 ' strStartChars is a string containing characters that can precede the function name
 '
 ' returns a string name of function and jEnd as the position of "("
 '
 Dim j As Long
 Dim k As Long
 Dim jj As Long
 Dim jStartChar As Long
 Dim jFirst As Long
 Dim blStart As Boolean
 Dim blDoubleQ As Boolean
 Dim blSingleQ As Boolean
 Dim abFunc() As Byte
 Dim strChar As String
 Dim iStartChar As Long
 '
 jFirst = jStart + 1
 blDoubleQ = False
 For j = jStart + 1 To (LenB(strFormula) - 1)
 strChar = Mid$(strFormula, j, 1)
 '
 ' skip text strings
 '
 If strChar = strSQ Then blSingleQ = Not blSingleQ
 If Not blSingleQ Then
 If strChar = strQuotes Then blDoubleQ = Not blDoubleQ
 If Not blDoubleQ Then
 '
 ' look for (
 '
 If strChar = strEndChar Then
 '
 ' we have a (
 ' look backwards for a startchar
 '
 blStart = False
 For jj = j - 1 To jStart Step -1
 strChar = Mid$(strFormula, jj, 1)
 iStartChar = InStrB(strStartChars, strChar)
 If iStartChar > 0 Then
 blStart = True
 jFirst = jj + 1
 Exit For
 End If
 Next jj
 If blStart Then
 If j > jFirst Then
 '
 ' we have a function
 '
 jEnd = j
 '
 ' convert slice of formula to function name string
 ' jend points to first byte of the ( character
 ' jfirst points to the first byte of the function name
 GetFunc4 = Mid$(strFormula, jFirst, jEnd - jFirst)
 Exit Function
 ElseIf Mid$(strFormula, jFirst, 1) = strEndChar Then
 jFirst = jFirst + 1
 End If
 End If
 End If
 End If
 End If
 Next j
 End Function

Using MID$ and INSTR is slower: it takes 21.8 seconds.

Optimising the Driver Routine.

OK so I am reasonably happy with the parsing routine: I have gone from 130 seconds for a generic parsing routine to 11.6 seconds for the specialised Byte array routine.

But the driver routine looks at every single formula on the worksheet, and we know that on most large worksheets a large percentage of the formulas are copied.

So I can use the dictionary approach to find the distinct formulas (use R1C1 rather than A1 mode because copied formulas are identical in R1C1) and just parse those:

Testing3C also makes some more speed improvements by looping on each area and getting the formulas into a variant array rather than looping directly on the cells.


Sub testing3c()
 Dim strFunc As String
 Dim abFormula() As Byte
 Dim abStartChars() As Byte
 Dim jStart As Long
 Dim jEnd As Long
 Dim oFormulas As Range
 Dim oCell As Range
 Dim oArea As Range
 Dim vArr As Variant
 Dim vF As Variant
 Dim dTime As Double
 Dim dtotTime As Double
 Dim dicFuncs As New Dictionary
 Dim dicFormulas As New Dictionary
 Dim j As Long
 Dim k As Long
 '
 Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf ''' characters that can come before the start of a Function name
 '
 dTime = MicroTimer
 '
 abStartChars = strStartChars2
 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
 '
 ' find distinct formulas in areas and count occurrences
 '
 For Each oArea In oFormulas.Areas
 '
 ' get a block of formulas
 '
 vArr = oArea.FormulaR1C1
 If IsArray(vArr) Then
 For k = 1 To UBound(vArr, 2)
 For j = 1 To UBound(vArr)
 If dicFormulas.Exists(vArr(j, k)) Then
 dicFormulas(vArr(j, k)) = dicFormulas(vArr(j, k)) + 1
 Else
 dicFormulas.Add vArr(j, k), 1
 End If
 Next j
 Next k
 Else
 If dicFormulas.Exists(vArr) Then
 dicFormulas(vArr) = dicFormulas(vArr) + 1
 Else
 dicFormulas.Add vArr, 1
 End If
 End If
 Next oArea
 '
 ' parse only the distinct formulas
 '
 For Each vF In dicFormulas
 abFormula = vF
 jStart = 0
 jEnd = 0
 Do
 strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd)
 If LenB(strFunc) = 0 Then Exit Do
 jStart = jEnd
 If dicFuncs.Exists(strFunc) Then
 dicFuncs(strFunc) = dicFuncs(strFunc) + dicFormulas(vF)
 Else
 dicFuncs.Add strFunc, dicFormulas(vF)
 End If
 Loop
 Next vF
 MsgBox MicroTimer - dTime
 End Sub

Now this takes only 3.5 seconds: adding a formula string to a dictionary is much faster than parsing it!

Conclusion

  • Byte Arrays can be significantly faster for character by character operations rather than using MID and INSTR.
  • Looking at how the first attempt at an algorithm works can give you clues about how to improve it.
  • Even using Byte arrays string parsing operations are slow in VBA.
  • Using a dictionary its really fast to find the distinct formulas even on a large worksheet.

Challenge

OK guys: who can write a faster VBA function parsing routine?

Download the routines from here

Posted in Calculation, VBA | Leave a comment

UNIQUES and DISTINCTS: exploring lists with LISTDISTINCTS

I just added some options to the SpeedTools LISTDISTINCTS functions that make them surprisingly powerful. You can now easily find the most frequently occurring item in a list, or find the item with the largest sum or average of a corresponding column.

But first since there is disagreement about the meaning of the terms UNIQUES and DISTINCTS I should explain what I mean:

  • A unique item in a list is one that only occurs once
  • Distinct items in a list can occur once or more than once

Creating a list of distinct items

Suppose you have a list formatted as a table:

Distincts1

Then entering the formula =LISTDISTINCTS(Table1) as a multi-cell array formula (select 14 cells in a column, enter the formula in the formula bar and press Control/Shift/enter) gives you this:

Distincts2Notice that the items appear in the sequence of their first occurrence in the list.

There are rather a lot of #N/As since Excel pads out the excess cells (the cells for which the array formula did not return anything) with #N/A.
But of course LISTDISTINCTS allows us to fix that using the PAD option in the formula =LISTDISTINCTS(Table1,,,,,1)
Pad can be 0 = pad with #N/A, 1 = pad with “”, 2 pad with zero.

Distincts3That looks better, but OOPS the #N/A in the list has disappeared! Thats because the default option for LISTDISTINCTS is to ignore error values, blanks and empty cells, so we just need to change the Ignore option to 2.

Distincts4

And the result looks like this: (it shows #ERROR rather than #N/A so that you can distinguish it from the padding #N/A)

Distincts5There some more options for LISTDISTINCTS

Distincts6

Case_Sense defaults to false, so the aa and AA in the list are treated as being the same.

If your list of items has more than one column you can either ask for a list of distinct rows (ByRows=True) or a list of all the distinct items across all the columns.

And you can sort the result list ascending (Sort=1), descending (Sort=2) or leave it unsorted (Sort=0).

Here is an example of LISTDISTINCTS sorted ascending, case-sensitive, pad with blanks, include errors, showing the difference with ByRows True and ByRows False.

Distincts7

Counting Distinct Items

There are 2 variations of LISTDISTINCTS for counting the number of distinct items: COUNTDISTINCTS and LISTDISTINCTS.COUNT

Distincts8COUNTDISTINCTS is not an array formula and just gives you the count.
But LISTDISTINCTS.COUNT adds an extra column that gives the count of occurrences of each of the distinct items.

Finding the most frequently occurring item

You can also sort the output of LISTDISTINCTS.COUNT most frequent occurrences first (Sort=-2) or last (Sort=2).

So the formula =LISTDISTINCTS.COUNT(A21:A34,,,,-2) (not an array formula, entered in a single cell) returns AA which is the most frequently occurring item.

Sums and Averages for distinct items.

As well as LISTDISTINCTS.COUNT there are LISTDISTINCTS.SUM and LISTDISTINCTS.AVG
These take an additional column argument showing what to sum or average for each distinct item.
And, just like LISTDISTINCTS.COUNT, you can sort the output either on the distinct items or on the resulting sums or averages.

distincts10

Distinct9

Summary

Of course you can achieve similar things with Pivot Tables and PowerPivot. But there are many occasions when I find that a simple formula that automatically refreshes whenever Excel recalculates is a better solution.

And adding the option to sort ascending or descending on either the item list or the count, sum or average adds a lot of pwer to the functions.

So what do you use for this kind of thing: Formula, UDF, Pivot Table or PowerPivot?

Posted in Calculation, UDF, XLL | Tagged , , | Leave a comment

Exploring Conditional Format Performance Part 3: What’s slow, whats buggy and whats faster!

This is the third in a series of Posts on Conditional Formats (see part 1 and Part2).

This post looks at the effects  on the performance of Conditional Formats of:

  • Application.Screenupdating
  • Application.EnableConditionalFormatsCalculation
  • Application.Calculation
  • Whether the cells containing the conditional formats are visible or not
  • Screen Refresh
  • Excel 2007, Excel 2010 and Excel 2013

The workbook I am using is called (with stunning originality) FormatConditionsB.xlsb, and you can download it from my Skydrive.

It contains 1.9 million Rand() formula in A1:Z72858, and each of these cells has 3 conditional format rules:

CFEx3_1So thats 5.7 million conditional format rules.

There are 2 worksheets: Formats and Empty

The workbook also contains the MicroTimer api code for high resolution timing and 5 subs, Testing 1 through 4 and testscroll1.
The subs typically set calculation mode, screenupdating and enableformatconditionscalculation, time a calculation and then time a screen update.
For example here is the code for Testing1:


Sub testing1()
 Dim osht As Worksheet
 Dim dtime As Double
 Application.Calculation = xlCalculationManual
 Worksheets("Formats").Activate
 Set osht = Worksheets("Formats")
 Application.ScreenUpdating = True
 osht.EnableFormatConditionsCalculation = True
 dtime = MicroTimer
 Application.Calculate
 dtime = MicroTimer - dtime
 Debug.Print dtime
 dtime = MicroTimer
 Application.ScreenUpdating = True
 dtime = MicroTimer - dtime
 Debug.Print dtime
 End Sub

Timings with different sheets visible.

The workbook opens with the formats sheet visible.
If you click the Empty tab you instantly see the empty sheet.
But if you then switch back to the Formats sheet there is a noticeable delay of about a second before the screen refreshes. Similarily pressing Page Up takes just over a second before the screen refreshes.

This is because Excel re-evaluates the conditional formats for the visible cells on the active sheet at each screen refresh.

Prssing F9 to recalculate the 1.9 million RAND() formulas with the Formats sheet visible takes 2.8 seconds, but with the Empty sheet visible it takes 0.2 seconds- again its the evaluation of the visible conditional formats that takes the time.

Conditional Formats are not directly evaluated by a calculation.

Running the Testing Subroutines

Here are the timings in seconds for running Test1 through Test4, with the Formats sheet visible.

CF2_Timings1

The conclusions of this test are:

Excel 2010 and Excel 2013 are noticeably faster than Excel 2007.

  • Turning off screen updating is the big winner
  • Switching off EnableFormatConditionsCalculation is only worthwhile if ScreenUpdating is true
  • Switching off EnableFormatConditionsCalculation is much less effective than switching off ScreenUpdating
  • Although Refresh looks very fast in Excel 2013 it actually just postpones the refresh to after the VBA has finished, so in fact its not faster.

I then repeated the tests, but with the Empty sheet visible rather than the Formats sheet:

CF2_Timings2

This completely avoids the refresh evaluation of the conditional formats and the times are comparable to the first set of tests with Screen Updating False.

I also tried repeating the tests with the Formats sheet active but hidden behing the VBE window.
The timings were virtually the same as with the Formats sheet visible.

So its the refresh of the conditional format cells within the activesheet window that uses the time, even if its hidden behind some other window.

I also ran TestScroll1. This times the effect of a complete scroll of the conditional formats window.

CF2_Timings3

As you can see the scroll times are comparable to the refresh times in the first set of tests, except for Excel 2013.
But the Excel 2013 refresh timings in the first test are cheating because the refresh actually takes place after the VBA sub has ended.

Range.Calculate and Range.CalculateRowMajorOrder

If you use Range.CalculateRowMajorOrder on a single cell (or a large block of cells) it takes about 1.4 seconds – the same time as a scroll/screen refresh.
But Range.Calculate takes almost exactly twice as long – looks like it causes 2 screen refreshes not one!

Seriously Slow Conditional Formats

If you want to play with a workbook containing some seriously heavyweight conditional formats you can download ConditionalFormatsC.xlsb

This has 132K formulas =INT(RAND()*1000)  in A1:V6000 and each cell has a single formatting rule to colour orange duplicated values in A1:V6000. (well of course they all turn orange).

With the Formats sheet visible pressing F9 to recalculate takes about 40 seconds.
And it looks like evaluating the conditional formats is all single-threaded: no advantage from multiple cores!

But with the Empty sheet visible F9 takes 0.03 seconds.

Conclusions

  • Heavy conditional formatting can be slow
  • Conditional Format evaluation is single-threaded
  • EnableFormatCondtionsCalculation is not very useful
  • Evaluation of conditional format rules takes place at screen refresh time rather than calculation time
  • Only the conditional format rules for cells that are shown on the active window(s) get evaluated
    (large screens will be slower than small screens and zoom out slows you down!)
  • ScreenUpdating=false works well, but the final refresh time will occur when the Sub is exited.
  • Using UDFs in conditional formats is probably not a good idea
  • The interaction of VBA and conditional formats looks buggy
  • Excel 2013 and 2010 are faster than 2007 for Conditional Formats

2 other bugs with conditional formats have been reported, but I don’t know if they have been fixed in Excel 2013:

  • Opening a file created in Excel 2003 with Excel 2007 could make the conditional formats fail to refresh unless you manually set EnableFormatConditionsCalculation=true
  • Repeated copy-pasting Conditional formats in Excel 2007 duplicated the conditional formatting rules so that large numbers of rules were created.

So whats your experience with Conditional Formats?

Posted in Calculation, Formatting, UDF, VBA | Tagged , | 4 Comments

Exploring Conditional Format Performance Part 2: What’s slow, whats buggy and whats faster!

This is the second in a series of Posts on Conditional Formats (see part 1).

This post looks at the effects (and the resulting bugs!)  on Conditional Formats of:

  • Application.Screenupdating
  • Application.EnableConditionalFormatsCalculation
  • Application.Calculation
  • Whether the cells containing the conditional formats are visible or not
  • Screen Refresh
  • Excel 2007, Excel 2010 and Excel 2013

I am using the same (but slightly updated) test workbook as in Part 1: you can download it from SkyDrive.

Running the Tests

The FormatConditionsA.xlsb workbook contains 12 VBA subroutines to do the testing (Test1 through Test3C).

Cell B2 uses 2 UDFs (Signal1 and Signal2) to determine whether B2 is even or odd, and the B2 formula refers to D21

Cell E2 has 2 conditional format formulas that test directly whether D21 is even or odd.

CFEx1_1You need to run the tests with the VBE window open and the immediate window visible.
For a more detailed explanation of this example workbook see part 1.

If you run the tests using Excel 2007, Excel 2010 and Excel 2013 you will see that a lot of work has been done by the Excel team to minimise the number of times the conditional formats get executed. But (as always when doing optimisations) this has tended to introduce bugs.

Test1: Screenupdating=True, Enable=True, Calc=Auto

  • Excel 2013: OK, 1 call to each UDF, large pause of a second or two before cell B2 refreshes its colour.
  • Excel 2010: OK , 2 calls to each UDF, no noticeable pause.
  • Excel 2007: OK, 4 calls to each UDF, no noticeable pause.

Test2: Screenupdating=true, Enable=true, Calc=manual

  • Excel 2007 & Excel 2010: OK
  • Excel 2013: Bug in cell B2neither of the conditional formats is applied to B2 and neither of the UDFs are executed. Scrolling down and up to refresh the screen does not fix this, but pressing F9 does.

CFEx2_1

Test3: ScreenUpdating=False, Enable=True, Calc=Manual

  • Excel 2013: OK
  • Excel 2010: Bug in B2. Neither of conditional formats are applied to B2 and the UDFs are not executed. Page Down Page Up does not fix but F9 does.
  • Excel 2007: Bug in B2. Page Down Page Up fixes.

Test1A: Screen=True, EnableFormatConditionsCalculation=False, Calc=Auto

So what does setting EnableFormatConditionsCalculation to False actually do?
I am not sure, but what it does NOT do is to permanently switch off the evalution of conditional formats!

  • Excel 2007: OK – the pause in Test 1 has disappeared!
  • Excel 2010: Bug in Cell E2. the left-most vertical border is coloured correctly but the rest of the cell is not! Page Down Page Up fixes it.
  • CFEx2_1A
  • Excel 2013: Bug in cell B2 and E2. Page Down Page up fixes it.CFEx2_1A2013

Test2A: Screen=True, EnableFormatConditionsCalculation=False, Calc=Manual

  • Excel 2007: OK
  • Excel 2010: Bug in cell E2. Page Down Page Up fixes it.
  • Excel 2013: Bug in cell B2. Neither Page Down Page Up nor F9 fix it, but Ctrl/Alt/F9 does.

Test3A: Screen=False, EnableFormatConditionsCalculation=False, Calc=Manual

  • Excel 2013: OK
  • Excel 2010: Bug in cell B2: Neither Page Down Page Up nor F9 fix it, but Ctrl/Alt/F9 does
  • Excel 2007: Bug in cell B2. Page Down Page Up fixes it.

Tests 1B to 3C: switching to another sheet, run the tests, switch back

Its magic: all these tests run correctly in all versions!

Conclusions

  • Looks like using UDFs in conditional format formulas is rather buggy: avoid.
  • EnableFormatConditionsCalculation does not look useful.
    But there were many reports of a problem importing Excel 2003 files with conditional formats into later versions that could be fixed by setting it to True: I don’t know if this problem still exists.
  • The safest way is to activate a sheet that does not contain any conditional formats.

The next post will focus on the performance of conditional formats.

Posted in Calculation, Formatting, UDF, VBA | Tagged , | Leave a comment

Exploring Conditional Format Performance Part 1: What’s slow, whats buggy and whats faster!

Patrick wanted to know if I had any information on Conditional Format calculation and performance, and I have  not looked at it for several years, so here goes!

I have done a series of experiments, using Excel 2007, 2010 and 2013, to try and get some insight on what Excel is doing under the covers. Because there is a lot to cover I have split the post into 3 parts.

This first part covers a simple experiment to see when Conditional formats get executed.

Formatting versus Calculation.

What Excel shows you on the screen or in a printout is the formatted (rendered) version of the results of a calculation.
And because formatting/rendering is such a cpu-intensive process Excel has a lot of tricks to try and minimise the time used (and thats why using Application.Screenupdating=False should be used everywhere to speedup your VBA).

Conditional Formats often do both calculation and formatting, so you have got double the chance of things being slow!

Excel does not generally allow formatting to be part of the calculation chain because formatting occurs after the calculation has finished.
This is also true
for conditional formatting, although it its not clear to what extent there is a separate calculation-of-conditional-formats step before the formatting step.

Excel dynamically formats (re-paints) only what you see on the screen.

To save time Excel only does final formatting for the part of the results you can see on the screen. (so large screens are slower than small ones, and zooming out a long way is slower!). When you have a lot of conditional formats this can cause very noticeable delays in scrolling a page up or down,

Conditional Formats can be Super-Volatile

Because of this dynamic repainting conditional formats are often executed even when no calculation occurs (for instance when you scoll up or down). So its not usually a good idea to embed a heavy calculation into a Conditional Format formula!

Lets start by looking at a very simple example that allows you to track when a condtional format gets executed. You can download the workbook FormatConditionsA.xlsb from SkyDrive. Note it contains VBA so will not run properly in the Excel Web App.

Test workbook FormatCondtionsA.xlsb

The workbook uses 3 cells and 2 VBA UDFs:

  • Cell B2 contains a formula =D21 and has two conditional format rules – colour orange if =signal1(b2) and colour green if =signal2(b2).
    Signal1 and Signal2 are VBA UDFs that increment a calculation counter and show it in the immediate window. Signal1 returns TRUE if B2 is an odd number and Signal2 returns TRUE if B2 is an even number.
  • Cell E2 contains 2 conditional format rules that directly check cell D21 for odd (orange) or even (green).
  • Cell D21 contains a number which you can change to either odd or even to see the effect on the conditional formats.

To run the experiments open the workbook and press F11 to see the VBIDE, then press Ctrl G to View the immediate window.
Then arrange the Excel window and the VBE window so that you can see both of them, and make sure that you can see Row 2 through 21 of the Excel window.

CFEx1_1

Experiment 1: Automatic calculation mode, User-interface driven

Switch to Automatic Calculation mode.
Clear the VBE immediate window.
Select Cell D21 and increment the number by 1.

Both cell E2 and B2 should change colour, and the Immediate Window shows how many times the UDFs have been calculated.

CfEx1_2

With Excel 2013 I get a total of 10 executions of the UDFs! (5 of each) !!! (No, I have absolutely no idea why, thats got to be a bug.)

Excel 2007 and 2010 only do 4 executions (2 for each UDF).

Experiment 2: Manual calculation mode, User-interface driven

Now switch to Manual calculation mode, clear the immediate window and select D21.

Increment D21 by 1: the result is

  • The UDFs are not executed (nothing in the Immediate Window).
  • B2 and E2 stay the same colour.

Page Down and then Page Up (to refresh the Excel window):

CFEx1_3

  • The UDFs are executed once.
  • Cell E2 changes colour because it directly refers to cell D21 which is now Odd.
  • Cell B2 has correctly NOT changed colour because the conditional format is driven by cell B2 itself, which has not yet changed because it has not yet recalculated.

Now press F9:

The UDFs are executed once and cell B2 changes colour.

Experiment 3: The effect of refreshing the screen with Page Up and Page Down

Now increment Cell D21 again so that the status bar shows Calculate.

Press page Up Page Down repeatedly: the immediate window shows that the UDFs execute each time the screen gets refreshed with Page Up.

Now Press F9 to recalculate:

  • Excel 2013 Page Down Page Up does not execute the UDFs
  • Excel 2010 and 2007 does execute the UDFs once for each Page Down Page Up, even though it does not need to.

Experiment 4: Recalculating but with conditional formats scrolled out of sight.

  • Clear the immediate window.
  • Scroll the Excel window so that row 15 is the first row showing
  • Increment cell D21 by 1
  • Press F9 to recalculate, or Ctrl/Alt/F9 to Full Calculate

The immediate window shows nothing: the conditional formats have NOT been executed and will not be until you Page Up to make them visible.

(Note: if you only scroll so that the first row is row 3 the conditional formats DO get exceuted: looks like Excel is using about a 12 row buffer!)

Conclusions from Experiment 1.

  • Conditional formats are executed when the cell containing the conditional format gets repainted.
  • Conditional Formats are not executed at a calculation unless they are on the visible prtion of the screen.
  • Excel 2013 looks a bit over-enthusiastic in Automatic Calculation mode, but smarter in Manual Calculation mode than Excel 2007/2010.

In the next post I will explore what the performance impact of conditional formats is, and what is the impact of setting Application.Screenupdating=False and Worksheet.EnableFormatCondtionsCalculation=False.

Posted in Calculation, Formatting, UDF, VBA | Tagged , | Leave a comment

Using ENVIRON to find the XLB and QAT files

I am currently updating the FastExcel profiler to run with 64-bit Excel. This involves the rather tedious conversion of a large number of Windows API statements to use conditional compilation, VBA7 and WIN64.

Whilst doing this I discovered the VBA ENVIRON function, which gives you an easy way to get some information about the, well, environment.

For example I wanted to show the size of the XLB file, which stores toolbar customisations even in Excel 2007 and later, and the QAT file which stores QAT customisations. The reason for this is that some poor coding practices cause the size of these files to ballon and become corrupt: then Excel starts crashing but does not tell you why!

You could do this by hardcoding the paths to the files in your code, but thats a bad idea because the paths are different for different versions of Windows.
Or you could do this by using Windows API calls to find the directories, and handle the 32-bit/64-bit coding etc.

Or you can use ENVIRON, which is MUCH easier!

The path to the XLB file under Windows 7 on my system is something like:
E:\Users\your username\Appdata\Roaming\Microsoft\Excel\Excelnn.xlb Using the ENVIRON function in a Windows and Excel version-independent way it looks like this:

strGetXLBPath = Environ(“AppData”) & “\Microsoft\Excel\Excel” & CStr(CLng(Val(Application.Version))) & “.xlb”
kXLBSize=FileLen(strGetXLBPath)

The path to the QAT file under Windows 7 on my system is something like:
E:\Users\your username\AppData\Local\Microsoft\Office\Excel.QAT
And using ENVIRON in VBA:

strGetQATPath = Environ(“LocalAppData”) & “\Microsoft\Office\Excel.qat”
kQATSize=FileLen(strGetQATPath)

My XLB file is currently about 12KB and my QAT is less than 1KB, and I reckon anything over about 30KB is asking for trouble.

If they get corrupt you can delete or rename these files and Excel will happily recreate fresh copies (But of course you lose your customisations).
The easiest way to navigate to the directories is to enter %AppData% or %LocalAppData% in the Windows search programs and files box (Windows button).

Other things I use ENVIRON for include:

  • Getting the path to the Temp Files folder using ENVIRON(“TEMP”)
  • Getting the number of processors using ENVIRON(“NUMBER_OF_PROCESSORS”)
  • Getting the computer name using ENVIRON(“COMPUTERNAME”)

If you want to see all the environment variables active on your system (in the VBE Immediate window) you can use this code which I found on StackOverflow


Sub EnumSEVars()
 Dim strVar As String
 Dim i As Long
 For i = 1 To 255
 strVar = Environ$(i)
 If LenB(strVar) = 0& Then Exit For
 Debug.Print strVar
 Next
 End Sub

You can find 2 excellent articles on ENVIRON here (Win XP) and here (Win 7 8).

OK so how many of you use ENVIRON, or like me you did not even know it existed?

Posted in VBA | 5 Comments

Pivot Table Sort is Too Clever

I created a pivot table from a list containing 3 character IDs, then used the Pivot Table field pulldown to sort it.

PivSort1

The resulting sorted list looks like this:

PivSort2

Looks like Pivot Table sort recognises 3 character abbreviations for

  • The day of the week
  • Months
  • A quarter given as the initial letter of the month

And decides to sort them into (some kind of mangled) time sequence and place them before the unrecognised 3-character IDs!

And I can’t find a way to make it do a proper sort!

Any ideas how you can make Excel do this in a not-so-clever but more sensible way?

Thanks to Debra, Alastair and Rory for telling me how to do it.

Posted in Uncategorized | 6 Comments

Finding missing items in lists: VLOOKUP vs COMPARE.LISTS performance and ease of use

Returning to the subject of finding the missmatches between 2 lists I want to compare using VLOOKUP with using SpeedTools COMPARE.LISTS.

Test Data

My test data consists of 2 lists of 500000 6-digit numbers. Most of these numbers match, but 5000 of them are different. The lists are not sorted. What I want to do is:

  • Filter out the missmatches showing ** for each miss
  • Count the missmatches
  • Produce a list of the missmatches

The first list is in A2:A500001 and the second list is in D2:D500001

Using SpeedTools COMPARE.LISTS

(If you want to try this on your PC you can download a full-featured trial version of SpeedTools, and you can download the sample data and examples of using COMPARE.LISTS here.).

COMPARE.LISTS allows you to control what kind of output you want from the comparison.

  • A count of either the matches or the miss-matches
  • Either True/False or Blank/** (** means not found)
  • A count of the matches and a list of the matches
  • A count of the miss-matches and a list of the missmatches

To get just the count of miss-matches you enter COMPARE.LISTS into a single cell as an ordinary (non-array) formula:

=COMPARE.LISTS(D2:D500001,A2:A500001,3)

This formula looks for each of the cells in D2:D500001 in the list A2:A5000001 and counts the number of items that can’t be found (5000 in this case).

And it only takes 0.6 seconds on my PC!
Thats fast enough for you to add the formula as a safety check that all items match.

To get a count and a list of the missmatches you enter the same formula as a multi-cell array formula (select a vertical range of cells, type the formula in the formula bar and press Control-Shift-Enter). The count appears in the first row and the following rows conatin the list of missing items. And it only takes 0.7 seconds on my PC!

To filter out the rows containing the missmatches you enter the following formula into the 500000 cells in E2:E500001 as a multi-cell array formula, and then filter for **:

{=COMPARE.LISTS($D$2:$D$500001,$A$2:$A$500001,2)}

This formula checks each of the cells in D2:D500001 against the range A2:A500001 and returns either blank for a hit or ** for a miss to the 500000 cels in E2:E500001. And it only takes 0.9 seconds on my PC!  Thats fast enough for you to make corrections and recalculate until all the errors are fixed.

You can also get counts and lists of matching items as well as missing items.

And there is an option to a case-sensistive text compare if you need to find missmatches caused by upper-lower case differences.

Using VLOOKUP

I can use the unsorted (exact match) range lookup option of VLOOKUP for each number I want to check, and it will return #N/A if it can’t be found in the other list: or I can check for the #N/A error and show ** for the missmatches.

=IF(ISERROR(VLOOKUP(D2,$A$2:$A$500001,1,FALSE)),"**","")
and copy down for 500000 rows.

This works, but its very slow: it takes over 500 seconds on my quad-core machine even with multi-threaded calculation, and on a single core machine it would be up to 4 times slower!
But thats not surprising if you think about how many MXIPS (millions of eXcel instructions per second) this is using.

Each VLOOKUP that finds a match is doing a linear search and on average is comparing with 250000 rows (and the ones that don’t have  a match are comparing all 50000 rows): so for 500000 VLOOKUPs thats roughly 500000 x 250000 = 125000 million compares in 500 seconds = 250 MXIPS.

Then I can use autofilter to filter on ** to show only the rows with missmatches.
and use COUNTIF to count the missmatches:

=COUNTIF($E$2:$E$500001,"~**")

Because COUNTIF treats * as a wild-card character I need to add a ~ to stop this happening.

Using VLOOKUP this way works OK for small amounts of data but is just not practical for large numbers of rows.

Its probably possible to create an array formula that just gives the count of miss-matches but I expect it would be too slow to be useful.

And I am sure someone cleverer than me can create an array formula to find case-sensitive missmatches.

Conclusion

My objective for COMPARE.LISTS was to create an easy-to-use and fast function that enabled you to quickly find and fix data missmatches. The performance has exceeded my expectations.

Posted in Calculation, Lookups, UDF, XLL | Leave a comment