Archive for the ‘VBA’ Category

SpeedTools Beta 3 – Win a SpeedTools Coffee Mug and SpeedTools License

February 5, 2013

FastExcel SpeedTools Beta 3

FastExcel SpeedTools Beta 3 is a state-of-the-art set of tools to help you speed up calculation of slow Excel workbooks.

Download the 30-day trial of FastExcel SpeedTools Beta 3

Download the SpeedToolsHelp file.
(you may need to unblock the downloaded .CHM file – Right-Click->Properties->Unblock)

Win one of 20 exclusive FastExcel SpeedTools coffee mugs plus a free SpeedTools License for the best 20 Beta Test reports submitted before the end of March 2013

The mug will look something like this!

Send your Beta Test reports to Charles@DecisionModels.com

Good beta test reports include the following:

  • Windows Version used
  • Excel Version used
  • SpeedTools functions tested
  • Bugs found (include enough information to enable duplication)
  • Positive and negative comments (likes & dislikes)
  • Documentation & Help file problems

You can also post in the SpeedTools Beta Test Google Groups Forum

Supercharge Excel’s calculation engine with SpeedTools

With FastExcel V3 SpeedTools you can calculate what you need, when you need, faster:

  • 90 Powerful faster-calculating functions to unblock your calculation bottlenecks.
  • New Calculation methods and modes give you greater control of calculation.
  • FastExcel V3 high-resolution timers so that you can accurately compare and contrast the calculation performance of your formulae, UDFs, worksheets and workbooks.

Choose just the speed up components you need with 4 separate SpeedTools products;

  • Or from the Toolbar in Excel 2003 and earlier:

Supports Excel 2013 and previous versions

  • Fast Multi-threaded calculation with Excel Versions 2013 64 and 32 bit, 2010 64 and 32 bit, 2007
  • Excel 2003, 2002, 2000 also supported
  • Fast calculating functions written in C++ using the XLL interface
  • Windows 8, Windows 7, Vista and XP

Want to know more?

Download the SpeedToolsHelp file
(you may need to unblock the downloaded .CHM file – Right-Click->Properties->Unblock)

Volatile Dependencies, Indirect Dependencies, False Dependencies – When Dependencies Don’t Work the Way You Think They Should

January 20, 2013

Its always a convenient shorthand to say that UDFs and formulas are recalculated when one of their arguments (or a precedent further upstream in the calculation chain) changes.

But in fact that turns out to be a bit of an oversimplification of how Excel works.

The Test Setup

I have 2 UDFs in a standard VBA module:

Depends1

I have used F9 in the VBE to switch to debug mode whenever either of these UDFs execute.
The first UDF (Depends) has 2 arguments (Arg1 and Arg2), but only the first of them (Arg1) is actually used by the UDF. The second UDF (Depends2) uses both the arguments.

The Excel sheet has 2 sets of data for Arg1 and Arg2 and then calls both the UDFs. Calculation is set to Automatic.

Depends2The result of Depends is 6, and of Depends2 is 24.

When you press F9 nothing happens because nothing has changed to cause a recalculation.

Changing Upstream Precedents

  • When you change cell A2 from 1 to 2 the Depends2 UDF calculates first and then Depends calculates second (assuming you entered the Depends formula in D3 before the Depends2 formula in D6 – Excel calculates formulas last entered first calculated unless this sequence gets changed by dependencies or other factors).
    The values change from 24 to 25 and from 6 to 7.
  • If you change cell A2 from 2 to 2 nothing happens – Excel recognises that nothing has changed.
  • When you change B2 from 5 to 50 both UDFs recalculate, even though Depends does not need to since its result is not dependent on B2.

I call the Arg2 dependency in the first UDF (Depends) a False Dependency since its not actually needed.

Volatile Dependencies

Things work differently if you make one of the dependencies volatile. Lets change cell B2 to =RAND()*100

As expected both UDFs recalculate.

Now press F9 to recalculate again without changing anything else.

Depends2 recalculates, but Depends does NOT recalculate even though a value in Arg2 of the Depends UDF has changed.

In other words if the False Dependency is Volatile it is ignored in a recalculation.
This also happens with built-in Excel functions like INDEX().
If A1 contains =NOW(), and A2:A5 contain the numbers 2 to 5 then

  • =INDEX(A1:A5,1,1) is directly dependent on volatile cell A1 and will always be recalculated.
  • =INDEX(A1:A5,3,1) is only indirectly dependent on volatile cell A1 and will NOT always be recalculated, but it will be recalculated once if for example cell A5 is changed even though the answer will not change

I call Volatile False Dependencies Indirect Dependencies.

EVALUATE and Volatile Dependencies

Stephen Gersuk discovered what looks like another bug with the EVALUATE method and volatile dependencies.

If you have a UDF like this:

Function MySum2(r As Range) As Double
MySum2 = Evaluate("sum(" & r.Address(External:=True) & ")")
End Function


then it does not get recalculated when it has a volatile precedent and you press F9.
So this case gives you the wrong answer, because its not a true False Volatile Dependency: the result really does depend on the argument.

You can bypass this bug by adding anything that references the Value of a cell in the argument:

Function MySum2(r As Range) As Double
If IsEmpty(r) Then Exit Function
MySum2 = Evaluate("sum(" & r.Address(External:=True) & ")")
End Function

But just referenceing properties of the range object is not sufficient:

Function MySum3(r As Range) As Double
Dim strAdd As String
strAdd = r.Address(External:=True)
MySum3 = Evaluate("sum(" & strAdd & ")")
End Function

MySum3 has the same problem.

False Dependencies and Calculation Sequence

It has been suggested that you can use False Dependencies to control the sequence in which Excel calculates formulas.

This is a dangerous idea because false dependencies on uncalculated cells cannot be recognised by Excel since it does not get a chance to discover that they are uncalculated.

Conclusion

  • Yet more reasons to avoid Volatile Functions!
  • Another EVALUATE bug!

Do you have any bad experiences with volatile functions?

 

The SpeedTools FILTER.IFS Function Design Part 2: Excel Data Types – When is a Number a String?

January 14, 2013

Excel Data Types

Excel has only  4 or 5 native data types:

  • Numbers (which can be formatted as Dates, Times, Currency, Integers, Doubles etc, but are all held internally as floating point doubles)
  • Strings (Text including zero length strings like “”)
  • Booleans (True or False)
  • Errors (#N/A, #DIVO etc)
  • Empty (which annoyingly is only partly supported -  for instance you can’t return it from a function or a formula)

You can format all these data types in lots of different ways so that they look different, but a Cell’s underlying value is always going to be one of these types.

And unlike most Database systems Excel allows the cells in a column to contain multiple data types.
This can lead to problems: the most frequent one being a column of numbers some of which have been entered as text strings and some as real numbers. Usually you can visually see them because the numbers that are text are left-aligned in the cell and the real numbers are right-aligned.

Numbers as text can arrive in Excel in various ways:

  • Start by entering a ‘ followed by the number
  • Format the cell as text
  • Data imported from external sources

Sorting Columns containing multiple data types

When Excel sorts data contining different data types it uses this relationship between types:

Numbers<Strings<Booleans<Errors

Empty cells are always sorted last, both in Ascending and Descending sorts!
When you sort data containing numbers stored as text strings Excel asks you if you want to sort Text numbers as text or as numbers. Usually its better to sort text numbers as text rather than risk confusing any subsequent operation that relies on things being properly sorted.

Comparing Data Types

If you use a simple formula (=A6<A5)  to compare data types you get this:

DataTypes1

You can see that XYZ is >= the empty cell above it, but ABCD is <XYZ.
Numeric 1234 in A9 is less than text string 1234 in A8
A12 entered as a ‘ is a zero-length string and is > the number in A11.
The errors in A16:A18 propagate so you can’t see how Excel compares them.

So if you get rid of the error cells, sort the data and change the formula so its looking for A6>A5 you get this:

DataTypes2

So the formula comparison precedence rules are the same as the Excel sorting rules, except for empty cells!

FILTER.IFS Data Type Comparison Operators

Its useful to be able to filter by data type (although the standard Excel Filter command does not have this option), so I added some type filtering operators:

  • #ERR – filters all the error cells
  • #TXT – filters all the string/text cells
  • #N – filters all the number cells
  • #BOOL – filters all the True/False cells
  • #EMPTY – filters all the empty cells
  • #ZLS – filters all the cells containing a zero-length string
  • #TYPE – filters all the cells that have the same data type as the first cell in the filtered range
  • #BLANK – filters all the cells that contain one or more blanks or spaces

You can prefix these operators with ¬ to filter out everything that does NOT match the data type.

And you can have a list of multiple filtering operators: {“¬”,”#EMPTY”,”#ZLS”,”#BLANK”,”#ERR”} would exclude empty cells, cells with zero length strings or blanks, and cells with errors.

Here is an example:

DataTypes4

The FILTER.IFS formula is =FILTER.IFS(0,$A$6:$B$17,1,D$5) entered as an array formula (Control-Shift-Enter) and copied across.

  • The 0 says there are no sorted Criterion columns (because the type filters don’t care if the data is sorted or not).
  • $A$6:$B$17 gives the range to be filtered
  • 2 gives the column within the data range to be returned as the answer
  • 1 gives the column within the data range to be filtered using the criterion
  • D$5 gives the cell containing the criterion itself

Handling Data Types with the Relational Operators <,<=,>,>=,¬=

Suppose you create a FILTER.IFS formula  like this: =FILTER.IFS(1,$A$6:$B$17,2,1,”<1235″)

The criterion says less than 1235, but which 1235 – the numeric one or the string one or both?

I don’t think there is neccessarily a “correct” answer to this, so I invented a rule!

If the Criterion value can be converted into more than one data type (in this case a string and a number) choose the same data type as the first cell in the column.

In this case the first cell is a number, so FILTER.IFS chose to use numeric 1235, which results in a single result, the 1 from row 6.

Because the data is sorted the binary search routine has to use a single datatype, so looking for both the string 1235 and the numeric 1235 is not an option.

But if the data is NOT sorted a linear search can find both: so you can tell FILTER.IFS to compare using ALL the avialable datatypes by using an & prefix.

=FILTER.IFS(0,$A$6:$B$17,2,1,”&<1235″)

you get ALL the data which is less than numeric 1235 AND all the data that is less than string 1235.

Of course if you don’t use any of the Criterion operators and it finds only the matching data type (=FILTER.IFS(1,$A$6:$B$17,2,1,”1235″) or =FILTER.IFS(1,$A$6:$B$17,2,1,1235)

Conclusion

Using mixed data types with relational operators can be tricky – sometimes its difficult to work out what Excel is doing.
A drawback of following the same kind of syntax as SUMIFS (a string containing both the relational operator and the value) is that there is no clear datatype choice.

But I was not sure that the previous FILTER.IFS design, which could give different results for sorted and unsorted data, made sense, so I changed it so that sorted and unsorted data gave the same results and added the & prefix :
What do you think?

The SpeedTools FILTERIFS function: Design and Implementation Part 1

December 21, 2012

Excel users have been using SUMPRODUCT and array formulas to create multiple-condition formulas for many years. This is a powerful technique, but can be painfully slow with large amounts of data. Pivot Tables and Excel 2013′s PowerPivot can provide good solutions in some instances, and the introduction of SUMIFS in Excel 2007 gave a fast alternative for some scenarios.

But there is still a need for a powerful, dynamic function that can perform better than SUMPRODUCT/ array formulas, so let me introduce my attempt at creating one : FILTERIFS.

FILTERIFS Design Objectives

  • Speed of calculation – multi-threaded, exploit sorted data and clustered data.
  • Extended criterion types to include AND/OR, Lists, Wild Card Patterns, Regular Expressions, Calculated Columns and Arrays etc.
  • Dynamic calculation in the same way as other Excel functions.
  • Extend multiple condtions to many more functions by outputing an array to other functions such as SUM, MEDIAN, LISTDISTINCTS, VSORT etc, or directly as a multi-cell array formula.

The original implementation was done using a VB6 automation addin, but lack of multi-threading and 64-bit support in VB6 lead me to re-implement as a C++ XLL.

So how do you make it fast?

The idea is to process each criterion in turn using only the rows that meet all the criteria processed so far, thus avoiding the SUMPRODUCT/array formula approach of evaluating all the criteria for all the rows.
Criteria operating on sorted columns are processed first using a fast High-Low binary search modified for relational operators.
Non-sorted columns and criterion types like Regex are then processed using linear search in a sequence designed to minimise data transfer/coercion time.
And using a C++ XLL allows multi-threading and fast execution.

FILTERIFS Syntax

The syntax uses a similar approach to SUMIFS to pass the criterion as a string concatenation of a relational operator and a value. Because the value is passed as a string FILTERIFS has to do some datatype conversions of the value to match the datatype of the criterion column (and hopefully avoid some of the SUMIFS bugs in this area).

FILTERIFS( nSortedCols, InputRange, ReturnCol, CriteriaColumn1, Criteria1,
CriteriaColumn2, Criteria2, … , ["#OR#", nsortedCols,] CriteriaColumnx, Criteriax, …)

nSortedCols gives the number of columns which are sorted in the InputRange

InputRange is a range reference to the data containing the sorted columns and return column.
The data can contain a header row of names for the columns.

ReturnCol is the header name or number of the column within InputRange to return results from.

Criteria Column gives either the name/number of a column in InputRange, or a range reference to an independent column, or an array or an expression returning a column of data to be used as the criterion column.

Criteria is the expression used to filter the criterion column.
This can be a relational operator (=, >=, <=, >, <, ¬=, ~, ¬~, ~~) ( ¬ means NOT, ~ means LIKE, and ~~ means Regex) and value.
It can also be a LIST of alternatives to look for, given either as an array ({“FL”,”NY”,TX”} or with a relational operator {“~ABC*”,”~DEF*}) or as a range reference.

#OR# allows you to have multiple alternative sets of criteria.

FILTERIFS Components

To deliver this fairly complex set of capabilities the function is broken down into a number of component blocks. These are the major ones:

  • Handle any header row column names & translate column names and numbers to column indexes.
  • Parse and analyse the criterias, storing the result in an array of Criterion structures
  • Data Type detection and type-casting of the criteria values
  • Find the optimum sequence to process the criteria
  • Row-Pairs class to store first-row last-row pairs for the rows that meet the criteria. Methods for this class include Append, Merge, Condense, CountPairs, CountRows etc.
  • High-Low binary search for the sorted criteria
  • Translate High-Low to rowpairs using the relational operators
  • Determine optimum data-transfer/coercion strategy and sequencing for the non-sorted criteria
  • Linear Search on row-pairs for non-sorted criteria
  • Comparison functions for the various Criterion operators.
  • Conversion of row-pairs to results

These components currently result in just under 5000 lines of code.

FILTERIFS Status

As at December 2012 the function is coded and the first phase of testing has been completed. It has taken considerably longer than planned, mainly because the VB6 version made extensive use of EVALUATE, which turned out not to be allowed to multi-thread in C++ and so I had to redesign most of the approach for non-sorted columns.

There is still some performance testing, refactoring and rework to be done but the target is to start Beta3 in early January 2013.

(Assuming that the Mayan calendar is wrong in predicting the end of the world today Friday 21 Decembery 2012).

Writing Efficient UDFs Part 12: Getting Used Range Fast using Application Events and a Cache

December 5, 2012

In the previous post I suggested that one good way to speed up retrieval of the Used Range last row would be to use a Cache and the AfterCalculate Application event.

I have now tested this approach and it works well: here is the code for the demo function GetUsedRows3:
Option Explicit
'
' create module level array for cache
'
Dim UsedRows(1 To 1000, 1 To 2) As Variant
Public Function GetUsedRows3(theRng As Range)
' store & retrieve used range rows if Excel 2007 & later
Dim strBookSheet As String
Dim j As Long
Dim nFilled As Long
Dim nRows As Long
' create label for this workbook & sheet
strBookSheet = Application.Caller.Parent.Parent.Name & "_" & Application.Caller.Parent.Name
If Val(Application.Version) >= 12 Then
' look in cache
For j = LBound(UsedRows) To UBound(UsedRows)
If Len(UsedRows(j, 1)) > 0 Then
nFilled = nFilled + 1
If UsedRows(j, 1) = strBookSheet Then
' found
GetUsedRows3 = UsedRows(j, 2)
Exit Function
End If
Else
' exit loop at first empty row
Exit For
End If
Next j
End If
' find used rows
nRows = theRng.Parent.UsedRange.Rows.Count
'
If Val(Application.Version) >= 12 Then
' store in cache
nFilled = nFilled + 1
If nFilled <= UBound(UsedRows) Then
UsedRows(nFilled, 1) = strBookSheet
UsedRows(nFilled, 2) = nRows
End If
End If
'
GetUsedRows3 = nRows
End Function
Sub ClearCache()
'
' empty the first row of the used-range cache
'
UsedRows(1, 1) = ""
End Sub

Note: there is no error handling in this code!

Start by defining a module level array (UsedRows) with 1000 rows and 2 columns. Each row will hold a key in column 1 (book name and sheet name) and the number of rows in the used range for that sheet in that book in column 2. I have assumed that we will only cache the first 1000 worksheets containing these UDFs!
The key or label is created by concatenating the name of the parent of the calling cell (which is the worksheet) to the name of the parent of the parent of the calling cell (which is the workbook containing the sheet).
Then loop down the UsedRows array looking for the key, but exit the loop at the first empty row.

If the key is found, retrieve the number of rows in the used range from column 2, return it as the result of the function and exit the function.

Otherwise find the number of rows in the used range,  store it in the next row of the UsedRange cache and return it as the result of the function.

Only for Excel 2007 or later

You can see that the function only operates the cache for Excel 2007 and later versions. There are two reasons for this:

  • Excel 2003 and earlier have a maximum of 64K rows so finding the used range is relatively fast anyway.
  • Only Excel 2007 and later have the AfterCalculate event which will be used to empty the cache after each calculate.

We need to empty the cache after each calculate because the user might alter the used range and so the safe thing to do is to recreate the cache at each calculation.
AfterCalculate is an Application Level event which is triggered after completion of a a calculation and associated queries and refreshes. (A BeforeCalculate event would be even more useful but does not exist!)

Using the AfterCalculate Application Event.

Chip Pearson has an excellent page on Application Events. I always consult it when I need application events because I can never remember exactly how to do it!

First I added a Class Module called AppEvents with code like this:
Option Explicit
Private WithEvents App As Application
Private Sub Class_Initialize()
Set App = Application
End Sub
Private Sub App_AfterCalculate()
ClearCache
End Sub

Then I added some code to the ThisWorkbook module:

Option Explicit
Private XLAppEvents As AppEvents
Private Sub Workbook_Open()
Set XLAppEvents = New AppEvents
End Sub

This sets up the hooks that are needed for Application level events. Quite a lot of code just to run the ClearCache sub after each calculation!
ClearCache just empties the first key in the Cache so that the find loop in GetUsedRows3 exits straight away.

This code is ignored in Excel 2003 and earlier: since the AfterCalculate event does not exist it never gets called but still compiles OK.

Performance of GetUsedRows3

For 640K rows of data 1000 calls to GetUsedRows3 takes 66 milliseconds. The original CountUsedRows function took 33 seconds.
Thats a speedup factor of 500!

Writing Efficient UDFs Part 11 – Full-Column References in UDFs: Used Range is Slow

December 2, 2012

Excel users often find it convenient to use full-column references in formulas to avoid having to adjust the formulas every time new data is added. So when you write a User Defined Function (UDF) you can expect that sooner or later someone will try to use it with a full-column reference:

=MyUDF(A:A,42)

When Excel 2007 introduced the “Big Grid” with just over 1 million rows it became even more important to handle these full-column references efficiently. The standard way to handle this in a VBA UDF is to get the INTERSECT of the full-column reference and the used-range so that the UDF only has to process the part of the full-column that has actually been used. The example VBA code below does this intersection and then returns the smaller of the number of rows in the input range and the number of rows in the used range.

Public Function GetUsedRows(theRng As Range)
Dim oRng As Range
Set oRng = Intersect(theRng, theRng.Parent.UsedRange)
GetUsedRows = oRng.Rows.Count
End Function

The parent of theRng is the worksheet that contains it, so theRng.Parent.UsedRange gets the used range of the worksheet you want.

Two problems with this technique are:

  • Getting the Used Range can be slow.
  • The XLL interface does not have a direct way to access the Used Range, so you have to get it via a single-thread-locked COM call. (More on this later).

So just how slow is it to get the used Range?

I created a very simple UDF and timed the calculation of 1000 calls to this UDF for filled used ranges of between 10K rows and 640K rows.

Public Function CountUsedRows()
CountUsedRows = ActiveSheet.UsedRange.Rows.Count
End Function

It turns out that the time taken to execute this UDF is a linear function of the number of used rows in the used range.

Used_Range_Times

And its quite slow, 1000 calls to this UDF with 640K rows of data takes 33 seconds!

When the used range is small you won’t notice the time taken, but for large used ranges with the big grid you certainly will. And the problem is that your UDF will do this check on every range that is passed to the UDF, even if its not really needed.

Colin points out that what affects the time is actually the number of cells containing data or formatting (or that previously contained data or formatting) rather than the last cell in the used range.

Speeding up finding the used range.

So you could start by only doing the used-range check when theRng parameter has a large number of rows:

Public Function GetUsedRows2(theRng As Range)
Dim oRng As Range
If theRng.Rows.Count > 500000 Then
Set oRng = Intersect(theRng, theRng.Parent.UsedRange)
GetUsedRows = oRng.Rows.Count
Else
GetUsedRows = theRng.Rows.Count
End If
End Function

This example only does the check if the user gives the UDF a range referring to more than half a million rows.

Another, more complicated, way of minimising the time is to store the number of rows in the used range in a cache somewhere and retrieve it from the cache when needed. The tricky part of this is to make sure that the used-range row cache always is either empty (in which case go and get the number) or contains an up-to-date number.

One way of doing this would be to use the Application AfterCalculate event (which was introduced in Excel 2007) to empty the cache. Then only the first UDF that requested the used range for each worksheet would use time to find the used range, and (assuming that the calculation itself did nothing to alter the used range) the correct number would always be retrieved.

The equivalent for Excel versions before Excel 2007 would be to use the Application SheetCalculate event to empty the cache for that particular worksheet. This technique would be less efficient since a worsheet may well be calculated several times in each calculation cycle.

As Colin points out, if you want to find the last row containing data it is faster to use Range.Find when you have many cells containing data.
Note that you can only use Range.Find in UDFS in Excel 2002 and later, and you cannot use the Find method at all from an XLL except in a command macro or via COM.

Public Function CountUsedRows2()
CountUsedRows2 = ActiveSheet.Cells.Find(What:="*", LookIn:=xlFormulas, SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row
End Function

So have you got any better ideas on how to process full-column references efficiently?

Getting Cell data with VBA and C++: Benchmarking Excel 2013

November 9, 2012

There is a new kid on the block: Excel 2013. So its time to see how it shapes up for VBA performance against its predecessors. Along the way I will try to answer these questions:.

How does Excel 2013 VBA performance compare with previous versions?

Is 64-bit VBA faster or slower than 32-bit VBA?

Is there an optimum block size (number of cells) for getting cell data with VBA or C++?

How much faster is C++ than VBA when reading data from Cells?

Excel 2013 VBA Benchmark

Frequently the bottleneck in VBA processing time is reading and writing to Excel cells. So my Variant_Benchmark times this for a variety of cell sizes.

It turns out the relationship between time taken and the number of cells you get in a single block is pretty much linear over small ranges:
Time = overhead + number_of_cells * Time_per_cell

Running this test for on my Dell Studio XPS desktop (2.9 MHZ) for all the versions of Excel I currently have installed gives this table, (slightly overcoloured by the Excel 2013 quick analysis tool – but Quick Analysis is great for charting and formatting idiots like me).

The first 2 columns give times in milliseconds for reading and writing a single block of 512 cells, and the second 2 columns give times in milliseconds for writing a single cell 512 times.

Two things are immediately obvious:

  • Reading and writing cell-by-cell is at least an order of magnitude slower than reading and writing a large block of cells in  a single call.
  • Writing data back to Excel is much slower than reading data from Excel.

Looking at the variations by Excel versions it is interesting to see that Excel 97 is the fastest version for cell-by-cell but the slowest version for block by block.

Here is a bar chart (again produced by Excel 2013 Quick Analysis) for reading and writing a block of 512 cells:

Read times for a 512 block have been decreasing since XL 2003: and the current champion reader is Excel 2013 64-bit.
But Write times for a 512 block have been increasing since  XL 2000 – the current champion writer.

Looking at cell-by-cell you get this bar chart:

Reading and writing cell-by-cell just goes from bad (Excel 97) to worse (Excel 2010 32).
But its interesting to note that the 64-bit versions are faster than the 32-bit versions for 512 cell blocks, but slower for cell-by-cell.

Is there an optimum Block size for VBA data transfer?

Many excel versions ago (Excel 97?) I did an analysis which showed that if you increased the number of cells being read into the variant array too much the time started to increase. So I thought I would look at this again with the latest Excel versions.

So far I have not detected any decrease in efficiency: you can use as large a block of cells as you like (subject to memory limits of course).

What about C++ XLLs?

The XLL equivalent to transferring data from a range of excel cells to a variant array is the COERCE function. This has the additional benefit of letting you know if the attempt failed because of uncalculated cells.

And XLL Coerce is faster than VBA, by a factor of 2 or more.

Coerce does seem to have an optimum block size. Beyond about 1000 cells the time taken suddenly jumps by 25 to 30% in the 32-bit versions, except for XL 2013 64-bit where this does not happen.

Conclusions

  • Read/write using large Ranges is much more efficient than cell-by-cell
  • Excel 2013 VBA read/write speed is comparable to Excel 2010
  • The 64-bit versions are faster for large ranges than the 32-bit versions
  • VBA does not currently appear to have an optimum block size: the larger the range the better.
  • XLL Coerce is a lot faster than VBA
  • XLL Coerce 32-bit has an optimum block size just under 1000 cells

I have not tested .NET Interop with Excel 2013, but I expect its still the same depressing performance story. If you know anything different please let me know!

The SpeedTools Function library – Licensing, Pricing, Piracy and Ecommerce

September 13, 2012

We had a great time spending August in Norfolk (a 30-year tradition) with lots of sailing and walking.

When I got back in September I started wrestling with how to license, price, sell and fulfill the sale of the SpeedTools Function library.
(This is displacement activity from tackling coding the FILTER functions …)

Fight Scope Creep by Splitting!

The first decision was that scope creep (rapidly heading towards 100 additional Excel functions!) had got to the point where I needed to split SpeedTools into a number of products:

  • SpeedTools Calc – The extended calculation methods and options
  • SpeedTools Lookups – The fast Lookup and Comparison functions
  • SpeedTools Filters – The Filtering, Sorting and Distinct/Unique functions
  • SpeedTools Extra – the Math, Logical, Array, Text and Information functions
  • SpeedTools Premium – a bundle containing all the other 4 products

SpeedTools Calc contains all the User Interface stuff and is mostly a VBA addin, and the other 3 products are mostly contained in an XLL.

Choosing an E-Commerce provider

When I started selling FastExcel in 2001 we custom built the website and licensing system that took the orders, priced them, maintained the License database and linked to Worldpay for the credit card processing in multiple currencies, then automatically emailed out the License codes etc. The system has worked pretty well for 11 years but has some serious limitations and would need rewriting to handle multiple products.

So I started looking at the many available E-Commerce providers who handle software sales. I found a useful starting point here.
More research gave me a shortlist of Avangate and FastSpring who both seemed to tick most of my boxes:

  • Reasonable costs
  • Multiple currencies
  • Integration with licensing software
  • Multiple products
  • Volume Licensing
  • Credit Cards (including AMEX), PayPal, Purchase Orders, Money Transfer supported
  • Coupons, promotions, time-limited pricing

Licensing and Piracy

As we all know, its pretty easy to hack a VBA addin, and difficult to prevent casual copying of your XLA/XLAM product. For FastExcel I used an installer that required required an installation password, but I wanted a better defence against casual copying for SpeedTools. I decided to use a License activation system, and so looked at the systems that were already integrated (see FastSpring DRM) with my chosen E-Commerce provider FastSpring.

The system I chose can be integrated with Excel VBA and VB6 addins as well as .NET and C++ stuff. You can embed a time-limited trial license inside your code and then upgrade with one or more licenses. It uses Public/Private key encryption and you can choose what kind of machine binding you want to use.

To make this work you also need a web server for the license database that can dynamically link both to your E-Commerce provider and to your application. I did not want the cost and hassle of hosting and maintaining my own license server and so opted for a system that provides a hosted web license server for you.

If no valid license to any of the products is found then the products do not load. If there is at least one licensed product then the User Interface and Help for all the products is shown, but using for example a MEMLOOKUP function without a Lookups license will make the function return a message saying “No valid license found for this function” to the calling cells.

Is this hacker-proof? Probably not – nothing really is, but I think its more than enough to stop casual copying.

Implementing all this is quite complex because you have to integrate a number of things:

  • Your own Website to sell and link to the Ecommerce provider
  • setting up the E-commerce provider and linking it to the License activation server
  • creating an install package that delivers and installs all the requisite software and files and products
  • integrating the licensing software into your products, including the UI for managing the licenses.

Pricing

Deciding on the right price for your software product is impossible.

I started by looking at what prices other people charge for Excel addins, then considered how useful each product would be to someone who needed the function provided, and how much time and effort had been used to create them. For most of these products I could not find any real competitive products.

This gave me a range of base prices for the different products (currently $29 (Calc) to $69 (Lookups), but this might change!).

The next step was to consider volume discounts. There are 2 main scenarios to consider here for a function library:

  • An individual who wants a license that covers work and home (so 2 PCs)
  • Workbooks that will get used by a number of different people on different Machines.

My current thinking is to start with a very steep discount for 2 licenses (so that buying 2 costs only 30% more than buying 1) and then increase the discount progressively for larger volumes.

Conclusion

It has taken me about 2 weeks work so far, and there is probably another week needed to finish, but thats not too bad considering the complexity of the task and that (hopefully) the system will be in use for the next ten years or so.

So what level do you think I should price at?

And how do you feel about license activation?

 

Comparing Two Lists – VBA UDF shootout between Linear Search, Binary Search, Collection and Dictionary

July 10, 2012

I often run into the problem of having to compare two lists in Excel, to see what items are in the list to look for that can’t be found in the list to look in.

So when I saw Dick Kusleika’s post on Daily Dose of Excel I thought it was time I got my act together and worked out the best way of tackling this problem.

I decided to write a VBA UDF called IsInList2 that called 6 methods:

  • Sorting the the LookIn list and using Binary Search to compare the items in the LookFor list
  • Using linear search of the LookIn list for each item in the LookFor list
  • Create a Collection containing the LookIn list and check it for each item in the LookFor list
  • Create a Dictionary containing the LookIn list and check it for each item in the LookIn list
  • Use an already sorted LookIn List and Binary Search
  • Look for partial matches using InStr

Overview of the IsInList 2 UDF

The IsInList2 UDF is written as an array function returning an array of True/False. Its designed to be called as a multi-cell array function entered in the column next to the LookFor list, so that you can Filter for False to find all the items in the LookFor list that don’t exits in the LookIn list.

For simplicity the UDF is written assuming that both lists are Ranges with at least 2 items: so the first task is to get the values from the Ranges into variant arrays.
(I have ignored the optimisation of using MATCH directly on the Range).

Then the output array is created as the smaller of the calling cells and the the LookFor list

Then if its an exact match the data is either Sorted, added to a Collection or to a Dictionary

Then the function iterates through the LookFor list using the appropriate method subroutine and stores the result in the output array

The QuickSort Sub

Here is the code for a QuickSort of a variant containing an array. The Quicksort would be substantially faster if it was strongly typed and handled a vector rather than a 1-column matrix.

Sub QSortVar(InputValues As Variant, jStart As Long, jEnd As Long)
Dim jStart2 As Long
Dim jEnd2 As Long
Dim v1 As Variant
Dim v2 As Variant
jStart2 = jStart
jEnd2 = jEnd

‘ choose random pivot

v1 = InputValues((jStart + (jEnd – jStart) * Rnd()), 1)
While jStart2 < jEnd2
While InputValues(jStart2, 1) < v1 And jStart2 < jEnd
jStart2 = jStart2 + 1
Wend
While InputValues(jEnd2, 1) > v1 And jEnd2 > jStart
jEnd2 = jEnd2 – 1
Wend
If jStart2 < jEnd2 Then
v2 = InputValues(jStart2, 1)
InputValues(jStart2, 1) = InputValues(jEnd2, 1)
InputValues(jEnd2, 1) = v2
End If
If jStart2 <= jEnd2 Then
jStart2 = jStart2 + 1
jEnd2 = jEnd2 – 1
End If
Wend
If jEnd2 > jStart Then QSortVar InputValues, jStart, jEnd2
If jStart2 < jEnd Then QSortVar InputValues, jStart2, jEnd
End Sub

The Binary Search Function

For simplicity this function is also written to handle a variant containing an array, and would be faster if more strongly typed.
Its not designed as a UDF: its called from IsInList2.

Function BSearchMatch(LookupValue As Variant, LookupArray As Variant) As Boolean

‘ use binary search to find if lookupvalue exists in lookuparray

Dim low As Long
Dim high As Long
Dim middle As Long
Dim varMiddle As Variant
Dim jRow As Long

jRow = 1
BSearchMatch = False
low = 0
high = UBound(LookupArray)
Do While high – low > 1
middle = (low + high) \ 2
varMiddle = LookupArray(middle, 1)
If varMiddle >= LookupValue Then
high = middle
Else
low = middle
End If
Loop

If (low > 0 And high <= UBound(LookupArray)) Then
If LookupArray(high, 1) > LookupValue Then
jRow = low
Else
jRow = high
End If
End If
If LookupValue = LookupArray(jRow, 1) Then BSearchMatch = True
End Function

The Exact Linear Search Function

LMatchExactV is not designed as a UDF: its called from IsInList2.

Function LMatchExactV(LookupValue As Variant, LookupArray As Variant) As Boolean

‘ use linear search to find if LookupValue exists in LookupArray
‘ LookupArray must be a a 2-dimensional variant array of N rows and 1 column

Dim j As Long

LMatchExactV = False
For j = 1 To UBound(LookupArray)
If LookupValue = LookupArray(j, 1) Then
LMatchExactV = True
Exit For
End If
Next j
End Function

The Partial Match Linear Search Function

This function use InStr to check for partial match. It is not designed as a UDF: its called from IsInList2.

Function LMatchInV(LookupValue As Variant, LookupArray As Variant) As Boolean

‘ use linear search and Instr to find if LookupValue is within any value in LookupArray
‘ LookupArray must be a a 2-dimensional variant array of N rows and 1 column

Dim j As Long
Dim strLook As String

LMatchInV = False
strLook = CStr(LookupValue)
For j = 1 To UBound(LookupArray)
If InStr(LookupArray(j, 1), strLook) > 0 Then
LMatchInV = True
Exit For
End If
Next j
End Function

The IsInList2 Array Function

Because the function uses the Dictionary Object from VBScript you need to add a reference to the Microsoft Scripting Runtime.

The Function takes 2 optional parameters which control the method used:

jSorted – which Sort/Lookup?Match method to use

FindExact – True for an exact match, False for a partial match

Public Function IsInList2(LookFor As Variant, LookIn As Variant, Optional jSorted As Long = 0, Optional FindExact As Boolean = True)

‘ jSorted
‘   =0 data not sorted but sort it
‘   =1 data already sorted – use binary search
‘   =-1 use linear search
‘   =2 use collection
‘   =3 use dictionary

Dim nLookFor As Long
Dim nLookIn As Long
Dim nOut As Long
Dim vOut() As Variant
Dim j As Long
Dim coll As New Collection
Dim dict As New dictionary

‘ coerce ranges to values

LookFor = LookFor.Value2
LookIn = LookIn.Value2

‘ get row counts

nLookFor = UBound(LookFor)
nLookIn = UBound(LookIn)
nOut = Application.Caller.Rows.Count

If nLookFor < nOut Then nOut = nLookFor
ReDim vOut(nOut, 1)  ‘ create output array

If FindExact Then
If jSorted = 0 Then
QSortVar LookIn, LBound(LookIn), UBound(LookIn)    ‘ quicksort
jSorted = 1
ElseIf jSorted = 2 Then
On Error Resume Next
For j = 1 To nLookIn
coll.Add LookIn(j, 1), CStr(LookIn(j, 1))   ‘ collection
Next j
ElseIf jSorted = 3 Then
On Error Resume Next
For j = 1 To nLookIn
dict.Add LookIn(j, 1), LookIn(j, 1)    ‘ dictionary
Next j
End If
On Error GoTo 0
End If

For j = 1 To nOut
If Not FindExact Then
vOut(j, 1) = LMatchInV(LookFor(j, 1), LookIn)    ‘instr linear search
ElseIf jSorted = 1 Then
vOut(j, 1) = BSearchMatch(LookFor(j, 1), LookIn)    ‘ binary search
ElseIf jSorted = 2 Then

‘ use collection

On Error Resume Next
Err.Clear
vOut(j, 1) = coll.Item(CStr(LookFor(j, 1)))
If CLng(Err.Number) = 5 Then
vOut(j, 1) = False
Else
vOut(j, 1) = True
End If
ElseIf jSorted = 3 Then
vOut(j, 1) = dict.Exists(LookFor(j, 1))    ‘ Dictionary
ElseIf jSorted = -1 Then
vOut(j, 1) = LMatchExactV(LookFor(j, 1), LookIn)    ‘ exact linear
Else
vOut(j, 1) = CVErr(xlErrValue)
End If
Next j

IsInList2 = vOut    ‘ return output
End Function

Performance Comparison

The performance tests are all done using ranges containing a character followed by 5 digit random integer numbers.

The timings are all in Milliseconds.
LookFor and LookIn give the number of rows.

Linear Search timings on smaller numbers of rows are slightly pessimistic because of a high % of miss-matches
BSearchOnly gives timings for Binary Search on already sorted data.
Partial gives timings for Linear Search with partial matching using InStr.

Because Excel VBA array formulas cannot return more than 64K rows you cannot use the function directly for more than 64K rows of LookFor without using more than one array formula.

LookFor

LookIn

Linear

QSortBsearch

Collection

Dict

BsearchOnly

Partial

2 2

0.29

0.29

0.29

0.46

0.29

0.29

50 50

4.42

4.14

4.05

4.06

3.91

4.43

200 200

12.2

5.59

5.08

4.40

4.49

14.2

500 500

55.5

8.9

7.3

4.9

5.66

68.0

2000 2000

824

27.0

19.9

8.03

12.2

995

50000 50000

-

800

515

252

279

-

50000 200000

-

2583

1290

1044

350

-

50000 1048756

-

14204

6720

23650

-

-

Overall the sequence from fastest method to slowest method is:

  • Dictionary
  • Binary Search Only (but the data has to already be sorted)
  • Collection
  • Quick Sort plus Binary Search
  • Linear Search
  • Partial Linear Search

Conclusion

The best method is to use the VBScript Dictionary object, unless you have more than 500K rows after which the Collection object starts to win.

The coding for Dictionary is very simple and it has an Exists method which Collection lacks.

Below about 2000 rows you probably won’t notice the difference between any of the methods except for Linear search.

So now I just have to try the C++ XLL version … I reckon that will beat Dictionary! (Yes, looks like XLL wins: 1923 millisecs for 50000 in 1048756)

From VBA to C Part 7 – Developing an XLL array function

July 7, 2012

In a post last year I showed how to develop an array version of the VBA AverageTol UDF.
This was a simple and not very useful extensionto the AverageTol function that allowed the function to return an array of results that corresponded to an array of tolerances, here is a slightly more sophisticated version that returns rows or columns depending on the shape of the calling cells:

The VBA Array UDF

Public Function AverageTolM(theRange As Range, theTols As Range) As Variant
Dim vArr As Variant
Dim vArrTols As Variant
Dim nTols As Long
Dim nRowsOut As Long
Dim nColsOut As Long
Dim v As Variant
Dim vt As Variant
Dim d As Double
Dim r As Double
Dim j As Long
Dim k As Long
Dim vOut() As Variant
Dim dTol As Double
Dim lCount As Long
On Error GoTo FuncFail
vArr = theRange.Value2
vArrTols = theTols.Value2
nRowsOut = Application.Caller.Rows.Count
nColsOut = Application.Caller.Columns.Count
'
' create output array Dimmed as calling cells
'
ReDim vOut(1 To nRowsOut, 1 To nColsOut)
'
' initialise to #N/A
'
For j = 1 To nRowsOut
For k = 1 To nColsOut
vOut(j, k) = xlErrNA
Next k
Next j
nTols = UBound(vArrTols)
If UBound(vArrTols, 2) > nTols Then nTols = UBound(vArrTols, 2)
On Error GoTo skip
k = 0
'
' loop on each Tolerance
'
For Each vt In vArrTols
dTol = CDbl(vt)
r = 0#
lCount = 0
k = k + 1
If k > nRowsOut And k > nColsOut Then Exit For
For Each v In vArr
d = CDbl(v)
If Abs(d) > dTol Then
r = r + d
lCount = lCount + 1
End If
skip:
Next v
If r > 0# And lCount > 0 Then
'
' output results vertically or horizontally
' to match calling cells
'
If nRowsOut >= nColsOut Then
vOut(k, 1) = r / lCount
Else
vOut(1, k) = r / lCount
End If
End If
Next vt
'
AverageTolM = vOut
Exit Function
FuncFail:
AverageTolM = CVErr(xlErrNA)
End Function

The XLL Array Function

So today I will do the same thing with the C++ version.

First we need to change the UDF definition (called a signature in C++) so that we can get an array of Tolerances. To keep things simple I am using a vector of doubles for this:

I have also changed the return type to CXlOper to allow the UDF to return Excel error values as well as doubles.

RW12 nRowsIn=0,nRowsOut=0, nTols=0;
COL12 nColsIn=0, nColsOut=0;
double dTemp=0.0,dAvTol=0.0;
RW12 Counter=0;
//
// get Data dimensions
theNumbers->GetDims(nRowsIn,nColsIn);
// get Tolerances count
nTols=theTolerance.size();
//
// get caller dimensions; return #N/A if fails
if (!CXllApp::GetCallerDims(nRowsOut,nColsOut))
return CXlOper::RetError(xlerrNA);
// create output array dimensioned same as caller, filled with #N/A
xloResult.AllocArray(nRowsOut,nColsOut,xlerrNA);
//
// loop on Tolerances
for (RW12 k=0; k<nTols; k++) {
if (k>==nRowsOut && k>=nColsOut) break;
//
// loop thru the value array
dAvTol=0.0;
Counter=0;
for (RW12 i=0; i<nRowsIn; i++) {
dTemp=theNumbers->Cell(i,0);
if (fabs(dTemp)>theTolerance[k]) {
dAvTol += dTemp;
Counter++;
}
}
// output results horizontally or vertically
// to match caller orientation
if (dAvTol != 0.0 && Counter != 0)
nRowsOut>nColsOut ? xloResult.Cell(k,0)= dAvTol/Counter : xloResult.Cell(0,k)= dAvTol/Counter;
}
return xloResult.Ret();
}

New things to notice in this function:

  • The equivalent of using Application.Caller to get the dimensions of the calling range is CXllApp::GetCallerDims
  • I am using AllocArray to create an xlOper result array of the same dimensions as the calling range and initialised to #N/A
  • An easy alternative is to dimension the xlOper result array to the same dimensions as the theTolerance array and allow Excel to truncate the output or pad with #N/A. But that does not work when you want to control what to use as the outout array padding character.
  • break is the C++ equivalent to Exit For
  • ? is the C++ Conditional Operator, sort of like IIF in VBA (but faster!):
    logical_expression ? expression_if_true : expression_if_false ;
  • Using the function for an array of 5 tolerances takes 15 millisecs compared to 26 millisecs for 5 separate calls to the non-array version

Conclusion

  • Its fairly straightforward to create XLL functions that return arrays
  • The array function performs better than the non-array version, mainly because the data is only passed as an array-type Oper (Value type P) once for multiple results.

Follow

Get every new post delivered to your Inbox.

Join 34 other followers