@FastGTPSA/@FastGTPSA! Macros

Speed up evaluation of expressions containing TPSs, transparent to other types

The thread-safe macros @FastGTPSA/@FastGTPSA! can be used to speed up evaluation of expressions that may contain TPSs. Both macros are completely transparent to all other types, so they can be prepended to any existing expressions while still maintaining type-generic code. Any functions in the expression that are not overloaded by GTPSA will be ignored. Both macros do not use any -ffast-math business (so still IEEE compliant), but instead will use a thread-safe pre-allocated buffer in the Descriptor for any temporaries that may be generated during evaluation of the expression.

The first macro, @FastGTPSA can be prepended to an expression following assignment (=, +=, etc) to only construct one TPS (which requires two allocations), instead of a TPS for every temporary:

julia> using GTPSA, BenchmarkTools
julia> d = Descriptor(3, 7); x = vars(d);
julia> @btime $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; 6.180 μs (20 allocations: 11.88 KiB)
julia> @btime @FastGTPSA $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; 5.449 μs (2 allocations: 1.94 KiB)
julia> y = rand(3); # transparent to non-TPS types
julia> @btime $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; 20.067 ns (0 allocations: 0 bytes)
julia> @btime @FastGTPSA $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; 21.464 ns (0 allocations: 0 bytes)

The second macro, @FastGTPSA! can be prepended to the LHS of an assignment, and will fill a preallocated TPS with the result of an expression. @FastGTPSA! will calculate a TPS expression with zero allocations, and will still have no impact if a non-TPS type is used. The only requirement is that all symbols in the expression are defined:


julia> t = ComplexTPS64(); # pre-allocate
julia> @btime @FastGTPSA! $t = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; 5.407 μs (0 allocations: 0 bytes)
julia> y = rand(3); @gensym z; # transparent to non-TPS types
julia> @btime @FastGTPSA! $z = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; 22.261 ns (0 allocations: 0 bytes)

Both @FastGTPSA and @FastGTPSA! can also be prepended to a block of code, in which case they are applied to each assignment in the block:


julia> d = Descriptor(3, 7); x = vars(d);
julia> y = rand(3);
julia> @btime @FastGTPSA begin t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; t2 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; z = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; end; 11.011 μs (4 allocations: 3.88 KiB)
julia> t3 = ComplexTPS64(); t4 = ComplexTPS64(); @gensym w;
julia> @btime @FastGTPSA! begin $t3 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; $t4 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; $w = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; end; 10.750 μs (0 allocations: 0 bytes)

@FastGTPSA and @FastGTPSA! are also compatible with broadcasted, vectorized operators. This can be very useful for example if a structure-of-arrays (SoA) layout is used in a simulation program, but you would like to calculate a Taylor map of the output by propagating GTPSAs through. Note that for @FastGTPSA, which allocates the result, that there are two allocations per TPS and one for the Array. For @FastGTPSA!, which requires a pre-allocated output, there are zero allocations. Both are still transparent to non-TPS types for universally polymorphic use without cost.


julia> @btime @FastGTPSA begin out = @. $x^3*sin($y)/log(2+$x)-exp($x*$y)*im; end; 14.287 μs (7 allocations: 5.89 KiB)
julia> out = zeros(ComplexTPS64, 3); # pre-allocate
julia> @btime @FastGTPSA! begin @. $out = $x^3*sin($y)/log(2+$x)-exp($x*$y)*im; end; 14.046 μs (0 allocations: 0 bytes)

If errors are thrown during usage of @FastGTPSA/@FastGTPSA!, temporaries in use may not have been properly popped from the stack. Therefore, if the Julia session is not terminated, the helper functions GTPSA.checktemps and `GTPSA.cleartemps! should be used to evaluate and correct the state of the buffer.

Without using the macro, each time an operation is performed using a TPS, a new TPS is dynamically-allocated containing the result. For example in the above expression, the calculation of sin(x[2]) creates a new TPS, and the calculation of x[1]^3 also creates a new TPS. The multiplication of these two resulting TPSs creates a new TPS, and so on until a TPS containing the full result of the evaluated expression is obtained. The intermediate TPSs that must be created to evaluate the expression are referred to as temporaries, because they only exist temporarily. Julia's garbage collector notices when the dynamically-allocated temporaries are no longer in scope, and cleans up that memory when it decides it's a good time to. This process can cause slowdowns in performance critical code however, especially in more complicated expressions where a lot of temporaries are created.

Both @FastGTPSA and @FastGTPSA! basically tell the code to instead use a permanent, pre-allocated thread-safe buffer of TPSs for the temporaries during evaluation of the expression, so there is no dynamic memory allocation; for @FastGTPSA, the number of allocations is reduced to two (which is one single TPS), and for @FastGTPSA! the number of allocations is zero. Furthermore, these temporaries are accessed and deleted in a stack-like manner from the buffer by each thread, so that temporaries involved in operations are right next to each other in memory. This ensures minimal cache misses throughout the evaluation of the expression.

The speedup of using the macro can be quite significant. See our example, where we observe a x2 speedup at 2nd order.

Note on Julia Versions < v1.10

In order to support vectorized/broadcasted operations using @FastGTPSA! with zero allocations, setindex! was overloaded for Array{<:TPS} types. This can lead to massive speedups in calculating Taylor maps in simulations using a structure-of-arrays layout of memory. However, it turns out that on Julia versions < v1.10, merely overloading this function causes allocations when using the macro. It currently is not understood why (seems like a bug in previous Julia versions), however we therefore strongly recommend using Julia version >= v1.10 with this package. If you are stuck with Julia v1.9, then you can use GTPSA v1.1.1, which is the latest version that does not have this particular overload. With this, you cannot do broadcasted operators using the macros, but you will have zero allocations for non-vector TPS operations.

Documentation

GTPSA.@FastGTPSAMacro
@FastGTPSA(expr_or_block)

Macro to speed up evaluation of mathematical expressions containing TPSs. The temporaries generated during evaluation of the expression are drawn from a thread-safe buffer, reducing the number of heap allocations to 2 (which is for a single TPS) for the result. @FastGTPSA is completely transparent to all other types, so it can be prepended to expressions while still maintaining type-generic code.

julia> using GTPSA, BenchmarkTools

julia> d = Descriptor(3,7); x = vars(d);

julia> t = @btime $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
  3.458 μs (20 allocations: 11.88 KiB)

julia> t = @btime @FastGTPSA $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
  3.172 μs (2 allocations: 1.94 KiB)

@FastGTPSA can also be prepended to a block of code, in which case it is applied after every = sign in the block:

julia> @btime @FastGTPSA begin
       t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
       t2 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
       a = t1+t2
       L = 5+3*exp(7)
       end
  6.317 μs (6 allocations: 5.81 KiB)

Broadcasting is also compatible with @FastGTPSA (note two allocations per TPS and one allocation per Array):

julia> using GTPSA, BenchmarkTools

julia> d = Descriptor(3, 7); x = vars(d); y = rand(3);

julia> @btime @FastGTPSA begin
        out = @. $x^3*sin($y)/log(2+$x)-exp($x*$y)*im;
       end;
  7.573 μs (7 allocations: 5.89 KiB)
source
GTPSA.@FastGTPSA!Macro
@FastGTPSA!(expr_or_block)

Macro to speed up evaluation of mathematical expressions that may contain assignment to pre-allocated TPSs. The temporaries generated during evaluation of the expression are drawn from a thread-safe buffer. With this macro, the number of heap allocations during evaluation of expressions containing TPSs is 0, and it is fully transparent to non-TPS types.

This macro supports assignment using =, +=, -=, *=, /=, and ^=.

Important: The symbols must be defined prior to calling the macro regardless of whether or not the type is a TPS:

julia> using GTPSA, BenchmarkTools

julia> d = Descriptor(3,7); x = vars(d); 

julia> t = ComplexTPS64(); # Pre-allocate

julia> @btime @FastGTPSA! $t = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
  2.972 μs (0 allocations: 0 bytes)

julia> @btime @FastGTPSA! $t ^= $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
  6.550 μs (0 allocations: 0 bytes)

julia> y = rand(3); z = 2; # transparent to non-TPS types 

julia>  @btime @FastGTPSA! $z = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
  11.344 ns (0 allocations: 0 bytes)

Like @FastGTPSA, @FastGTPSA! can prepended to a block of code, in which case it is applied before every line in the block containing assignment:

julia> t1 = zero(ComplexTPS64); t2 = zero(ComplexTPS64); z = 0; 

julia> @btime @FastGTPSA! begin
       $t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
       $t2 -= $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
       $z += 7
       end
  5.965 μs (0 allocations: 0 bytes)

Broadcasting is also compatible with @FastGTPSA!:

julia> using GTPSA, BenchmarkTools

julia> d = Descriptor(3, 7); x = vars(d); y = rand(3);

julia> out = zeros(ComplexTPS64, 3) # pre-allocate

julia> @btime @FastGTPSA! begin
        @. $out = $x^3*sin($y)/log(2+$x)-exp($x*$y)*im;
       end;
  7.312 μs (0 allocations: 0 bytes)
source
GTPSA.cleartemps!Function
GTPSA.cleartemps!(d::Descriptor=GTPSA.desc_current)

Clears the "stack" of temporaries currently in use by the Descriptor. This is necessary to run if GTPSA.checktemps(d::Descriptor=GTPSA.desc_current) returns false; this occurs if an error is thrown during evaluation of an expression using @FastGTPSA or @FastGTPSA!, and the Julia session is not terminated.

source
GTPSA.checktempsFunction
GTPSA.checktemps(d::Descriptor=GTPSA.desc_current)

Sanity check of the temporary buffer in the Descriptor used by @FastGTPSA. Returns true if everything is OK, else false in which case GTPSA.cleartemps!(d::Descriptor=GTPSA.desc_current) should be run. This may occur if an error is thrown during evaluation of an expression using @FastGTPSA or @FastGTPSA!.

source