@FastGTPSA
/@FastGTPSA!
Macros
Speed up evaluation of expressions containing TPSs, transparent to other types
The thread-safe macros @FastGTPSA
/@FastGTPSA!
can be used to speed up evaluation of expressions that may contain TPS
s. Both macros are completely transparent to all other types, so they can be prepended to any existing expressions while still maintaining type-generic code. Any functions in the expression that are not overloaded by GTPSA will be ignored. Both macros do not use any -ffast-math
business (so still IEEE compliant), but instead will use a thread-safe pre-allocated buffer in the Descriptor
for any temporaries that may be generated during evaluation of the expression.
The first macro, @FastGTPSA
can be prepended to an expression following assignment (=
, +=
, etc) to only construct one TPS
(which requires two allocations), instead of a TPS
for every temporary:
julia> using GTPSA, BenchmarkTools
julia> d = Descriptor(3, 7); x = vars(d);
julia> @btime $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
5.871 μs (20 allocations: 11.88 KiB)
julia> @btime @FastGTPSA $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
5.482 μs (2 allocations: 1.94 KiB)
julia> y = rand(3); # transparent to non-TPS types
julia> @btime $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
20.067 ns (0 allocations: 0 bytes)
julia> @btime @FastGTPSA $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
21.274 ns (0 allocations: 0 bytes)
The second macro, @FastGTPSA!
can be prepended to the LHS of an assignment, and will fill a preallocated TPS
with the result of an expression. @FastGTPSA!
will calculate a TPS
expression with zero allocations, and will still have no impact if a non-TPS type is used. The only requirement is that all symbols in the expression are defined:
julia> t = ComplexTPS64(); # pre-allocate
julia> @btime @FastGTPSA! $t = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
5.338 μs (0 allocations: 0 bytes)
julia> y = rand(3); @gensym z; # transparent to non-TPS types
julia> @btime @FastGTPSA! $z = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
20.057 ns (0 allocations: 0 bytes)
Both @FastGTPSA
and @FastGTPSA!
can also be prepended to a block of code, in which case they are applied to each assignment in the block:
julia> d = Descriptor(3, 7); x = vars(d);
julia> y = rand(3);
julia> @btime @FastGTPSA begin t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; t2 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; z = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; end;
11.131 μs (4 allocations: 3.88 KiB)
julia> t3 = ComplexTPS64(); t4 = ComplexTPS64(); @gensym w;
julia> @btime @FastGTPSA! begin $t3 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; $t4 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im; $w = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im; end;
10.861 μs (0 allocations: 0 bytes)
@FastGTPSA
and @FastGTPSA!
are also compatible with broadcasted, vectorized operators. This can be very useful for example if a structure-of-arrays (SoA) layout is used in a simulation program, but you would like to calculate a Taylor map of the output by propagating GTPSAs through. Note that for @FastGTPSA
, which allocates the result, that there are two allocations per TPS
and one for the Array
. For @FastGTPSA!
, which requires a pre-allocated output, there are zero allocations. Both are still transparent to non-TPS
types for universally polymorphic use without cost.
julia> @btime @FastGTPSA begin out = @. $x^3*sin($y)/log(2+$x)-exp($x*$y)*im; end;
14.436 μs (7 allocations: 5.89 KiB)
julia> out = zeros(ComplexTPS64, 3); # pre-allocate
julia> @btime @FastGTPSA! begin @. $out = $x^3*sin($y)/log(2+$x)-exp($x*$y)*im; end;
14.046 μs (0 allocations: 0 bytes)
If errors are thrown during usage of @FastGTPSA
/@FastGTPSA!
, temporaries in use may not have been properly popped from the stack. Therefore, if the Julia session is not terminated, the helper functions GTPSA.checktemps
and `GTPSA.cleartemps! should be used to evaluate and correct the state of the buffer.
Without using the macro, each time an operation is performed using a TPS, a new TPS is dynamically-allocated containing the result. For example in the above expression, the calculation of sin(x[2])
creates a new TPS, and the calculation of x[1]^3
also creates a new TPS. The multiplication of these two resulting TPSs creates a new TPS, and so on until a TPS containing the full result of the evaluated expression is obtained. The intermediate TPSs that must be created to evaluate the expression are referred to as temporaries, because they only exist temporarily. Julia's garbage collector notices when the dynamically-allocated temporaries are no longer in scope, and cleans up that memory when it decides it's a good time to. This process can cause slowdowns in performance critical code however, especially in more complicated expressions where a lot of temporaries are created.
Both @FastGTPSA
and @FastGTPSA!
basically tell the code to instead use a permanent, pre-allocated thread-safe buffer of TPSs for the temporaries during evaluation of the expression, so there is no dynamic memory allocation; for @FastGTPSA
, the number of allocations is reduced to two (which is one single TPS
), and for @FastGTPSA!
the number of allocations is zero. Furthermore, these temporaries are accessed and deleted in a stack-like manner from the buffer by each thread, so that temporaries involved in operations are right next to each other in memory. This ensures minimal cache misses throughout the evaluation of the expression.
The speedup of using the macro can be quite significant. See our example, where we observe a x2 speedup at 2nd order.
Note on Julia Versions < v1.10
In order to support vectorized/broadcasted operations using @FastGTPSA!
with zero allocations, setindex!
was overloaded for Array{<:TPS}
types. This can lead to massive speedups in calculating Taylor maps in simulations using a structure-of-arrays layout of memory. However, it turns out that on Julia versions < v1.10, merely overloading this function causes allocations when using the macro. It currently is not understood why (seems like a bug in previous Julia versions), however we therefore strongly recommend using Julia version >= v1.10 with this package. If you are stuck with Julia v1.9, then you can use GTPSA v1.1.1, which is the latest version that does not have this particular overload. With this, you cannot do broadcasted operators using the macros, but you will have zero allocations for non-vector TPS
operations.
Documentation
GTPSA.@FastGTPSA
— Macro@FastGTPSA(expr_or_block)
Macro to speed up evaluation of mathematical expressions containing TPSs. The temporaries generated during evaluation of the expression are drawn from a thread-safe buffer, reducing the number of heap allocations to 2 (which is for a single TPS) for the result. @FastGTPSA is completely transparent to all other types, so it can be prepended to expressions while still maintaining type-generic code.
julia> using GTPSA, BenchmarkTools
julia> d = Descriptor(3,7); x = vars(d);
julia> t = @btime $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
3.458 μs (20 allocations: 11.88 KiB)
julia> t = @btime @FastGTPSA $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
3.172 μs (2 allocations: 1.94 KiB)
@FastGTPSA
can also be prepended to a block of code, in which case it is applied after every =
sign in the block:
julia> @btime @FastGTPSA begin
t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
t2 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
a = t1+t2
L = 5+3*exp(7)
end
6.317 μs (6 allocations: 5.81 KiB)
Broadcasting is also compatible with @FastGTPSA
(note two allocations per TPS
and one allocation per Array
):
julia> using GTPSA, BenchmarkTools
julia> d = Descriptor(3, 7); x = vars(d); y = rand(3);
julia> @btime @FastGTPSA begin
out = @. $x^3*sin($y)/log(2+$x)-exp($x*$y)*im;
end;
7.573 μs (7 allocations: 5.89 KiB)
GTPSA.@FastGTPSA!
— Macro@FastGTPSA!(expr_or_block)
Macro to speed up evaluation of mathematical expressions that may contain assignment to pre-allocated TPS
s. The temporaries generated during evaluation of the expression are drawn from a thread-safe buffer. With this macro, the number of heap allocations during evaluation of expressions containing TPS
s is 0, and it is fully transparent to non-TPS types.
This macro supports assignment using =
, +=
, -=
, *=
, /=
, and ^=
.
Important: The symbols must be defined prior to calling the macro regardless of whether or not the type is a TPS
:
julia> using GTPSA, BenchmarkTools
julia> d = Descriptor(3,7); x = vars(d);
julia> t = ComplexTPS64(); # Pre-allocate
julia> @btime @FastGTPSA! $t = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
2.972 μs (0 allocations: 0 bytes)
julia> @btime @FastGTPSA! $t ^= $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
6.550 μs (0 allocations: 0 bytes)
julia> y = rand(3); z = 2; # transparent to non-TPS types
julia> @btime @FastGTPSA! $z = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
11.344 ns (0 allocations: 0 bytes)
Like @FastGTPSA
, @FastGTPSA!
can prepended to a block of code, in which case it is applied before every line in the block containing assignment:
julia> t1 = zero(ComplexTPS64); t2 = zero(ComplexTPS64); z = 0;
julia> @btime @FastGTPSA! begin
$t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
$t2 -= $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
$z += 7
end
5.965 μs (0 allocations: 0 bytes)
Broadcasting is also compatible with @FastGTPSA!
:
julia> using GTPSA, BenchmarkTools
julia> d = Descriptor(3, 7); x = vars(d); y = rand(3);
julia> out = zeros(ComplexTPS64, 3) # pre-allocate
julia> @btime @FastGTPSA! begin
@. $out = $x^3*sin($y)/log(2+$x)-exp($x*$y)*im;
end;
7.312 μs (0 allocations: 0 bytes)
GTPSA.cleartemps!
— FunctionGTPSA.cleartemps!(d::Descriptor=GTPSA.desc_current)
Clears the "stack" of temporaries currently in use by the Descriptor
. This is necessary to run if GTPSA.checktemps(d::Descriptor=GTPSA.desc_current)
returns false
; this occurs if an error is thrown during evaluation of an expression using @FastGTPSA
or @FastGTPSA!
, and the Julia session is not terminated.
GTPSA.checktemps
— FunctionGTPSA.checktemps(d::Descriptor=GTPSA.desc_current)
Sanity check of the temporary buffer in the Descriptor
used by @FastGTPSA. Returns true
if everything is OK, else false
in which case GTPSA.cleartemps!(d::Descriptor=GTPSA.desc_current)
should be run. This may occur if an error is thrown during evaluation of an expression using @FastGTPSA
or @FastGTPSA!
.