This file lists projects suitable for volunteers. Please see the tasks file for smaller tasks.
If you want to work on any of the projects below, please let us know at http://groups.google.co.uk/group/mpir-devel/. If you want to help with a project that somebody else is already working on, we will help you get in touch. (There are no email addresses of volunteers below, due to spamming problems.)
The current multiplication code uses Karatsuba, 3-way Toom, and Fermat FFT. Several new developments are desirable:
mpf_mul
and by Newton approximations, a low half partial product might be of use in a future sub-quadratic REDC. On small sizes a partial product will be faster simply through fewer cross-products, similar to the way squaring is faster. But work by Thom Mulders shows that for Karatsuba and higher order algorithms the advantage is progressively lost, so for large sizes partial products turn out to be no faster.
Another possibility would be an optimized cube. In the basecase that should definitely be able to save cross-products in a similar fashion to squaring, but some investigation might be needed for how best to adapt the higher-order algorithms. Not sure whether cubing or further small powers have any particularly important uses though.
Write new and improve existing assembly routines. The tests/devel programs and the tune/speed.c and tune/many.pl programs are useful for testing and timing the routines you write. See the README files in those directories for more information.
Please make sure your new routines are fast for these three situations:
The most important routines are mpn_addmul_1, mpn_mul_basecase and mpn_sqr_basecase. The latter two don't exist for all machines, while mpn_addmul_1 exists for almost all machines.
Standard techniques for these routines are unrolling, software pipelining, and specialization for common operand values. For machines with poor integer multiplication, it is often possible to improve the performance using floating-point operations, or SIMD operations such as MMX or Sun's VIS.
Using floating-point operations is interesting but somewhat tricky. Since IEEE double has 53 bits of mantissa, one has to split the operands in small pieces, so that no result is greater than 2^53. For 32-bit computers, splitting one operand into 16-bit pieces works. For 64-bit machines, one operand can be split into 21-bit pieces and the other into 32-bit pieces. (A 64-bit operand can be split into just three 21-bit pieces if one allows the split operands to be negative!)
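As an illustrative sketch (not GMP code), here is the 32-bit case in C: one operand is split into 16-bit pieces so every partial product stays below 2^48, comfortably within the 53-bit mantissa.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: multiply two 32-bit limbs exactly in IEEE double
   arithmetic by splitting one operand into 16-bit pieces, so each
   partial product stays below 2^48 and is therefore exact. */
uint64_t mul32_via_double(uint32_t a, uint32_t b)
{
    double da   = (double)a;            /* 32-bit value, exact in a double */
    double b_lo = (double)(b & 0xffff); /* low 16 bits */
    double b_hi = (double)(b >> 16);    /* high 16 bits */

    /* each product is at most (2^32-1)*(2^16-1) < 2^48, hence exact */
    uint64_t lo = (uint64_t)(da * b_lo);
    uint64_t hi = (uint64_t)(da * b_hi);
    return lo + (hi << 16);             /* recombine the pieces */
}
```

The 64-bit 21-bit/32-bit split mentioned above works the same way, just with more pieces and a wider recombination.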
Work on Schönhage GCD and GCDEXT for large numbers is in progress. Contact Niels Möller if you want to help.
Implement the functions of math.h for the GMP mpf layer! Check the book "Pi and the AGM" by Borwein and Borwein for ideas how to do this. These functions are desirable: acos, acosh, asin, asinh, atan, atanh, atan2, cos, cosh, exp, log, log10, pow, sin, sinh, tan, tanh.
The current code uses divisions, which are reasonably fast, but it'd be possible to use only multiplications by computing 1/sqrt(A) using this formula (Newton's iteration for 1/sqrt(A)):

    x[i+1] = x[i] * (3 - A*x[i]^2) / 2

with sqrt(A) then obtained from a final multiply A * x.
That final multiply might be the full size of the input (though it might only need the high half of that), so there may or may not be any speedup overall.
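A minimal sketch of such an iteration in C doubles, assuming the input has been scaled into [1,4) so a fixed seed converges; a real mpf implementation would instead work in fixed point and double the working precision each step:

```c
#include <assert.h>
#include <math.h>

/* Newton's iteration x <- x*(3 - A*x^2)/2 converges to 1/sqrt(A) using
   only multiplications (the /2 is just a shift in fixed point).
   Assumes A has been scaled into [1,4) so the fixed seed 0.5 converges. */
double recip_sqrt(double A)
{
    double x = 0.5;                       /* valid seed for A in [1,4) */
    for (int i = 0; i < 10; i++)
        x = x * (3.0 - A * x * x) * 0.5;  /* error roughly squares each pass */
    return x;                             /* sqrt(A) is then A * x */
}
```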
Improve mpn_rootrem. The current code is really naive, using full precision from the first iteration. Also, calling mpn_pow_1 isn't very clever, as only 1/n of the result bits will be used; truncation after each multiplication would be better. Avoiding division might also be possible. Work mostly done, see http://gmplib.org/devel/ for some new code for divexact.
mpn_bdivmod and the redc in mpz_powm should use some sort of divide and conquer algorithm. This would benefit mpz_divexact, and mpn_gcd on large unequal size operands. See "Exact Division with Karatsuba Complexity" by Jebelean for a (brief) description.
Failing that, some sort of DIVEXACT_THRESHOLD could be added to control whether mpz_divexact uses mpn_bdivmod or mpn_tdiv_qr, since the latter is faster on large divisors.
For the REDC, basically all that's needed is Montgomery's algorithm done in multi-limb integers. R is multiple limbs, and the inverse and operands are multi-precision.
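For reference, a hedged single-limb sketch of Montgomery's REDC with R = 2^32 (our illustration; the multi-limb version described above follows the same shape with mpn arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Newton-style inverse: returns -1/n mod 2^32 for odd n. */
uint32_t neg_inverse(uint32_t n)
{
    uint32_t inv = n;             /* correct to 3 bits, since n*n == 1 mod 8 */
    for (int i = 0; i < 4; i++)
        inv *= 2 - n * inv;       /* each pass doubles the correct bits */
    return 0 - inv;               /* negate: inv was 1/n mod 2^32 */
}

/* REDC: returns t / 2^32 mod n, for odd n and t < n * 2^32. */
uint32_t redc(uint64_t t, uint32_t n, uint32_t ninv)
{
    uint32_t m = (uint32_t)t * ninv;   /* m = t * (-1/n) mod 2^32 */
    uint64_t mn = (uint64_t)m * n;     /* t + mn is divisible by 2^32 */
    /* add high halves with an explicit carry so 64 bits never overflow:
       the low halves sum to exactly 2^32 whenever low(t) is nonzero */
    uint64_t u = (t >> 32) + (mn >> 32) + ((uint32_t)t != 0);
    return u >= n ? (uint32_t)(u - n) : (uint32_t)u;
}
```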
For exact division the time to calculate a multi-limb inverse is not amortized across many modular operations, but instead will probably create a threshold below which the current style mpn_bdivmod is best. There's also Krandick and Jebelean, "Bidirectional Exact Integer Division", to basically use a low to high exact division for the low half quotient, and a quotient-only division for the high half.
It will be noted that low-half and high-half multiplies, and a low-half square, can be used. These ought to each take as little as half the time of a full multiply, or square, though work by Thom Mulders shows the advantage is progressively lost as Karatsuba and higher algorithms are applied.
Some sort of scheme for exceptions handling would be desirable. Presently the only thing documented is that divide by zero in GMP functions provokes a deliberate machine divide by zero (on those systems where such a thing exists at least). The global gmp_errno is not actually documented, except for the old gmp_randinit function. Being currently just a plain global means it's not thread-safe.
The basic choices for exceptions are returning an error code or having a handler function to be called. The disadvantage of error returns is they have to be checked, leading to tedious and rarely executed code, and strictly speaking such a scheme wouldn't be source or binary compatible. The disadvantage of a handler function is that a longjmp or similar recovery from it may be difficult. A combination would be possible, for instance by allowing the handler to return an error code.
Divide-by-zero, sqrt-of-negative, and similar operand range errors can normally be detected at the start of functions, so exception handling would have a clean state. What's worth considering though is that the GMP function detecting the exception may have been called via some third party library or self contained application module, and hence have various bits of state to be cleaned up above it. It'd be highly desirable for an exceptions scheme to allow for such cleanups.
The C++ destructor mechanism could help with cleanups both internally and externally, but being a plain C library we don't want to depend on that.
A C++ throw might be a good optional extra exceptions mechanism, perhaps under a build option. For GCC -fexceptions will add the necessary frame information to plain C code, or GMP could be compiled as C++.
Out-of-memory exceptions are expected to be handled by the mp_set_memory_functions routines, rather than being a prospective part of divide-by-zero etc. Some similar considerations apply but what differs is that out-of-memory can arise deep within GMP internals. Even fundamental routines like mpn_add_n and mpn_addmul_1 can use temporary memory (for instance on Cray vector systems). Allowing for an error code return would require an awful lot of checking internally. Perhaps it'd still be worthwhile, but it'd be a lot of changes and the extra code would probably be rather rarely executed in normal usages.
A longjmp recovery for out-of-memory will currently, in general, lead to memory leaks and may leave GMP variables operated on in inconsistent states. Maybe it'd be possible to record recovery information for use by the relevant allocate or reallocate function, but that too would be a lot of changes.
One scheme for out-of-memory would be to note that all GMP allocations go through the mp_set_memory_functions routines. So if the application has an intended setjmp recovery point it can record memory activity by GMP and abandon space allocated and variables initialized after that point. This might be as simple as directing the allocation functions to a separate pool, but in general would have the disadvantage of needing application-level bookkeeping on top of the normal system malloc. An advantage however is that it needs nothing from GMP itself and on that basis doesn't burden applications not needing recovery. Note that there's probably some details to be worked out here about reallocs of existing variables, and perhaps about copying or swapping between "permanent" and "temporary" variables.
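A minimal sketch of the application-level bookkeeping described, with hypothetical names (tracked_alloc, pool_rollback); in a real program these hooks would be installed via mp_set_memory_functions and paired with a setjmp recovery point:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tracking pool: on recovery, everything allocated since a
   recorded checkpoint can be abandoned.  A real program would install
   tracked_alloc (plus matching realloc/free hooks) with GMP's
   mp_set_memory_functions; this sketch is self-contained. */
#define POOL_MAX 1024
static void *pool[POOL_MAX];
static int pool_top = 0;

void *tracked_alloc(size_t size)
{
    void *p = malloc(size);
    assert(p != NULL && pool_top < POOL_MAX);
    pool[pool_top++] = p;          /* remember for possible rollback */
    return p;
}

/* Free everything allocated after the recorded checkpoint. */
void pool_rollback(int checkpoint)
{
    while (pool_top > checkpoint)
        free(pool[--pool_top]);
}
```

The reallocs-of-existing-variables problem noted above is exactly what this simple stack of pointers does not handle.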
Applications desiring a fine-grained error control, for instance a language interpreter, would very possibly not be well served by a scheme requiring longjmp. Wrapping every GMP function call with a setjmp would be very inconvenient.
Another option would be to let mpz_t etc hold a sort of NaN, a special value indicating an out-of-memory or other failure. This would be similar to NaNs in mpfr. Unfortunately such a scheme could only be used by programs prepared to handle such special values, since for instance a program waiting for some condition to be satisfied could become an infinite loop if it wasn't also watching for NaNs. The work to implement this would be significant too, lots of checking of inputs and intermediate results. And if mpn routines were to participate in this (which they would have to internally) a lot of new return values would need to be added, since of course there's no mpz_t etc structure for them to indicate failure in.
Stack overflow is another possible exception, but perhaps not one that can be easily detected in general. On i386 GNU/Linux for instance GCC normally doesn't generate stack probes for an alloca, but merely adjusts %esp. A big enough alloca can miss the stack redzone and hit arbitrary data. GMP stack usage is normally a function of operand size, which might be enough for some applications to know they'll be safe. Otherwise a fixed maximum usage can probably be obtained by building with --enable-alloca=malloc-reentrant (or notreentrant). Arranging the default to be alloca only on blocks up to a certain size and malloc thereafter might be a better approach and would have the advantage of not having calculations limited by available stack.
Actually recovering from stack overflow is of course another problem. It might be possible to catch a SIGSEGV in the stack redzone and do something in a sigaltstack, on systems which have that, but recovery might otherwise not be possible. This is worth bearing in mind because there's no point worrying about tight and careful out-of-memory recovery if an out-of-stack is fatal.
Operand overflow is another exception to be addressed. It's easy for instance to ask mpz_pow_ui for a result bigger than an mpz_t can possibly represent. Currently overflows in limb or byte count calculations will go undetected. Often they'll still end up asking the memory functions for blocks bigger than available memory, but that's by no means certain and results are unpredictable in general. It'd be desirable to tighten up such size calculations. Probably only selected routines would need checks, if it's assumed say that no input will be more than half of all memory and hence size additions like say mpz_mul won't overflow.
It'd be nice to have some sort of tool for getting an overview of performance. Clearly a great many things could be done, but some primary uses would be,
A combination of measuring some fundamental routines and some representative application routines might satisfy these.
The tune/time.c routines would be the easiest way to get good accurate measurements on lots of different systems. The high level speed_measure may or may not suit, but the basic speed_starttime and speed_endtime would cover lots of portability and accuracy questions.
restrict
There might be some value in judicious use of C99 style restrict on various pointers, but this would need some careful thought about what it implies for the various operand overlaps permitted in GMP.
Rumour has it some pre-C99 compilers had restrict, but expressing tighter (or perhaps looser) requirements. Might be worth investigating that before using restrict unconditionally.
Loops are presumably where the greatest benefit would be had, by allowing the compiler to advance reads ahead of writes, perhaps as part of loop unrolling. However critical loops are generally coded in assembler, so there might not be very much to gain. And on Cray systems the explicit use of _Pragma gives an equivalent effect.
One thing to note is that Microsoft C headers (on ia64 at least) contain __declspec(restrict), so a #define of restrict should be avoided. It might be wisest to setup a gmp_restrict.
The limb-by-limb dependencies in the existing Nx1 division (and remainder) code mean that chips with multiple execution units or pipelined multipliers are not fully utilized.
One possibility is to follow the current preinv method but taking two limbs at a time. That means a 2x2->4 and a 2x1->2 multiply for each two limbs processed, and because the 2x2 and 2x1 can each be done in parallel the latency will be not much more than 2 multiplies for two limbs, whereas the single limb method has a 2 multiply latency for just one limb. A version of mpn_divrem_1 doing this has been written in C, but not yet tested on likely chips. Clearly this scheme would extend to 3x3->9 and 3x1->3 etc, though with diminishing returns.
For mpn_mod_1, Peter L. Montgomery proposes the following scheme. For a limb R=2^bits_per_mp_limb, pre-calculate values R mod N, R^2 mod N, R^3 mod N, R^4 mod N. Then take dividend limbs and multiply them by those values, thereby reducing them (moving them down) by the corresponding factor. The products can be added to produce an intermediate remainder of 2 or 3 limbs to be similarly included in the next step. The point is that such multiplies can be done in parallel, meaning as little as 1 multiply worth of latency for 4 limbs. If the modulus N is less than R/4 (or is it R/5?) the summed products will fit in 2 limbs, otherwise 3 will be required, but with the high only being small. Clearly this extends to as many factors of R as a chip can efficiently apply.
The logical conclusion for powers R^i is a whole array "p[i] = R^i mod N" for i up to k, the size of the dividend. This could then be applied at multiplier throughput speed like an inner product. If the powers took roughly k divide steps to calculate then there'd be an advantage any time the same N was used three or more times. Suggested by Victor Shoup in connection with chinese-remainder style decompositions, but perhaps with other uses.
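A hedged C sketch of the inner-product form with 32-bit limbs and R = 2^32, assuming N < 2^31 so 64-bit intermediates cannot overflow:

```c
#include <assert.h>
#include <stdint.h>

#define MAXLIMBS 64

/* Sketch of the precomputed-powers idea: the remainder of a[0..n-1]
   (limb i weighted by R^i) mod N is sum(a[i] * (R^i mod N)) mod N.
   Assumes N < 2^31.  The multiplies in the second loop are mutually
   independent, which is the point of the scheme. */
uint32_t mod_by_powers(const uint32_t *a, int n, uint32_t N)
{
    uint64_t p[MAXLIMBS];
    p[0] = 1 % N;                        /* R^0 mod N */
    for (int i = 1; i < n; i++)
        p[i] = (p[i - 1] << 32) % N;     /* R^i mod N, computed once per N */

    uint64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * p[i] % N;          /* independent multiplies */
    return (uint32_t)(acc % N);
}
```

In the real scheme the p[i] table is computed once per modulus and amortized over many reductions.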
mpn_modexact_1_odd calculates an x in the range 0<=x<d satisfying a = q*d + x*b^n, where b=2^bits_per_limb. The factor b^n needed to get the true remainder r could be calculated by a powering algorithm, allowing mpn_modexact_1_odd to be pressed into service for an mpn_mod_1. modexact_1 is simple and on some chips can run noticeably faster than plain mod_1, on Athlon for instance 11 cycles/limb instead of 17. Such a difference could soon overcome the time to calculate b^n. The requirement for an odd divisor in modexact can be handled by some shifting on-the-fly, or perhaps by an extra partial-limb step at the end.
The removal of twos in the current code could be extended to factors of 3 or 5. Taking this to its logical conclusion would be a complete decomposition into powers of primes. The power for a prime p is of course floor(n/p)+floor(n/p^2)+... Conrad Curry found this is quite fast (using simultaneous powering as per Handbook of Applied Cryptography algorithm 14.88).
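The power of p in n! from that formula can be computed with repeated division (a sketch, not the GMP code):

```c
#include <assert.h>

/* Legendre's formula as given in the text: the power of prime p
   dividing n! is floor(n/p) + floor(n/p^2) + ... */
unsigned long prime_power_in_factorial(unsigned long n, unsigned long p)
{
    unsigned long e = 0;
    while (n > 0) {
        n /= p;         /* successively n/p, n/p^2, ... */
        e += n;
    }
    return e;
}
```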
A difficulty with using all primes is that quite large n can be calculated on a system with enough memory, larger than we'd probably want for a table of primes, so some sort of sieving would be wanted. Perhaps just taking out the factors of 3 and 5 would give most of the speedup that a prime decomposition can offer.
An obvious improvement to the current code would be to strip factors of 2 from each multiplier and divisor and count them separately, to be applied with a bit shift at the end. Factors of 3 and perhaps 5 could even be handled similarly.
Conrad Curry reports a big speedup for binomial coefficients using a prime powering scheme, at least for k near n/2. Of course this is only practical for moderate size n since again it requires primes up to n.
When k is small the current (n-k+1)...n/1...k will be fastest. Some sort of rule would be needed for when to use this or when to use prime powering. Such a rule will be a function of both n and k. Some investigation is needed to see what sort of shape the crossover line will have; the usual parameter tuning can of course find machine dependent constants to fill in where necessary.
An easier possibility also reported by Conrad Curry is that it may be faster not to divide out the denominator (1...k) one limb at a time, but do one big division at the end. Is this because a big divisor in mpn_bdivmod trades the latency of mpn_divexact_1 for the throughput of mpn_submul_1? Overheads must hurt though.
Another reason a big divisor might help is that mpn_divexact_1 won't be getting a full limb in mpz_bin_uiui. It's called when the n accumulator is full but the k may be far from full. Perhaps the two could be decoupled so k is applied when full. It'd be necessary to delay consideration of k terms until the corresponding n terms had been applied though, since otherwise the division won't be exact.
mpz_perfect_power_p could be improved in a number of ways, for instance p-adic arithmetic to find possible roots.
Non-powers can be quickly identified by checking for Nth power residues modulo small primes, like mpn_perfect_square_p does for squares. The residues to each power N for a given remainder could be grouped into a bit mask, the masks for the remainders to each divisor would then be "and"ed together to hopefully leave only a few candidate powers. Need to think about how wide to make such masks, ie. how many powers to examine in this way.
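A small sketch of the mask idea (hypothetical helper, with a brute-force residue computation purely for illustration): bit N of the mask for remainder r mod q is set iff r is an Nth power residue mod q, and masks for several primes can then be ANDed together.

```c
#include <assert.h>
#include <stdint.h>

#define MAXN 16   /* how many powers to examine is itself a tuning choice */

/* Hypothetical helper: bit N of the result is set iff r is an Nth power
   residue mod the small prime q.  Brute force here; a real
   implementation would precompute these masks in a table. */
uint32_t residue_mask(unsigned r, unsigned q)
{
    uint32_t mask = 0;
    for (unsigned N = 2; N < MAXN; N++)
        for (unsigned x = 0; x < q; x++) {
            unsigned long long p = 1;
            for (unsigned i = 0; i < N; i++)
                p = p * x % q;               /* x^N mod q */
            if (p == r) { mask |= 1u << N; break; }
        }
    return mask;
}
```

Candidate powers for an input v are then the set bits of residue_mask(v % 3, 3) & residue_mask(v % 5, 5) & ... across the chosen small primes.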
Any zero remainders found in residue testing reveal factors which can be divided out, with the multiplicity restricting the powers that need to be considered, as per the current code. Further prime dividing should be grouped into limbs like PP. Need to think about how much dividing to do like that, probably more for bigger inputs, less for smaller inputs.
mpn_gcd_1 would probably be better than the current private GCD routine. The use it's put to isn't time-critical, and it might help ensure correctness to just use the main GCD routine.
GMP is not really a number theory library and probably shouldn't have large amounts of code dedicated to sophisticated prime testing algorithms, but basic things well-implemented would suit. Tests offering certainty are probably all too big or too slow (or both!) to justify inclusion in the main library. Demo programs showing some possibilities would be good though.
The present "repetitions" argument to mpz_probab_prime_p is rather specific to the Miller-Rabin tests of the current implementation. Better would be some sort of parameter asking perhaps for a maximum chance 1/2^x of a probable prime in fact being composite. If applications follow the advice that the present reps gives 1/4^reps chance then perhaps such a change is unnecessary, but an explicitly described 1/2^x would allow for changes in the implementation or even for new proofs about the theory.
mpz_probab_prime_p always initializes a new gmp_randstate_t for randomized tests, which unfortunately means it's not really very random and in particular always runs the same tests for a given input. Perhaps a new interface could accept an rstate to use, so successive tests could increase confidence in the result.
mpn_mod_34lsub1 is an obvious and easy improvement to the trial divisions. And since the various prime factors are constants, the remainder can be tested with something like
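The elided snippet might look like the following reconstruction (our guess, not the original code). On a 32-bit limb the mpn_mod_34lsub1 result is congruent to the input mod 2^24-1 = 3^2 * 5 * 7 * 13 * 17 * 241, so each of those primes divides the input exactly when it divides r, and with constant moduli the compiler turns every % into a multiply-by-inverse:

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruction sketch: r is a fold of the input congruent to it
   mod 2^24-1 = 3^2 * 5 * 7 * 13 * 17 * 241; since each listed prime
   divides 2^24-1, it divides the input iff it divides r. */
int has_small_factor(uint32_t r)
{
    return r % 3 == 0 || r % 5 == 0 || r % 7 == 0
        || r % 13 == 0 || r % 17 == 0 || r % 241 == 0;
}
```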
The trial divisions are done with primes generated and grouped at runtime. This could instead be a table of data, with pre-calculated inverses too. Storing deltas, ie. amounts to add, rather than actual primes would save space. udiv_qrnnd_preinv style inverses can be made to exist by adding dummy factors of 2 if necessary. Some thought needs to be given as to how big such a table should be, based on how much dividing would be profitable for what sort of size inputs. The data could be shared by the perfect power testing.
Jason Moxham points out that if a sqrt(-1) mod N exists then any factor of N must be == 1 mod 4, saving half the work in trial dividing. (If x^2==-1 mod N then for a prime factor p we have x^2==-1 mod p and so the Jacobi symbol (-1/p)=1. But also (-1/p)=(-1)^((p-1)/2), hence must have p==1 mod 4.) But knowing whether sqrt(-1) mod N exists is not too easy. A strong pseudoprime test can reveal one, so perhaps such a test could be inserted part way through the dividing.
Jon Grantham "Frobenius Pseudoprimes" (www.pseudoprime.com) describes a quadratic pseudoprime test taking about 3x longer than a plain test, but with only a 1/7710 chance of error (whereas 3 plain Miller-Rabin tests would offer only (1/4)^3 == 1/64). Such a test needs completely random parameters to satisfy the theory, though single-limb values would run faster. It's probably best to do at least one plain Miller-Rabin before any quadratic tests, since that can identify composites in less total time.
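For concreteness, one round of the plain Miller-Rabin strong pseudoprime test, sketched here for moduli below 2^32 so that 64-bit arithmetic suffices:

```c
#include <assert.h>
#include <stdint.h>

/* Modular exponentiation, b^e mod m, valid for m below 2^32. */
uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
    }
    return r;
}

/* One strong pseudoprime round to base a, for odd n > 2: returns 1 for
   "probable prime", 0 for "definitely composite". */
int miller_rabin(uint64_t n, uint64_t a)
{
    uint64_t d = n - 1;
    int s = 0;
    while ((d & 1) == 0) { d >>= 1; s++; }   /* n-1 = d * 2^s, d odd */
    uint64_t x = powmod(a, d, n);
    if (x == 1 || x == n - 1) return 1;
    while (--s > 0) {
        x = x * x % n;
        if (x == n - 1) return 1;
    }
    return 0;
}
```

Note that a single round can be fooled: 2047 = 23*89 is the smallest strong pseudoprime to base 2, which is why the number and choice of rounds matters.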
Some thought needs to be given to the structure of which tests (trial division, Miller-Rabin, quadratic) and how many are done, based on what sort of inputs we expect, with a view to minimizing average time.
It might be a good idea to break out subroutines for the various tests, so that an application can combine them in ways it prefers, if sensible defaults in mpz_probab_prime_p don't suit. In particular this would let applications skip tests it knew would be unprofitable, like trial dividing when an input is already known to have no small factors.
For small inputs, combinations of theory and explicit search make it relatively easy to offer certainty. For instance numbers up to 2^32 could be handled with a strong pseudoprime test and table lookup. But it's rather doubtful whether a smallnum prime test belongs in a bignum library. Perhaps if it had other internal uses.
An mpz_nthprime might be cute, but is almost certainly impractical for anything but small n.
On various systems, calls within libmpir still go through the PLT, TOC or other mechanism, which makes the code bigger and slower than it needs to be.
The theory would be to have all GMP intra-library calls resolved directly to the routines in the library. An application wouldn't be able to replace a routine, the way it can normally, but there seems no good reason to do that, in normal circumstances.
The visibility attribute in recent gcc is good for this, because it lets gcc omit unnecessary GOT pointer setups or whatever if it finds all calls are local and there's no global data references. Documented entrypoints would be protected, and purely internal things not wanted by test programs or anything can be internal.
Unfortunately, on i386 it seems protected ends up causing text segment relocations within libmpir.so, meaning the library code can't be shared between processes, defeating the purpose of a shared library. Perhaps this is just a gremlin in binutils (debian packaged 2.13.90.0.16-1).
The linker can be told directly (with a link script, or options) to do the same sort of thing. This doesn't change the code emitted by gcc of course, but it does mean calls are resolved directly to their targets, avoiding a PLT entry.
Keeping symbols private to libmpir.so is probably a good thing in general too, to stop anyone even attempting to access them. But some undocumented things will need or want to be kept visible, for use by mpf, or the test and tune programs. Libtool has a standard option for selecting public symbols (used now for libmp).