GMP Itemized Development Tasks

Copyright 2000, 2001, 2002, 2003, 2004, 2006 Free Software Foundation, Inc.

Copyright 2008 William Hart.

This file is part of the MPIR Library.

The MPIR Library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

The MPIR Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with the MPIR Library; see the file COPYING.LIB. If not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
This file is current as of 21 Apr 2006. Please send comments to http://groups.google.co.uk/group/mpir-devel/.

These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.

Correctness and Completeness

Machine Independent Optimization

  • mpf_cmp: For better cache locality, don't test for low zero limbs until the high limbs fail to give an ordering. Reduce code size by turning the three mpn_cmp's into a single loop stopping when the end of one operand is reached (and then looking for a non-zero in the rest of the other).
  • mpf_mul_2exp, mpf_div_2exp: The use of mpn_lshift for any size<=prec means repeated mul_2exp and div_2exp calls accumulate low zero limbs until size==prec+1 is reached. Those zeros will slow down subsequent operations, especially if the value is otherwise only small. If low bits of the low limb are zero, use mpn_rshift so as to not increase the size.
  • mpn_dc_sqrtrem: Don't use mpn_addmul_1 with multiplier==2, instead either mpn_addlsh1_n when available, or mpn_lshift+mpn_add_n if not.
  • mpn_dc_sqrtrem, mpn_sqrtrem2: Don't use mpn_add_1 and mpn_sub_1 for 1 limb operations, instead ADDC_LIMB and SUBC_LIMB.
  • mpn_sqrtrem2: Use plain variables for sp[0] and rp[0] calculations, so the compiler needn't worry about aliasing between sp and rp.
  • mpn_sqrtrem: Some work can be saved in the last step when the remainder is not required, as noted in Paul's paper.
  • mpq_add, mpq_sub: The division "op1.den / gcd" is done twice, where of course only once is necessary. Reported by Larry Lambe.
  • mpq_add, mpq_sub: The gcd fits a single limb with high probability and in this case modlimb_invert could be used to calculate the inverse just once for the two exact divisions "op1.den / gcd" and "op2.den / gcd", rather than letting mpn_divexact_1 do it each time. This would require a new mpn_preinv_divexact_1 interface. Not sure if it'd be worth the trouble.
  • mpq_add, mpq_sub: The use of mpz_mul(x,y,x) causes temp allocation or copying in mpz_mul which can probably be avoided. A rewrite using mpn might be best.
  • mpn_gcdext: Don't test count_leading_zeros for zero, instead check the high bit of the operand and avoid invoking count_leading_zeros. This is an optimization on all machines, and significant on machines with slow count_leading_zeros, though it's possible an already normalized operand might not be encountered very often.
  • Rewrite umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
  • Improve the default umul_ppmm code in longlong.h: Add partial products with fewer operations.
  • Consider inlining mpz_set_ui. This would be both small and fast, especially for compile-time constants, but would make application binaries depend on having 1 limb allocated to an mpz_t, preventing the "lazy" allocation scheme below.
  • Consider inlining mpz_[cft]div_ui and maybe mpz_[cft]div_r_ui. A __gmp_divide_by_zero would be needed for the divide by zero test, unless that could be left to mpn_mod_1 (not sure currently whether all the risc chips provoke the right exception there if using mul-by-inverse).
  • Consider inlining: mpz_fits_s*_p. The setups for LONG_MAX etc would need to go into mpir.h, and on Cray it might, unfortunately, be necessary to forcibly include <limits.h> since there's no apparent way to get SHRT_MAX with an expression (since short and unsigned short can be different sizes).
  • mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
  • mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
  • mpz_powm REDC should do multiplications by g[] using the division method when they're small, since the REDC form of a small multiplier is normally a full size product. Probably would need a new tuned parameter to say what size multiplier is "small", as a function of the size of the modulus.
  • mpz_powm REDC should handle even moduli if possible. Maybe this would mean for m=n*2^k doing mod n using REDC and an auxiliary calculation mod 2^k, then putting them together at the end.
  • mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
  • Toom3 could use a low to high cache localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
  • mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, eg. 0.5 in decimal. These could be stripped to save work in later operations.
  • mpz_and, mpz_ior and mpz_xor should use mpn_and_n etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
  • mpf_add: Don't do a copy to avoid overlapping operands unless it's really necessary (currently only sizes are tested, not whether r really is u or v).
  • mpf_add: Under the check for v having no effect on the result, perhaps test for r==u and do nothing in that case, rather than currently it looks like an MPN_COPY_INCR will be done to reduce prec+1 limbs to prec.
  • mpf_div_ui: Instead of padding with low zeros, call mpn_divrem_1 asking for fractional quotient limbs.
  • mpf_div_ui: Eliminate TMP_ALLOC. When r!=u there's no overlap and the division can be called on those operands. When r==u and is prec+1 limbs, then it's an in-place division. If r==u and not prec+1 limbs, then move the available limbs up to prec+1 and do an in-place division there.
  • mpf_div_ui: Whether the high quotient limb is zero can be determined by testing the dividend for high<divisor. When non-zero, the division can be done on prec dividend limbs instead of prec+1. The result size is also known before the division, so that can be a tail call (once the TMP_ALLOC is eliminated).
  • mpn_divrem_2 could usefully accept unnormalized divisors and shift the dividend on-the-fly, since this should cost nothing on superscalar processors and avoid the need for temporary copying in mpn_tdiv_qr.
  • mpf_sqrt: If r!=u, and if u doesn't need to be padded with zeros, then there's no need for the tp temporary.
  • mpq_cmp_ui could form the num1*den2 and num2*den1 products limb-by-limb from high to low and look at each step for values differing by more than the possible carry bit from the uncalculated portion.
  • mpq_cmp could do the same high-to-low progressive multiply and compare. The benefits of karatsuba and higher multiplication algorithms are lost, but if it's assumed only a few high limbs will be needed to determine an order then that's fine.
  • mpn_add_1, mpn_sub_1, mpn_add, mpn_sub: Internally use __GMPN_ADD_1 etc instead of the functions, so they get inlined on all compilers, not just gcc and others with inline recognised in mpir.h. __GMPN_ADD_1 etc are meant mostly to support application inline mpn_add_1 etc and if they don't come out good for internal uses then special forms can be introduced, for instance many internal uses are in-place. Sometimes a block of code is executed based on the carry-out, rather than using it arithmetically, and those places might want to do their own loops entirely.
  • __gmp_extract_double on 64-bit systems could use just one bitfield for the mantissa extraction, not two, when endianness permits. Might depend on the compiler allowing long long bit fields when that's the only actual 64-bit type.
  • tal-notreent.c could keep a block of memory permanently allocated. Currently the last nested TMP_FREE releases all memory, so there's an allocate and free every time a top-level function using TMP is called. Would need mp_set_memory_functions to tell tal-notreent.c to release any cached memory when changing allocation functions though.
  • __gmp_tmp_alloc from tal-notreent.c could be partially inlined. If the current chunk has enough room then a couple of pointers can be updated. Only if more space is required then a call to some sort of __gmp_tmp_increase would be needed. The requirement that TMP_ALLOC is an expression might make the implementation a bit ugly and/or a bit sub-optimal. #define TMP_ALLOC(n) ((ROUND_UP(n) > current->end - current->point ? __gmp_tmp_increase (ROUND_UP (n)) : 0), current->point += ROUND_UP (n), current->point - ROUND_UP (n))
  • __mp_bases has a lot of data for bases which are pretty much never used. Perhaps the table should just go up to base 16, and have code to generate data above that, if and when required. Naturally this assumes the code would be smaller than the data saved.
  • __mp_bases field big_base_inverted is only used if USE_PREINV_DIVREM_1 is true, and could be omitted otherwise, to save space.
  • mpz_get_str, mtox: For power-of-2 bases, which are of course fast, it seems a little silly to make a second pass over the mpn_get_str output to convert to ASCII. Perhaps combine that with the bit extractions.
  • mpz_gcdext: If the caller requests only the S cofactor (of A), and A<B, then the code ends up generating the cofactor T (of B) and deriving S from that. Perhaps it'd be possible to arrange to get S in the first place by calling mpn_gcdext with A+B,B. This might only be an advantage if A and B are about the same size.
  • mpz_n_pow_ui does a good job with small bases and stripping powers of 2, but it's perhaps a bit too complicated for what it gains. The simple mpn_pow_1 is a little faster on small exponents. (Note some of the ugliness in mpz_n_pow_ui is due to supporting mpn_mul_2.) Perhaps the stripping of 2s in mpz_n_pow_ui should be confined to single limb operands for simplicity and since that's where the greatest gain would be. Ideally mpn_pow_1 and mpz_n_pow_ui would be merged. The reason mpz_n_pow_ui writes to an mpz_t is that its callers leave it to make a good estimate of the result size. Callers of mpn_pow_1 already know the size by separate means (mp_bases).
  • mpz_invert should call mpn_gcdext directly.
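The high-bit guard suggested for mpn_gcdext above can be sketched in plain C. This is only an illustration: the limb type and helper names here are stand-ins, not GMP's internals, and the loop-based count merely plays the role of a slow count_leading_zeros.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;                      /* stand-in limb type */
#define LIMB_HIGHBIT ((mp_limb_t)1 << 63)

/* Portable stand-in for count_leading_zeros; on some CPUs the real
   macro is slow, which is the point of the guard below. */
static int leading_zeros(mp_limb_t x)
{
    int cnt = 0;
    for (mp_limb_t bit = LIMB_HIGHBIT; bit != 0 && (x & bit) == 0; bit >>= 1)
        cnt++;
    return cnt;
}

/* Shift needed to normalize a high limb: test the high bit first, so an
   already-normalized operand never invokes the (possibly slow) count. */
static int norm_shift(mp_limb_t high)
{
    if (high & LIMB_HIGHBIT)
        return 0;
    return leading_zeros(high);
}
```

For an already-normalized operand the count is skipped entirely; whether that case occurs often enough to pay off is exactly the open question the item raises.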

Machine Dependent Optimization

  • invert_limb on various processors might benefit from the little Newton iteration done for alpha and ia64.
  • Alpha 21264: mpn_addlsh1_n could be implemented with mpn_addmul_1, since that code at 3.5 is a touch faster than a separate lshift and add_n at 1.75+2.125=3.875. Or very likely some specific addlsh1_n code could beat both.
  • Alpha 21264: Improve feed-in code for mpn_mul_1, mpn_addmul_1, and mpn_submul_1.
  • Alpha 21164: Rewrite mpn_mul_1, mpn_addmul_1, and mpn_submul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks, or perhaps even better in four 16-bit chunks. Probably possible to reach 9 cycles/limb.
  • Alpha: GCC 3.4 will introduce __builtin_ctzl, __builtin_clzl and __builtin_popcountl using the corresponding CIX ct instructions, and __builtin_alpha_cmpbge. These should give GCC more information about scheduling etc than the asm blocks currently used in longlong.h and gmp-impl.h.
  • Alpha Unicos: Apparently there's no alloca on this system, making configure choose the slower malloc-reentrant allocation method. Is there a better way? Maybe variable-length arrays per notes below.
  • Alpha Unicos 21164, 21264: .align is not used since it pads with garbage. Does the code get the intended slotting required for the claimed speeds? .align at the start of a function would presumably be safe no matter how it pads.
  • ARM V5: count_leading_zeros can use the clz instruction. For GCC 3.4 and up, do this via __builtin_clzl since then gcc knows it's "predicable".
  • Itanium: GCC 3.4 introduces __builtin_popcount which can be used instead of an asm block. The builtin should give gcc more opportunities for scheduling, bundling and predication. __builtin_ctz similarly (it just uses popcount as per current longlong.h).
  • UltraSPARC/64: Optimize mpn_mul_1, mpn_addmul_1, for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how much this can improve the speed, though, since the symmetry that we rely on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
  • UltraSPARC/64: Write native mpn_submul_1, analogous to mpn_addmul_1.
  • UltraSPARC/64: Write umul_ppmm. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles. Try using fp operations, and also try using karatsuba for just three "mulx"s.
  • UltraSPARC/32: Rewrite mpn_lshift, mpn_rshift. Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
  • UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
  • UltraSPARC/32: Use mulx for umul_ppmm if possible (see commented out code in longlong.h). This is unlikely to save more than a couple of cycles, so perhaps isn't worth bothering with.
  • UltraSPARC/32: On Solaris gcc doesn't give us __sparc_v9__ or anything to indicate V9 support when -mcpu=v9 is selected. See gcc/config/sol2-sld-64.h. Will need to pass something through from ./configure to select the right code in longlong.h. (Currently nothing is lost because mulx for multiplying is commented out.)
  • UltraSPARC/32: mpn_divexact_1 and mpn_modexact_1c_odd can use a 64-bit inverse and take 64-bits at a time from the dividend, as per the 32-bit divisor case in mpn/sparc64/mode1o.c. This must be done in assembler, since the full 64-bit registers (%gN) are not available from C.
  • UltraSPARC/32: mpn_divexact_by3c can work 64-bits at a time using mulx, in assembler. This would be the same as for sparc64.
  • UltraSPARC: modlimb_invert might save a few cycles from masking down to just the useful bits at each point in the calculation, since mulx speed depends on the highest bit set. Either explicit masks or small types like short and int ought to work.
  • Sparc64 HAL R1 popc: This chip reputedly implements popc properly (see gcc sparc.md). Would need to recognise it as sparchalr1 or something in configure / config.sub / config.guess. popc_limb in gmp-impl.h could use this (per commented out code). count_trailing_zeros could use it too.
  • PA64: Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current code runs at 11 cycles/limb. It should be possible to saturate the cache, which will happen at 8 cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32; it should be possible to make them run at about 5 cycles/limb.
  • PPC601: See which of the power or powerpc32 code runs better. Currently the powerpc32 is used, but only because it's the default for powerpc*.
  • PPC630: Rewrite mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Use both integer and floating-point operations, possibly two floating-point and one integer limb per loop. Split operands into four 16-bit chunks for fast fp operations. Should easily reach 9 cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb (using one int + two fp).
  • PPC630: mpn_rshift could do the same sort of unrolled loop as mpn_lshift. Some judicious use of m4 might let the two share source code, or with a register to control the loop direction perhaps even share object code.
  • Implement mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
  • POWER2/POWER2SC: Schedule mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
  • X86: Optimize non-MMX mpn_lshift for shifts by 1. (See Pentium code.)
  • X86: Good authority has it that in the past an inline rep movs would upset GCC register allocation for the whole function. Is this still true in GCC 3? It uses rep movs itself for __builtin_memcpy. Examine the code for some simple and complex functions to find out. Inlining rep movs would be desirable, it'd be both smaller and faster.
  • Pentium P54: mpn_lshift and mpn_rshift can come down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after shrdl and shldl, see mpn/x86/pentium/README.
  • Pentium P55 MMX: mpn_lshift and mpn_rshift might benefit from some destination prefetching.
  • PentiumPro: mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
  • K7: mpn_lshift and mpn_rshift might be able to do something branch-free for unaligned startups, and shaving one insn from the loop with alternative indexing might save a cycle.
  • PPC32: Try using fewer registers in the current mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
  • Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
  • Fujitsu VPP: Write mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method", to avoid carry propagation, splitting one of the operands in 11-bit chunks.
  • Pentium: mpn_lshift by 31 should use the special rshift by 1 code, and vice versa mpn_rshift by 31 should use the special lshift by 1. This would be best as a jump across to the other routine, could let both live in lshift.asm and omit rshift.asm on finding mpn_rshift already provided.
  • Cray T3E: Experiment with optimization options. In particular, -hpipeline3 seems promising. We should at least up -O to -O2 or -O3.
  • Cray: mpn_com_n and mpn_and_n etc very probably want a pragma like MPN_COPY_INCR.
  • Cray vector systems: mpn_lshift, mpn_rshift, mpn_popcount and mpn_hamdist are nice and small and could be inlined to avoid function calls.
  • Cray: Variable length arrays seem to be faster than the tal-notreent.c scheme. Not sure why, maybe they merely give the compiler more information about aliasing (or the lack thereof). Would like to modify TMP_ALLOC to use them, or introduce a new scheme. Memory blocks wanted unconditionally are easy enough, those wanted only sometimes are a problem. Perhaps a special size calculation to ask for a dummy length 1 when unwanted, or perhaps an inlined subroutine duplicating code under each conditional. Don't really want to turn everything into a dog's dinner just because Cray don't offer an alloca.
  • Cray: mpn_get_str on power-of-2 bases ought to vectorize. Does it? bits_per_digit and the inner loop over bits in a limb might prevent it. Perhaps special cases for binary, octal and hex would be worthwhile (very possibly for all processors too).
  • S390: BSWAP_LIMB_FETCH looks like it could be done with lrvg, as per glibc sysdeps/s390/s390-64/bits/byteswap.h. Is this only for 64-bit mode, since 32-bit mode has other code? Also, is it worth using for BSWAP_LIMB too, or would that mean a store and re-fetch? Presumably that's what comes out in glibc.
  • Improve count_leading_zeros for 64-bit machines: if ((x >> 32) == 0) { x <<= 32; cnt += 32; } if ((x >> 48) == 0) { x <<= 16; cnt += 16; } ...
  • IRIX 6 MIPSpro compiler has an __inline which could perhaps be used in __GMP_EXTERN_INLINE. What would be the right way to identify suitable versions of that compiler?
  • IRIX cc is rumoured to have an _int_mult_upper (in <intrinsics.h> like Cray), but it didn't seem to exist on some IRIX 6.5 systems tried. If it does actually exist somewhere it would very likely be an improvement over a function call to umul.asm.
  • mpn_get_str final divisions by the base with udiv_qrnd_unnorm could use some sort of multiply-by-inverse on suitable machines. This ends up happening for decimal by presenting the compiler with a run-time constant, but the same for other bases would be good. Perhaps use could be made of the fact base<256.
  • mpn_umul_ppmm, mpn_udiv_qrnnd: Return a structure like div_t to avoid going through memory, in particular helping RISCs that don't do store-to-load forwarding. Clearly this is only possible if the ABI returns a structure of two mp_limb_ts in registers. On PowerPC, structures are returned in memory on AIX and Darwin. In SVR4 they're returned in registers, except that draft SVR4 had said memory, so it'd be prudent to check which is done. We can jam the compiler into the right mode if we know how, since all this is purely internal to libmpir. (gcc has an option, though of course gcc doesn't matter since we use inline asm there.)
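The 64-bit count_leading_zeros improvement sketched in the list above, written out in full. This is a portable sketch only; the real macro would live in longlong.h, and the x==0 case is left undefined as is conventional for clz.

```c
#include <assert.h>
#include <stdint.h>

/* Binary-search count of leading zeros for a 64-bit limb: each test
   halves the remaining window, so six compares replace a bit-by-bit
   loop.  The result is undefined for x == 0, matching common clz
   conventions. */
static int clz64(uint64_t x)
{
    int cnt = 0;
    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
    if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
    if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
    if ((x >> 63) == 0) {           cnt += 1;  }
    return cnt;
}
```

For example, clz64(1) is 63 and clz64 of a value with bit 63 set is 0.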

New Functionality

  • Maybe add mpz_crr (Chinese Remainder Reconstruction).
  • Let `0b' and `0B' mean binary input everywhere.
  • mpz_init and mpq_init could do lazy allocation. Set ALLOC(var) to 0 to indicate nothing allocated, and let _mpz_realloc do the initial alloc. Set z->_mp_d to a dummy that mpz_get_ui and similar can unconditionally fetch from. Niels Möller has had a go at this. The advantages of the lazy scheme would be:
    • Initial allocate would be the size required for the first value stored, rather than getting 1 limb in mpz_init and then more or less immediately reallocating.
    • mpz_init would only store magic values in the mpz_t fields, and could be inlined.
    • A fixed initializer could even be used by applications, like mpz_t z = MPZ_INITIALIZER;, which might be convenient for globals.
    The advantages of the current scheme are:
    • mpz_set_ui and other similar routines needn't check the size allocated and can just store unconditionally.
    • mpz_set_ui and perhaps others like mpz_tdiv_r_ui and a prospective mpz_set_ull could be inlined.
  • Add mpf_out_raw and mpf_inp_raw. Make sure format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines. A format which MPFR can use too would be good.
  • mpn_and_n ... mpn_copyd: Perhaps make the mpn logops and copys available in mpir.h, either as library functions or inlines, with the availability of library functions instantiated in the generated mpir.h at build time.
  • mpz_set_str etc variants taking string lengths rather than null-terminators.
  • mpz_andn, mpz_iorn, mpz_nand, mpz_nior, mpz_xnor might be useful additions, if they could share code with the current such functions (which should be possible).
  • mpz_and_ui etc might be of use sometimes. Suggested by Niels Möller.
  • mpf_set_str and mpf_inp_str could usefully accept 0x, 0b etc when base==0. Perhaps the exponent could default to decimal in this case, with a further 0x, 0b etc allowed there. Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but probably something like "0.123" ought not mean octal.
  • GMP_LONG_LONG_LIMB or some such could become a documented feature of mpir.h, so applications could know whether to printf a limb using %lu or %Lu.
  • GMP_PRIdMP_LIMB and similar defines following C99 <inttypes.h> might be of use to applications printing limbs. But if GMP_LONG_LONG_LIMB or whatever is added then perhaps this can easily enough be left to applications.
  • gmp_printf could accept %b for binary output. It'd be nice if it worked for plain int etc too, not just mpz_t etc.
  • gmp_printf in fact could usefully accept an arbitrary base, for both integer and float conversions. A base either in the format string or as a parameter with * should be allowed. Maybe &13b (b for base) or something like that.
  • gmp_printf could perhaps accept mpq_t for float conversions, eg. "%.4Qf". This would be merely for convenience, but still might be useful. Rounding would be the same as for an mpf_t (ie. currently round-to-nearest, but not actually documented). Alternately, perhaps a separate mpq_get_str_point or some such might be more use. Suggested by Pedro Gimeno.
  • mpz_rscan0 or mpz_revscan0 or some such searching towards the low end of an integer might match mpz_scan0 nicely. Likewise for scan1. Suggested by Roberto Bagnara.
  • mpz_bit_subset or some such to test whether one integer is a bitwise subset of another might be of use. Some sort of return value indicating whether it's a proper or non-proper subset would be good and wouldn't cost anything in the implementation. Suggested by Roberto Bagnara.
  • mpf_get_ld, mpf_set_ld: Conversions between mpf_t and long double, suggested by Dan Christensen. Other long double routines might be desirable too, but mpf would be a start. long double is an ANSI-ism, so everything involving it would need to be suppressed on a K&R compiler. There'd be some work to be done by configure to recognise the format in use, MPFR has a start on this. Often long double is the same as double, which is easy but pretty pointless. A single float format detector macro could look at double then long double. Sometimes there's a compiler option for the size of a long double, eg. xlc on AIX can use either 64-bit or 128-bit. It's probably simplest to regard this as a compiler compatibility issue, and leave it to users or sysadmins to ensure application and library code is built the same.
  • mpz_sqrt_if_perfect_square: When mpz_perfect_square_p does its tests it calculates a square root and then discards it. For some applications it might be useful to return that root. Suggested by Jason Moxham.
  • mpz_get_ull, mpz_set_ull, mpz_get_sll, mpz_set_sll: Conversions for long long. These would aid interoperability, though a mixture of GMP and long long would probably not be too common. Since long long is not always available (it's in C99 and GCC though), disadvantages of using long long in libmpir.a would be
    • Library contents vary according to the build compiler.
    • mpir.h would need an ugly #ifdef block to decide if the application compiler could take the long long prototypes.
    • Some sort of LIBGMP_HAS_LONGLONG might be wanted to indicate whether the functions are available. (Applications using autoconf could probe the library too.)
    It'd be possible to defer the need for long long to application compile time, by having something like mpz_set_2ui called with two halves of a long long. Disadvantages of this would be,
    • Bigger code in the application, though perhaps not if a long long is normally passed as two halves anyway.
    • mpz_get_ull would be a rather big inline, or would have to be two function calls.
    • mpz_get_sll would be a worse inline, and would put the treatment of -0x10..00 into applications (see mpz_get_si correctness above).
    • Although having libmpir.a independent of the build compiler is nice, it sort of sacrifices the capabilities of a good compiler to uniformity with inferior ones.
    Plain use of long long is probably the lesser evil, if only because it makes best use of gcc. In fact perhaps it would suffice to guarantee long long conversions only when using GCC for both application and library. That would cover free software, and we can worry about selected vendor compilers later. In C++ the situation is probably clearer, we demand fairly recent C++ so long long should be available always. We'd probably prefer to have the C and C++ the same in respect of long long support, but it would be possible to have it unconditionally in mpirxx.h, by some means or another.
  • mpz_strtoz parsing the same as strtol. Suggested by Alexander Kruppa.
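The two-halves scheme above would push the long long handling to the application side. A sketch of the split and join involved, assuming 32-bit halves are passed as unsigned long; a hypothetical mpz_set_2ui (not an existing GMP call) would consume the two halves:

```c
#include <assert.h>
#include <stdint.h>

/* Application-side split: break a long long into two halves before
   calling a hypothetical mpz_set_2ui, so mpir.h itself never needs to
   mention long long. */
static void split_ull(unsigned long long v,
                      unsigned long *high, unsigned long *low)
{
    *low  = (unsigned long)(v & 0xffffffffULL);
    *high = (unsigned long)(v >> 32);
}

/* The join an inlined mpz_get_ull would have to perform. */
static unsigned long long join_ull(unsigned long high, unsigned long low)
{
    return ((unsigned long long)high << 32) | (low & 0xffffffffULL);
}
```

The "bigger code in the application" drawback noted above is visible here: every conversion inlines a shift and a mask at the call site.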

Configuration

  • Alpha ev7, ev79: Add code to config.guess to detect these. Believe ev7 will be "3-1307" in the current switch, but need to verify that. (On OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do it ourselves for other systems.)
  • Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is "pic always" and ends up running gcc twice with the same options. This is wasteful, but harmless. Perhaps a newer libtool will be better.
  • ARM: umul_ppmm in longlong.h always uses umull, but is that available only for M series chips or some such? Perhaps it should be configured in some way.
  • HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
  • HPPA: gcc 3.2 introduces a -mschedule=7200 etc parameter, which could be driven by an exact hppa cpu type.
  • Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc. "hinv -c processor" gives lots of information on Irix. Standard config.guess appends "el" to indicate endianness, but AC_C_BIGENDIAN seems the best way to handle that for GMP.
  • PowerPC: The function descriptor nonsense for AIX is currently driven by *-*-aix*. It might be more reliable to do some sort of feature test, examining the compiler output perhaps. It might also be nice to merge the aix.m4 files into powerpc-defs.m4.
  • config.m4 is generated only by the configure script, it won't be regenerated by config.status. Creating it as an AC_OUTPUT would work, but it might upset "make" to have things like L$ get into the Makefiles through AC_SUBST. AC_CONFIG_COMMANDS would be the alternative. With some careful m4 quoting the changequote calls might not be needed, which might free up the order in which things had to be output.
  • Automake: Latest automake has a CCAS, CCASFLAGS scheme. Though we probably wouldn't be using its assembler support we could try to use those variables in compatible ways.
  • GMP_LDFLAGS could probably be done with plain LDFLAGS already used by automake for all linking. But with a bit of luck the next libtool will pass pretty much all CFLAGS through to the compiler when linking, making GMP_LDFLAGS unnecessary.
  • mpn/Makeasm.am uses -c and -o together in the .S and .asm rules, but apparently that isn't completely portable (there's an autoconf AC_PROG_CC_C_O test for it). So far we've not had problems, but perhaps the rules could be rewritten to use "foo.s" as the temporary, or to do a suitable "mv" of the result. The only danger from using foo.s would be if a compile failed and the temporary foo.s then looked like the primary source. Hopefully if the SUFFIXES are ordered to have .S and .asm ahead of .s that wouldn't happen. Might need to check.

Random Numbers

  • _gmp_rand is not particularly fast on the linear congruential algorithm and could stand various improvements.
    • Make a second seed area within gmp_randstate_t (or _mp_algdata rather) to save some copying.
    • Make a special case for a single limb 2exp modulus, to avoid mpn_mul calls. Perhaps the same for two limbs.
    • Inline the lc code, to avoid a function call and TMP_ALLOC for every chunk.
    • Perhaps the 2exp and general LC cases should be split, for clarity (if the general case is retained).
  • gmp_randstate_t used for parameters perhaps should become gmp_randstate_ptr the same as other types.
  • Some of the empirical randomness tests could be included in a "make check". They ought to work everywhere, for a given seed at least.
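The single-limb 2exp special case suggested for _gmp_rand above amounts to one native multiply and a mask, with no mpn_mul call at all. A minimal sketch, assuming a 64-bit limb; the function name and any constants used with it are illustrative, not GMP's:

```c
#include <assert.h>
#include <stdint.h>

/* One linear congruential step seed' = (a*seed + c) mod 2^m, for a
   modulus 2^m fitting a single 64-bit limb: the mod is just a mask,
   and the multiply wraps naturally in limb arithmetic. */
static uint64_t lc_step_2exp(uint64_t seed, uint64_t a, uint64_t c, int m)
{
    uint64_t mask = (m == 64) ? ~(uint64_t)0 : (((uint64_t)1 << m) - 1);
    return (a * seed + c) & mask;
}
```

For example, with a=3, c=7, m=4 a seed of 5 steps to (15+7) mod 16 = 6.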

C++

  • mpz_class(string), etc: Use the C++ global locale to identify whitespace. mpf_class(string): Use the C++ global locale decimal point, rather than the C one. Consider making these variant mpz_set_str etc forms available for mpz_t too, not just mpz_class etc.
  • mpq_class operator+=: Don't emit an unnecessary mpq_set(q,q) before mpz_addmul etc.
  • Put various bits of mpirxx.h into libmpirxx, to avoid excessive inlining. Candidates for this would be,
    • mpz_class(const char *), etc: since they're normally not fast anyway, and we can hide the exception throw.
    • mpz_class(string), etc: to hide the cstr needed to get to the C conversion function.
    • mpz_class string, char* etc constructors: likewise to hide the throws and conversions.
    • mpz_class::get_str, etc: to hide the char* to string conversion and free. Perhaps mpz_get_str can write directly into a string, to avoid copying. Consider making such string returning variants available for use with plain mpz_t etc too.

Miscellaneous

  • mpz_gcdext and mpn_gcdext ought to document what range of values the generated cofactors can take, and preferably ensure the definition uniquely specifies the cofactors for given inputs. A basic extended Euclidean algorithm or multi-step variant leads to |x|<|b| and |y|<|a| or something like that, but there are probably two solutions under just those restrictions.
  • demos/factorize.c: use mpz_divisible_ui_p rather than mpz_tdiv_q_ui. (Of course dividing by multiple primes at a time would be better still.)
  • The various test programs use quite a bit of the main libmpir. This establishes good cross-checks, but it might be better to use simple reference routines where possible. Where that's not possible, some attention could be paid to the order of the tests, so a libmpir routine is only used for tests once it seems to be good.
  • MUL_FFT_THRESHOLD etc: the FFT thresholds should allow a return to a previous k at certain sizes. This arises basically due to the step effect caused by the size multiples effectively used for each k. Looking at a graph makes it fairly clear.
  • __gmp_doprnt_mpf does a rather unattractive round-to-nearest on the string returned by mpf_get_str. Perhaps some variant of mpf_get_str could be made which would better suit.
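
The cofactor-range question for mpz_gcdext can be seen in a word-sized reference version of the extended Euclidean algorithm. The sketch below is illustrative only, not the mpn_gcdext implementation; it produces one valid (x, y) pair, and the non-uniqueness noted above comes from being able to shift x by b/g and y by a/g in opposite directions.

```cpp
#include <cassert>
#include <cstdlib>

// Reference extended Euclid on machine words.  For a, b > 0 it returns
// g = gcd(a, b) and sets x, y with a*x + b*y == g; the classic
// iteration keeps the cofactors small (roughly |x| < b and |y| < a),
// but more than one pair can satisfy such bounds, which is the
// uniqueness issue noted above.
long gcdext_ref(long a, long b, long *x, long *y)
{
    long x0 = 1, y0 = 0, x1 = 0, y1 = 1;
    while (b != 0) {
        long q = a / b;
        long t = a - q * b;  a = b;   b = t;   // remainder step
        t = x0 - q * x1;     x0 = x1; x1 = t;  // track x cofactor
        t = y0 - q * y1;     y0 = y1; y1 = t;  // track y cofactor
    }
    *x = x0;
    *y = y0;
    return a;
}
```

For example gcdext_ref(240, 46, &x, &y) gives g == 2 with x == 14, y == -73; the pair (x - 23, y + 120) also satisfies 240*x + 46*y == 2, since 23 == 46/g and 120 == 240/g.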

Aids to Development

  • Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check that they've been mp?_inited. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INITed variables used internally. If nothing else then consistency checks like size<=alloc, ptr not NULL and ptr+size not wrapping around the address space, would be possible. A more sophisticated scheme could track _mp_d pointers and ensure only a valid one is used. Such a scheme probably wouldn't be reentrant, not without some help from the system.
  • tune/time.c could try to determine at runtime whether getrusage and gettimeofday are reliable. Currently we pretend in configure that the dodgy m68k netbsd 1.4.1 getrusage doesn't exist. If a test might take a long time to run then perhaps cache the result in a file somewhere.
  • tune/time.c could choose the default precision based on the speed_unittime determined, independent of the method in use.
  • Cray vector systems: CPU frequency could be determined from sysconf(_SC_CLK_TCK), since it seems to be clock cycle based. Is this true for all Cray systems? Would like some documentation or something to confirm.
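
The cheap consistency checks suggested in the first item above might look like the following sketch. The struct is a stand-in with illustrative field names that mirror the usual alloc/size/pointer layout; it is not MPIR's actual internal representation.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the mpz header fields (illustrative names only).
struct fake_mpz
{
    int alloc;   // limbs allocated
    int size;    // signed: negative size means a negative value
    void *d;     // limb data pointer
};

// The kind of cheap validity ASSERT proposed above: |size| <= alloc and
// a non-NULL data pointer.  A real check could also guard ptr+size
// against wrapping around the address space.
bool looks_valid(const fake_mpz &z)
{
    int absize = z.size >= 0 ? z.size : -z.size;
    return z.d != NULL && absize <= z.alloc;
}
```

Such a predicate could sit behind an ASSERT macro compiled in only for debug builds, so release builds pay nothing.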

Documentation

  • mpz_inp_str (etc) doesn't say when it stops reading digits.
  • mpn_get_str isn't terribly clear about how many digits it produces. It'd probably be possible to say at most one leading zero, which is what both it and mpz_get_str currently do. But want to be careful not to bind ourselves to something that might not suit another implementation.
  • va_arg doesn't do the right thing with mpz_t etc directly, but instead needs a pointer type like MP_INT*. It'd be good to show how to do this, but we'd either need to document mpz_ptr and friends, or perhaps fallback on something slightly nasty with void*.
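
The va_arg point can be demonstrated without gmp.h. mpz_t is an array-of-one-struct type, and array types decay to pointers when passed through "...", so va_arg must name a pointer type. The stand-in typedef below imitates that shape; the names are hypothetical.

```cpp
#include <cassert>
#include <cstdarg>

// Imitation of mpz_t's array-of-one-struct shape (illustrative only).
struct fake_mpz_struct { int value; };
typedef fake_mpz_struct fake_mpz_t[1];

// Variadic arguments of array type arrive as pointers, so va_arg must
// be given the pointer type; va_arg(ap, fake_mpz_t) would be wrong.
int sum_values(int n, ...)
{
    va_list ap;
    va_start(ap, n);
    int total = 0;
    for (int i = 0; i < n; i++) {
        fake_mpz_struct *p = va_arg(ap, fake_mpz_struct *);
        total += p->value;
    }
    va_end(ap);
    return total;
}
```

A caller passes the array variables directly, e.g. sum_values(2, a, b), and they decay to pointers on the way in; this is exactly why documenting a pointer name like mpz_ptr (or MP_INT*) would help.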

Bright Ideas

The following may or may not be feasible, and aren't likely to get done in the near future, but are at least worth thinking about.

  • Reorganize longlong.h so that we can inline the operations even for the system compiler. When there is no such compiler feature, make calls to stub functions. Write such stub functions for as many machines as possible.
  • longlong.h could declare when it's using, or would like to use, mpn_umul_ppmm, and the corresponding umul.asm file could be included in libmpir only in that case, the same as is effectively done for __clz_tab. Likewise udiv.asm and perhaps cntlz.asm. This would only be a very small space saving, so perhaps not worth the complexity.
  • longlong.h could be built at configure time by concatenating or #including fragments from each directory in the mpn path. This would select CPU specific macros the same way as CPU specific assembler code. Code used would no longer depend on cpp predefines, and the current nested conditionals could be flattened out.
  • mpz_get_si returns 0x80000000 for -0x100000000, whereas it's sort of supposed to return the low 31 (or 63) bits. But this is undocumented, and perhaps not too important.
  • mpz_init_set* and mpz_realloc could allocate say an extra 16 limbs over what's needed, so as to reduce the chance of having to do a reallocate if the mpz_t grows a bit more. This could only be an option, since it'd badly bloat memory usage in applications using many small values.
  • mpq functions could perhaps check for numerator or denominator equal to 1, on the assumption that integers or denominator-only values might be expected to occur reasonably often.
  • count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring 0 to 2 trailing zeros cases specially.
  • mpf_t might like to let the exponent be undefined when size==0, instead of requiring it to be 0 as now. It should be possible to do size==0 tests before paying attention to the exponent. The advantage is not needing to set exp in the various places a zero result can arise, which avoids some tedium but is otherwise perhaps not too important. Currently mpz_set_f and mpf_cmp_ui depend on exp==0, maybe elsewhere too.
  • __gmp_allocate_func: Could use GCC __attribute__ ((malloc)) on this, though don't know if it'd do much. GCC 3.0 allows that attribute on functions, but not function pointers (see info node "Attribute Syntax"), so would need a new autoconf test. This can wait until there's a GCC that supports it.
  • mpz_add_ui contains two __GMPN_COPYs, one from mpn_add_1 and one from mpn_sub_1. If those two routines were opened up a bit, maybe that code could be shared. When a copy needs to be done there's no carry to append for the add, and if the copy is non-empty no high zero for the sub.
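
The count_trailing_zeros fast path suggested above can be sketched as follows: test the common 0-2 trailing-zero cases with cheap bit masks before falling back to a full count. The loop here merely stands in for whatever slow mechanism the CPU would otherwise use, and x must be non-zero.

```cpp
#include <cassert>

// Trailing-zero count with cheap early exits for the frequent cases.
// On uniformly distributed inputs, about half are odd (0), a quarter
// end in exactly one zero (1), an eighth in two (2); only the rest
// take the slow path.  Precondition: x != 0.
unsigned ctz_with_fast_path(unsigned long x)
{
    if (x & 1) return 0;   // odd: most common case
    if (x & 2) return 1;
    if (x & 4) return 2;
    unsigned n = 3;        // remaining ~1/8 of cases: slow fallback
    x >>= 3;
    while (!(x & 1)) {
        x >>= 1;
        n++;
    }
    return n;
}
```

Whether the three extra branches pay off depends on how slow the CPU's native count is; on machines with a fast bit-scan instruction the fallback alone would win.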

Old and Obsolete Stuff

The following tasks apply to chips or systems that are old and/or obsolete. It's unlikely anything will be done about them unless anyone is actively using them.

  • Sparc32: The integer based udiv_nfp.asm used to be selected by configure --nfp, but that option is gone now that autoconf is used. The file could go somewhere suitable in the mpn search if any chips might benefit from it, though it's possible we don't currently differentiate enough exact cpu types to do this properly.
  • VAX D and G format double floats are straightforward and could perhaps be handled directly in __gmp_extract_double and maybe in mpn_get_d, rather than falling back on the generic code. (Both formats are detected by configure.)