These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.
_mpz_realloc with a small (1 limb) size.
mpz_XXX(a,a,a).
For mpf_t numbers with exponents >2^53 on
     machines with 64-bit mp_exp_t, the precision of
     __mp_bases[base].chars_per_bit_exactly is insufficient and
     mpf_get_str aborts.  Detect and compensate.  Alternately,
     think seriously about using some sort of fixed-point integer value.
     Avoiding unnecessary floating point is probably a good thing in general,
     and it might be faster on some CPUs.
mpf_eq is not always correct, when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in mpf_sub for this
     situation; put similar code in mpf_eq.
mpf_eq doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.
mpf_set_str doesn't validate its exponent, for instance
     garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
     of a long is not detected.
mpf_add doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
mpf_div of x/x doesn't always give 1, reported by Peter
     Moulder.  Perhaps it suffices to put +1 on the effective divisor prec, so
     that data bits rather than zeros are shifted in when normalizing.  Would
     prefer to switch to mpn_tdiv_qr, where all shifting should
     disappear.
Initializing an array of function pointers with mpz_add etc doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Need to rearrange,
     perhaps to a set of calls to a test function rather than iterating over
     an array.
Automatic variables in main might be
     clobbered by the longjmp.
mpn_gcdext, mpz_get_d,
     mpf_get_str: Don't test count_leading_zeros for
     zero, instead check the high bit of the operand and avoid invoking
     count_leading_zeros.  This is an optimization on all
     machines, and significant on machines with slow
     count_leading_zeros, though it's possible an already
     normalized operand might not be encountered very often.
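     A minimal sketch of the check (high_limb and cnt are just local names
     here, the rest is as in longlong.h):

          /* Skip count_leading_zeros when the operand is already
             normalized; testing the high bit is cheap everywhere.  */
          if ((high_limb >> (BITS_PER_MP_LIMB - 1)) != 0)
            cnt = 0;
          else
            count_leading_zeros (cnt, high_limb);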
umul_ppmm to use floating-point for generating the
     most significant limb (if BITS_PER_MP_LIMB <= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
umul_ppmm code in longlong.h: Add partial
     products with fewer operations.
An inline mpz_set_ui would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an mpz_t,
     preventing the "lazy" allocation scheme below.
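     For illustration, roughly what such an inline might look like, assuming
     an unsigned long always fits in one limb and at least one limb is always
     allocated (neither is guaranteed in general):

          #define mpz_set_ui_inline(z, n)         \
            do {                                  \
              (z)->_mp_d[0] = (n);                \
              (z)->_mp_size = ((n) != 0);         \
            } while (0)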
mpz_[cft]div_ui and maybe
     mpz_[cft]div_r_ui could be inlined similarly.  A __gmp_divide_by_zero
     would be needed for the divide by zero test, unless that could be left to
     mpn_mod_1 (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
mpz_fits_s*_p could get the same treatment.  The setups for
     LONG_MAX etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include <limits.h>
     since there's no apparent way to get SHRT_MAX with an
     expression (since short and unsigned short can
     be different sizes).
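     For illustration only, a rough shape such an inline could take once the
     LONG_MAX setups are available (assumes a limb holds at least a long):

          #include <limits.h>

          int
          mpz_fits_slong_p_sketch (mpz_srcptr z)
          {
            int  size = z->_mp_size;
            if (size == 0)   return 1;
            if (size == 1)   return z->_mp_d[0] <= (mp_limb_t) LONG_MAX;
            if (size == -1)  return z->_mp_d[0] <= (mp_limb_t) LONG_MAX + 1;
            return 0;
          }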
mpz_powm and mpz_powm_ui aren't very
     fast on one or two limb moduli, due to a lot of function call
     overheads.  These could perhaps be handled as special cases.
mpz_powm and mpz_powm_ui want better
     algorithm selection, and the latter should use REDC.  Both could
     change to use an mpn_powm and mpn_redc.
mpz_powm REDC should do multiplications by g[]
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
mpz_powm REDC should handle even moduli if possible.  Maybe
     this would mean for m=n*2^k doing mod n using REDC and an auxiliary
     calculation mod 2^k, then putting them together at the end.
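     To make the recombination concrete, here is a hedged mpz-level sketch of
     the final put-together step (a real version would work at the mpn level;
     the function and variable names are invented, n is assumed odd, k>=1,
     r1 = x mod n and r2 = x mod 2^k):

          void
          crt_odd_2exp (mpz_ptr res, mpz_srcptr r1, mpz_srcptr r2,
                        mpz_srcptr n, unsigned long k)
          {
            mpz_t  t, ninv, twok;
            mpz_init (t);
            mpz_init (ninv);
            mpz_init_set_ui (twok, 1);
            mpz_mul_2exp (twok, twok, k);      /* 2^k */

            mpz_invert (ninv, n, twok);        /* n^-1 mod 2^k, n odd */
            mpz_sub (t, r2, r1);
            mpz_mul (t, t, ninv);
            mpz_fdiv_r_2exp (t, t, k);         /* ((r2-r1)*n^-1) mod 2^k */
            mpz_mul (t, t, n);
            mpz_add (res, r1, t);              /* == r1 mod n, == r2 mod 2^k */

            mpz_clear (t);
            mpz_clear (ninv);
            mpz_clear (twok);
          }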
mpn_gcd might be able to be sped up on small to
     moderate sizes by improving find_a, possibly just by
     providing an alternate implementation for CPUs with slowish
     count_leading_zeros.
USE_MORE_MPN could use a low to high cache localized
     evaluate and interpolate.  The necessary mpn_divexact_by3c
     exists.
mpn_mul_basecase on NxM with big N but small M could try for
     better cache locality by taking N piece by piece.  The current code could
     be left available for CPUs without caching.  Depending how karatsuba etc
     is applied to unequal size operands it might be possible to assume M is
     always smallish.
mpn_perfect_square_p on small operands might be better off
     skipping the residue tests and just taking a square root.
mpz_perfect_power_p could be improved in a number of ways.
     Test for Nth power residues modulo small primes like
     mpn_perfect_square_p does.  Use p-adic arithmetic to find
     possible roots.  Divisibility by other primes should be tested by
     grouping into a limb like PP.
mpz_perfect_power_p might like to use mpn_gcd_1
     instead of a private GCD routine.  The use it's put to isn't
     time-critical, and it might help ensure correctness to use the main GCD
     routine.
mpz_perfect_power_p could use
     mpz_divisible_ui_p instead of mpz_tdiv_ui for
     divisibility testing; the former is faster on a number of systems.  (But
     all that prime test stuff is going to be rewritten some time.)
PP/PP_INVERTED could become an array of such
     pairs, listing several hundred primes.  Perhaps actually make the
     products larger than one limb each.
PP can have factors of 2 introduced in order to get the high
     bit set and therefore a PP_INVERTED existing.  The factors
     of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
     further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
     The advantage of this is that mpn_preinv_mod_1 can then be
     used if it's faster than plain mpn_mod_1.  This would be a
     change only for 16-bit limbs, all the rest already have PP
     in the right form.
PP could have extra factors of 3 or 5 or whatever introduced
     if they fit, and final remainders mod 9 or 25 or whatever used, thereby
     making more efficient use of the mpn_mod_1 done.  On a
     16-bit limb it looks like PP could take an extra factor of
     3.
mpz_probab_prime_p, mpn_perfect_square_p and
     mpz_perfect_power_p could use mpn_mod_34lsub1
     to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
     3, 5, 7, 13 and 17 (factors of 2^24-1).  This could either replace the
     PP division currently done, or allow PP to do
     larger primes, depending how many residue tests seem worthwhile before
     launching into full root extractions or Miller-Rabin etc.
mpz_probab_prime_p (and maybe others) could code the
     divisibility tests like n%7 == 0 in the form
          #define MP_LIMB_DIVISIBLE_7_P(n) \
            ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)

     This would help compilers which don't know how to optimize divisions by
     constants, and would help current gcc (3.0) too since gcc forms a whole
     remainder rather than using a modular inverse and comparing.  This
     technique works for any odd modulus, and with some tweaks for even
     moduli too.  See Granlund and Montgomery "Division By Invariant
     Integers" section 9.
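     As a standalone check of the technique (nothing here is from the GMP
     sources; 0xB6DB6DB7 is 7^-1 mod 2^32, so the same comparison works on a
     plain 32-bit word):

          #include <assert.h>
          #include <stdint.h>

          #define DIVISIBLE_7_P(n) \
            ((uint32_t) ((n) * UINT32_C(0xB6DB6DB7)) <= UINT32_MAX / 7)

          int
          main (void)
          {
            uint32_t  i;
            for (i = 0; i < 100000; i++)
              assert (DIVISIBLE_7_P (i) == (i % 7 == 0));
            return 0;
          }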
mpz_probab_prime_p and mpz_nextprime could
     offer certainty for primes up to 2^32 by using a one-limb Miller-Rabin
     test to base 2, combined with a table of actual strong pseudoprimes in
     that range (2314 of them).  If that table is too big then both base 2 and
     base 3 tests could be done, leaving a table of 104.  The test could use
     REDC and therefore be a modlimb_invert, a remainder (maybe),
     then two multiplies per bit (successively dependent).  Processors with
     pipelined multipliers could do base 2 and 3 in parallel.  Vector systems
     could do a whole bunch of bases in parallel, and perhaps offer near
     certainty up to 64-bits (certainty might depend on an exhaustive search
     of pseudoprimes up to that limit).  Obviously 2^32 is not a big number,
     but an efficient and certain calculation is attractive.  It might find
     other uses internally, and could even be offered as a one limb prime test
     mpn_probab_prime_1_p or gmp_probab_prime_ui_p
     perhaps.
mpz_probab_prime_p doesn't need to make a copy of
     n when the input is negative; it can set up an
     mpz_t alias, same data pointer but a positive size.  With no
     need to clear before returning, the recursive function call could be
     dispensed with too.
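     A hedged sketch of the aliasing, using the gmp-impl.h field macros (the
     surrounding test code is omitted, and the alias must of course never be
     mpz_clear'd):

          mpz_t  nabs;
          PTR (nabs) = PTR (n);       /* share the limb data, no copy */
          SIZ (nabs) = ABSIZ (n);     /* force a positive size */
          ALLOC (nabs) = ALLOC (n);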
mpf_set_str produces low zero limbs when a string has a
     fraction but is exactly representable, eg. 0.5 in decimal.  These could be
     stripped to save work in later operations.
mpz_and, mpz_ior and mpz_xor should
     use mpn_and_n etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the mpz
     routines?
mpf_add: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
mpf_add: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case, whereas
     currently it looks like an MPN_COPY_INCR will be done to
     reduce prec+1 limbs to prec.
mpn_divrem_2 could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     mpn_tdiv_qr.
mpf_sqrt_ui calculates prec+1 limbs, whereas just prec would
     satisfy the application requested precision.  It should suffice to simply
     reduce the rsize temporary to 2*prec-1 limbs.  mpf_sqrt
     might be similar.
invert_limb generic C: The division could use dividend
     b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
     (b-d):0, where b=2^BITS_PER_MP_LIMB and d=divisor.  The
     former is per the original paper and is used in the x86 code, the
     advantage is that the current special case for 0x80..00 could be dropped.
     The two should be equivalent, but a little check of that would be wanted.
mpq_cmp_ui could form the num1*den2 and
     num2*den1 products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
mpq_cmp could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
mpn_add_1, mpn_sub_1, mpn_add,
     mpn_sub: Internally use __GMPN_ADD_1 etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with inline recognised in gmp.h.
     __GMPN_ADD_1 etc are meant mostly to support application
     inline mpn_add_1 etc and if they don't come out good for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
__gmp_extract_double on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing long long bit fields
     when that's the only actual 64-bit type.
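     A hedged sketch of the single-field variant, assuming a 64-bit unsigned
     long, a compiler accepting a 52-bit field of that type, and the usual
     little-endian bitfield layout (the current union in gmp-impl.h splits
     the mantissa into two fields):

          union double_extract64
          {
            struct
            {
              unsigned long man:52;   /* whole mantissa in one field */
              unsigned long exp:11;
              unsigned long sig:1;
            } s;
            double d;
          };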
mpf_get_d could be more like mpz_get_d and do
     more in integers and give the float conversion as such a chance to round
     in its preferred direction.  Some code sharing ought to be possible.  Or
     if nothing else then for consistency the two ought to give identical
     results on integer operands (not clear if this is so right now).
usqr_ppm or some such could do a widening square in the
     style of umul_ppmm.  This would help 68000, and be a small
     improvement for the generic C (which is used on UltraSPARC/64 for
     instance).  GCC recognises the generic C ul*vh and vl*uh are identical,
     but does two separate additions to the rest of the result.
TMP_FREE releases all memory, so there's an allocate and free every time a
     top-level function using TMP is called.  Memory could instead be kept
     cached between calls, though mp_set_memory_functions would then need to
     tell tal-notreent.c to release any cached memory when changing
     allocation functions.
__gmp_tmp_alloc from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required then a call to some sort
     of __gmp_tmp_increase would be needed.  The requirement that
     TMP_ALLOC is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
#define TMP_ALLOC(n)                                          \
  ((ROUND_UP (n) > current->end - current->point              \
    ? __gmp_tmp_increase (ROUND_UP (n)) : 0),                 \
   current->point += ROUND_UP (n),                            \
   current->point - ROUND_UP (n))
__mp_bases has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
__mp_bases field big_base_inverted is only used
     if USE_PREINV_DIVREM_1 is true, and could be omitted
     otherwise, to save space.
mpf_get_str and mpf_set_str should call the
     corresponding, much faster, mpn functions.
mpn_mod_1 could pre-calculate values of R mod N, R^2 mod N,
     R^3 mod N, etc, with R=2^BITS_PER_MP_LIMB, and use them to
     process multiple limbs at each step by multiplying.  Suggested by Peter
     L. Montgomery.
mpz_get_str, mtox: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the mpn_get_str output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
mpz_gcdext: If the caller requests only the S cofactor (of
     A), and A<B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling mpn_gcdext with A+B,B.  This
     might only be an advantage if A and B are about the same size.
mpn_toom3_mul_n, mpn_toom3_sqr_n: Temporaries
     B and D are adjacent in memory and at the final
     coefficient additions look like they could use a single
     mpn_add_n of l4 limbs rather than two of
     l2 limbs.
udiv_qrnnd_preinv2norm, the branch-free version of
     udiv_qrnnd_preinv, might be faster on various pipelined
     chips.  In particular the first if (_xh != 0) in
     udiv_qrnnd_preinv might be roughly a 50/50 chance and might
     branch predict poorly.  (The second test is probably almost always
     false.)  Measuring with the tuneup program would be possible, but perhaps
     a bit messy.  In any case maybe the default should be the branch-free
     version.
     The current udiv_qrnnd_preinv2norm implementation
     assumes a right shift will sign extend, which is not guaranteed by the C
     standards, and doesn't happen on Cray vector systems.
      #if defined (__GNUC__)
      #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
      ...
      #endif
      #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
      ...
      #endif
      #ifndef MUL_KARATSUBA_THRESHOLD
      /* Default GNUC values */
      ...
      #endif
      #else /* system compiler */
      ...
      #endif
invert_limb on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
mpn_mul_1,
     mpn_addmul_1, and mpn_submul_1.
mpn_mul_1, mpn_addmul_1,
     and mpn_submul_1 for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
ctlz and cttz for
     count_leading_zeros and count_trailing_zeros.
     Use inline for gcc, probably want asm files for elsewhere.
umul_ppmm to call
     __umulsidi3 in libgcc.  Could be copied straight across, but
     perhaps ought to be tested.
The clz instruction can be used for
     count_leading_zeros.
mpn_divexact_by3 isn't particularly important, but
     the generic C runs at about 27 c/l, whereas with the multiplies off the
     dependent chain about 3 c/l ought to be possible.
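     For reference, the usual exact-division-by-3 recurrence in standalone
     32-bit form (local names, not GMP's code); the dependent chain referred
     to runs through c:

          #include <stdint.h>

          #define INV3  UINT32_C(0xAAAAAAAB)  /* 3*0xAAAAAAAB == 1 mod 2^32 */

          /* {rp,n} = {up,n} divided by 3, least significant word first;
             returns 0 when the division really was exact.  */
          static uint32_t
          divexact_by3 (uint32_t *rp, const uint32_t *up, int n)
          {
            uint32_t  c = 0;
            int  i;
            for (i = 0; i < n; i++)
              {
                uint32_t  l = up[i] - c;
                c = (l > up[i]);                  /* borrow */
                l *= INV3;
                rp[i] = l;
                c += (l > UINT32_MAX / 3);        /* carry 1 or 2 into */
                c += (l > (UINT32_MAX / 3) * 2);  /* the next word     */
              }
            return c;
          }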
mpn_hamdist could be put together based on the
     current mpn_popcount.
popc_limb in gmp-impl.h could use the
     popcnt insn.
mpn_submul_1 is not implemented directly, only via
     a combination of mpn_mul_1 and mpn_sub_n.
mpn_mul_1, mpn_addmul_1,
     for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 < 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
mpn_submul_1, analogous to
     mpn_addmul_1.
umul_ppmm.  Using four
     "mulx"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "mulx"s.
mpn_divrem_1, mpn_mod_1,
     mpn_divexact_1 and mpn_modexact_1_odd could
     process 32 bits at a time when the divisor fits 32-bits.  This will need
     only 4 mulx's per limb instead of 8 in the general case.
mpn_lshift, mpn_rshift.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
mulx for umul_ppmm if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
__sparc_v9__
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because mulx for multiplying is commented out.)
modlimb_invert might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since mulx speed depends on the highest bit set.  Either
     explicit masks or small types like short and
     int ought to work.
mpn_popcount and mpn_hamdist
     could use popc currently commented out in gmp-impl.h.  This
     chip reputedly implements popc properly (see gcc sparc.md); it would
     need to be recognised as sparchalr1 or something
     in configure / config.sub / config.guess.
mpn_addmul_1, mpn_submul_1, and
     mpn_mul_1.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 < 2^32;
     it should be possible to make them run at about 5 cycles/limb.
mpn_addmul_1, mpn_submul_1, and
     mpn_mul_1.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
mpn_rshift could do the same sort of unrolled loop
     as mpn_lshift.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
mpn_rshift should do the same sort of unrolled
     loop as mpn_lshift.
mpn_mul_basecase and mpn_sqr_basecase
     for important machines.  Helping the generic sqr_basecase.c with an
     mpn_sqr_diagonal might be enough for some of the RISCs.
mpn_lshift/mpn_rshift.
     Will bring time from 1.75 to 1.25 cycles/limb.
mpn_lshift for shifts by 1.  (See
     Pentium code.)
An inlined rep
     movs was thought to upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses rep movs itself for
     __builtin_memcpy.  Examine the code for some simple and
     complex functions to find out.  Inlining rep movs would be
     desirable, it'd be both smaller and faster.
mpn_lshift and mpn_rshift can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     shrdl and shldl, see mpn/x86/pentium/README.
mpn_lshift and mpn_rshift
     might benefit from some destination prefetching.
mpn_divrem_1 might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
mpn_add_n and mpn_sub_n should be able to go
     faster than the generic x86 code at 3.5 c/l.  The athlon code for instance
     runs at about 2.7.
mpn_lshift and mpn_rshift might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
mpn_lshift.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
mpn_rshift based on new mpn_lshift.
mpn_add_n and mpn_sub_n.  Should
     run at just 3.25 cycles/limb.
mpn_mul_basecase and
     mpn_sqr_basecase.  This should use a "vertical multiplication
     method", to avoid carry propagation.  splitting one of the operands in
     11-bit chunks.
mpn_lshift by 31 should use the special rshift
     by 1 code, and vice versa mpn_rshift by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     mpn_rshift already provided.
mpn_com_n and mpn_and_n etc very probably
     want a pragma like MPN_COPY_INCR.
mpn_lshift, mpn_rshift,
     mpn_popcount and mpn_hamdist are nice and small
     and could be inlined to avoid function calls.
TMP_ALLOC to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray don't offer an
     alloca.
mpn_get_str on power-of-2 bases ought to vectorize.
     Does it?  bits_per_digit and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
popc_limb could use the Cray _popc
     intrinsic.  That would help mpz_hamdist and might make the
     generic C versions of mpn_popcount and
     mpn_hamdist suffice for Cray (if it vectorizes, or can be
     given a hint to do so).
mpn_mul_1, mpn_addmul_1,
     mpn_submul_1: Check for a 16-bit multiplier and use two
     multiplies per limb, not four.
mpn_lshift and mpn_rshift could use a
     roll and mask instead of lsrl and
     lsll.  This promises to be a speedup, effectively trading a
     6+2*n shift for one or two 4 cycle masks.  Suggested by Jean-Charles
     Meyrignac.
count_leading_zeros for 64-bit machines:
  
	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   ... 
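     A complete hedged version of that sketch for a 64-bit limb (like
     count_leading_zeros generally, it assumes x != 0):

          #include <stdint.h>

          static int
          clz64 (uint64_t x)
          {
            int  cnt = 0;
            if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
            if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
            if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
            if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
            if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
            if ((x >> 63) == 0) { cnt += 1; }
            return cnt;
          }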
__inline which could perhaps
     be used in __GMP_EXTERN_INLINE.  What would be the right way
     to identify suitable versions of that compiler?
VAX D and G format double floats are straightforward and
     could perhaps be handled directly in __gmp_extract_double
     and maybe in mpz_get_d, rather than falling back on the
     generic code.  (Both formats are detected by configure.)
mpn_get_str final divisions by the base with
     udiv_qrnd_unnorm could use some sort of multiply-by-inverse
     on suitable machines.  This ends up happening for decimal by presenting
     the compiler with a run-time constant, but the same for other bases would
     be good.  Perhaps use could be made of the fact base<256.
mpn_umul_ppmm, mpn_udiv_qrnnd: Return a
     structure like div_t to avoid going through memory, in
     particular helping RISCs that don't do store-to-load forwarding.  Clearly
     this is only possible if the ABI returns a structure of two
     mp_limb_ts in registers.
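     A hedged sketch of the struct-return style (the type and function names
     here are invented; umul_ppmm is the longlong.h macro):

          typedef struct { mp_limb_t hi, lo; } mp_limb_pair_t;

          mp_limb_pair_t
          mpn_umul_ppmm_struct (mp_limb_t a, mp_limb_t b)
          {
            mp_limb_pair_t  r;
            umul_ppmm (r.hi, r.lo, a, b);
            return r;
          }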
mp?_out_raw and
     mp?_inp_raw.
mpz_get_nth_ui.  Return the nth word (not necessarily the
     nth limb).
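     One possible reading, sketched with the public API only ("word" taken
     here as unsigned long; a real version would just index the limb data):

          unsigned long
          mpz_get_nth_ui_sketch (mpz_srcptr z, unsigned long nth)
          {
            mpz_t  t;
            unsigned long  w;
            mpz_init (t);
            mpz_tdiv_q_2exp (t, z, nth * 8 * sizeof (unsigned long));
            w = mpz_get_ui (t);
            mpz_clear (t);
            return w;
          }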
mpz_crr (Chinese Remainder Reconstruction).
mpz_init and mpq_init could do lazy allocation.
     Set ALLOC(var) to 0 to indicate nothing allocated, and let
     _mpz_realloc do the initial alloc.  Set
     z->_mp_d to a dummy that mpz_get_ui and
     similar can unconditionally fetch from.  Niels Möller has had a go at
     this.
     mpz_init and then
          more or less immediately reallocating.
     mpz_init would only store magic values in the
          mpz_t fields, and could be inlined.
     A static initializer mpz_t z = MPZ_INITIALIZER; might be convenient
          for globals (see the sketch after this list).
     mpz_set_ui and other similar routines needn't check the
          size allocated and can just store unconditionally.
     mpz_set_ui and perhaps others like
          mpz_tdiv_r_ui and a prospective
          mpz_set_ull could be inlined.
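     A hedged sketch of what the initializer might look like (the dummy limb
     is invented for illustration; the field order follows __mpz_struct):

          extern mp_limb_t  __gmp_dummy_limb;   /* readable, never written */

          #define MPZ_INITIALIZER  {{ 0, 0, &__gmp_dummy_limb }}

          mpz_t  z = MPZ_INITIALIZER;           /* usable without mpz_init */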
mpf_out_raw and mpf_inp_raw.  Make sure
     format is portable between 32-bit and 64-bit machines, and between
     little-endian and big-endian machines.
mpn_and_n ... mpn_copyd: Perhaps make the mpn
     logops and copys available in gmp.h, either as library functions or
     inlines, with the availability of library functions instantiated in the
     generated gmp.h at build time.
mpz_set_str etc variants taking string lengths rather than
     null-terminators.
<=" rather than "<", so a threshold can
     be set to MP_SIZE_T_MAX to get only the simpler code (the
     compiler will know size <= MP_SIZE_T_MAX is always true).
     Alternately it looks like the ABOVE_THRESHOLD and
     BELOW_THRESHOLD macros can do this adequately, and also pick
     up cases where a threshold of zero should mean only the second algorithm.
mpz_nthprime.
mpz_init2, initializing and making initial room for
     N bits.  The actual size would be rounded up to a limb, and perhaps an
     extra limb added since so many mpz routines need that on
     their destination.
mpz_andn, mpz_iorn, mpz_nand,
     mpz_nior, mpz_xnor might be useful additions,
     if they could share code with the current such functions (which should be
     possible).
mpz_and_ui etc might be of use sometimes.  Suggested by
     Niels Möller.
mpf_set_str and mpf_inp_str could usefully
     accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
     decimal in this case, with a further 0x, 0b etc allowed there.
     Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
     probably something like "0.123" ought not mean octal.
GMP_LONG_LONG_LIMB or some such could become a documented
     feature of gmp.h, so applications could know whether to
     printf a limb using %lu or %Lu.
PRIdMP_LIMB and similar defines following C99
     <inttypes.h> might be of use to applications printing limbs.
     Perhaps they should be defined only if specifically requested, the way
     <inttypes.h> does.  But if GMP_LONG_LONG_LIMB or
     whatever is added then perhaps this can easily enough be left to
     applications.
mpf_get_ld and mpf_set_ld converting
     mpf_t to and from long double.  Other
     long double routines would be desirable too, but these would
     be a start.  Often long double is the same as
     double, which is easy but pretty pointless.  Should
     recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
     sparc, hppa and power.  Might like an ABI sub-option or something when
     it's a compiler option for 64-bit or 128-bit long double.
gmp_printf could accept %b for binary output.
     It'd be nice if it worked for plain int etc too, not just
     mpz_t etc.
gmp_printf in fact could usefully accept an arbitrary base,
     for both integer and float conversions.  A base either in the format
     string or as a parameter with * should be allowed.  Maybe
     &13b (b for base) or something like that.
gmp_printf could perhaps have a type code for an
     mp_limb_t.  That would save an application from having to
     worry whether it's a long or a long long.
gmp_printf could perhaps accept mpq_t for float
     conversions, eg. "%.4Qf".  This would be merely for
     convenience, but still might be useful.  Rounding would be the same as
     for an mpf_t (ie. currently round-to-nearest, but not
     actually documented).  Alternately, perhaps a separate
     mpq_get_str_point or some such might be more use.  Suggested
     by Pedro Gimeno.
gmp_printf could usefully accept a flag to control the
     rounding of float conversions.  This wouldn't do much for
     mpf_t, but would be good if mpfr_t was
     supported in the future, or perhaps for mpq_t.  Something
     like &*r (r for rounding, and mpfr style
     GMP_RND parameter).
mpz_combit to toggle a bit would be a good companion for
     mpz_setbit and mpz_clrbit.  Suggested by Niels
     Möller (who has done some work towards it).
mpz_scan0_reverse or mpz_scan0low or some such
     searching towards the low end of an integer might match
     mpz_scan0 nicely.  Likewise for scan1.
     Suggested by Roberto Bagnara.
mpz_bit_subset or some such to test whether one integer is a
     bitwise subset of another might be of use.  Some sort of return value
     indicating whether it's a proper or non-proper subset would be good and
     wouldn't cost anything in the implementation.  Suggested by Roberto
     Bagnara.
gmp_randinit_r and maybe gmp_randstate_set to
     init-and-copy or to just copy a gmp_randstate_t.  Suggested
     by Pedro Gimeno.
mpf_get_ld, mpf_set_ld: Conversions between
     mpf_t and long double, suggested by Dan
     Christensen.  There'd be some work to be done by configure
     to recognise the format in use.  xlc on aix for instance apparently has
     an option for either plain double 64-bit or quad 128-bit precision.  This
     might mean library contents vary with the compiler used to build, which
     is undesirable.  It might be possible to detect the mode the application
     is compiling with, and try to avoid mismatch problems.
mpz_sqrt_if_perfect_square: When
     mpz_perfect_square_p does its tests it calculates a square
     root and then discards it.  For some applications it might be useful to
     return that root.  Suggested by Jason Moxham.
mpz_get_ull, mpz_set_ull,
     mpz_get_sll, mpz_set_sll: Conversions for
     long long.  These would aid interoperability, though a
     mixture of GMP and long long would probably not be too
     common.  Disadvantages of using long long in libgmp.a would
     be
     gmp.h would need an #ifdef block to decide if the
          application compiler could take the long long
          prototypes.
     A define like LIBGMP_HAS_LONGLONG would be wanted to
          indicate whether the functions are available.  (Applications using
          autoconf could probe the library too.)
     Alternately, use of long long could be left to
     application compile time, by having something like
     mpz_set_2ui called with two halves of a long
     long (see the sketch after this list).  Disadvantages of this would be,
     long
          long is normally passed as two halves anyway.
     mpz_get_ull would be a rather big inline, or would have
          to be two function calls.
     mpz_get_sll would be a worse inline, and would put the
          treatment of -0x10..00 into applications (see
          mpz_get_si correctness above).
     Using long long in libgmp.a is probably the lesser evil, if only
     because it makes best use of gcc.
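     A hedged sketch of the two-halves approach for an mpz_set_ull, assuming
     a 64-bit long long and using only existing entrypoints:

          void
          mpz_set_ull_sketch (mpz_ptr z, unsigned long long v)
          {
            /* high 32 bits; the double shift avoids width problems */
            mpz_set_ui (z, (unsigned long) ((v >> 16) >> 16));
            mpz_mul_2exp (z, z, 32);
            mpz_add_ui (z, z, (unsigned long) (v & 0xFFFFFFFF));
          }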
mpz_strtoz parsing the same as strtol.
     Suggested by Alexander Kruppa.
The GMP_C_DOUBLE_FORMAT test seems to work
     well.  Get rid of the #ifdef mess in gmp-impl.h and use the
     results of the test instead.
umul_ppmm in longlong.h always uses umull,
     but is that available only for M series chips or some such?  Perhaps it
     should be configured in some way.
ia64-*-hpux*.  Does GMP need to know anything about that?
AC_C_BIGENDIAN seems the best way to handle that for GMP.
*-*-aix*.  It might be more reliable to do some sort of
     feature test, examining the compiler output perhaps.  It might also be
     nice to merge the aix.m4 files into powerpc-defs.m4.
config.guess recognises various exact sparcs; make
     use of that information in configure (work on this is in
     progress).
udiv should be selected
     according to the CPU target.  Currently floating point ends up being
     used on all sparcs, which is probably not right for generic V7 and V8.
Using -xtarget=native with cc is
     incorrect when cross-compiling, the target should be set according to the
     configured $host CPU.
Use .balignw <n>,0x4e7f to get
     nops, if ALIGN is going to be used for anything that's
     executed across.
Some CPU directories have umul and udiv code not being
     used.  Check all such for bit rot and then put umul and udiv in
     $gmp_mpn_functions_optional as "standard optional" objects.
     mpn_umul_ppmm and mpn_udiv_qrnnd have a
     different parameter order than those functions on other CPUs.  It might
     avoid confusion to have them under different names, maybe
     mpn_umul_ppmm_r or some such.  Prototypes then wouldn't
     be conditionalized, and the appropriate form could be selected with the
     HAVE_NATIVE scheme if/when the code switches to use a
     PROLOGUE style.
DItype: The setup in gmp-impl.h for non-GCC could use an
     autoconf test to determine whether long long is available.
AC_OUTPUT
     would work, but it might upset "make" to have things like L$
     get into the Makefiles through AC_SUBST.
     AC_CONFIG_COMMANDS would be the alternative.  With some
     careful m4 quoting the changequote calls might not be
     needed, which might free up the order in which things had to be output.
make distclean: Only the mpn directory links which were
     created are removed, but perhaps all possible links should be removed, in
     case someone runs configure a second time without a
     distclean in between.  The only tricky part would be making
     sure all possible extra_functions are covered.
-mno-cygwin.  For --host=*-*-mingw32* it might
     be convenient to automatically use that option, if it works.  Needs
     someone with a dual cygwin/mingw setup to test.
Automake has a CCAS, CCASFLAGS
     scheme.  Though we probably wouldn't be using its assembler support we
     could try to use those variables in compatible ways.
_gmp_rand is not particularly fast on the linear
     congruential algorithm and could stand various improvements.
     gmp_randstate_t (or
          _mp_algdata rather) to save some copying.
     A special case for a single limb 2exp modulus, to
          avoid mpn_mul calls.  Perhaps the same for two limbs.
     Inline the lc code, to avoid a function call and
          TMP_ALLOC for every chunk.
     The special case for seedn==0 will be very rarely used,
          and on that basis seems unnecessary.
     The 2exp and general LC cases should be split,
          for clarity (if the general case is retained).
gmp_randinit_mers for a Mersenne Twister generator.  It's
     likely to be more random and about the same speed as Knuth's 55-element
     Fibonacci generator, and can probably become the default.  Pedro Gimeno
     has started on this.
gmp_randinit_lc: Finish or remove.  Doing a division for
     every step won't be very fast, so check whether the usefulness of
     this algorithm can be justified.  (Consensus is that it's not useful and
     can be removed.)
gmp_randinit_bbs would be wanted, not the currently
     commented out case in gmp_randinit.
_gmp_rand could be done as a function pointer within
     gmp_randstate_t (or rather in the _mp_algdata
     part), instead of switching on a gmp_randalg_t.  Likewise
     gmp_randclear, and perhaps gmp_randseed if it
     became algorithm-specific.  This would be more modular, and would ensure
     only code for the desired algorithms is dragged into the link.
mpz_urandomm should do something for n<=0, but what?
The mpz_urandomm implementation looks like it could be improved.
     Perhaps it's enough to calculate nbits as ceil(log2(n)) and
     call _gmp_rand until a value <n is obtained.
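     A sketch of that approach using only documented calls (an internal
     version would use _gmp_rand directly; r and n must be distinct here):

          void
          urandomm_sketch (mpz_ptr r, gmp_randstate_t state, mpz_srcptr n)
          {
            size_t  nbits = mpz_sizeinbase (n, 2);   /* >= ceil(log2(n)) */
            do
              mpz_urandomb (r, state, nbits);
            while (mpz_cmp (r, n) >= 0);
          }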
gmp_randstate_t used for parameters perhaps should become
     gmp_randstate_ptr the same as other types.
mpz_div and mpz_divmod use rounding
     analogous to mpz_mod.  Document, and list as an
     incompatibility.
mpz_gcdext and mpn_gcdext ought to document
     what range of values the generated cofactors can take, and preferably
     ensure the definition uniquely specifies the cofactors for given inputs.
     A basic extended Euclidean algorithm or multi-step variant leads to
     |x|<|b| and |y|<|a| or something like that, but there's probably
     two solutions under just those restrictions.
mpz_invert should call mpn_gcdext directly.
mpz_divisible_ui_p rather than
     mpz_tdiv_qr_ui.  (Of course dividing multiple primes at a
     time would be better still.)
The test programs use quite a bit of libgmp.  This establishes good
     cross-checks, but it might be
     better to use simple reference routines where possible.  Where it's not
     possible some attention could be paid to the order of the tests, so a
     libgmp routine is only used for tests once it seems to be
     good.
mpf_set_q is very similar to mpf_div, it'd be
     good for the two to share code.  Perhaps mpf_set_q should
     make some mpf_t aliases for its numerator and denominator
     and just call mpf_div.  Both would be simplified a good deal
     by switching to mpn_tdiv_qr perhaps making them small enough
     not to bother with sharing (especially since mpf_set_q
     wouldn't need to watch out for overlaps).
The PowerPC time base registers (read via mftb and
     mftbu) could be used for the speed and tune programs.  Would
     need to know its frequency of course.  Usually it's 1/4 of bus speed
     (eg. 25 MHz) but some chips drive it from an external input.  Probably
     have to measure to be sure.
MUL_FFT_THRESHOLD etc: the FFT thresholds should allow a
     return to a previous k at certain sizes.  This arises basically due to
     the step effect caused by size multiples effectively used for each k.
     Looking at a graph makes it fairly clear.
__gmp_doprnt_mpf does a rather unattractive round-to-nearest
     on the string returned by mpf_get_str.  Perhaps some variant
     of mpf_get_str could be made which would better suit.
Add ASSERTs at the start of each user-visible mpz/mpq/mpf
     function to check the validity of each mp?_t parameter, in
     particular to check they've been mp?_inited.  This might
     catch elementary mistakes in user programs.  Care would need to be taken
     over MPZ_TMP_INITed variables used internally.  If nothing
     else then consistency checks like size<=alloc, ptr not
     NULL and ptr+size not wrapping around the address space,
     would be possible.  A more sophisticated scheme could track
     _mp_d pointers and ensure only a valid one is used.  Such a
     scheme probably wouldn't be reentrant, not without some help from the
     system.
Add a test to check whether getrusage and gettimeofday are reliable.
     Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
     getrusage doesn't exist.  If a test might take a long time
     to run then perhaps cache the result in a file somewhere.
Document that unsigned long int is used for
     bit counts/ranges, and that mp_size_t is used for limb counts.
mpz_inp_str (etc) doesn't say when it stops reading digits.
mpn_umul_ppmm, and the corresponding umul.asm file could be
     included in libgmp only in that case, the same as is effectively done for
     __clz_tab.  Likewise udiv.asm and perhaps cntlz.asm.  This
     would only be a very small space saving, so perhaps not worth the
     complexity.
mpz_get_si returns 0x80000000 for -0x100000000, whereas it's
     sort of supposed to return the low 31 (or 63) bits.  But this is
     undocumented, and perhaps not too important.
mpz_*_ui division routines currently return abs(a%b).
     Perhaps make them return the real remainder instead?  Return type would
     be signed long int.  But this would be an incompatible
     change, so it might have to be under newly named functions.
mpz_init_set* and mpz_realloc could allocate
     say an extra 16 limbs over what's needed, so as to reduce the chance of
     having to do a reallocate if the mpz_t grows a bit more.
     This could only be an option, since it'd badly bloat memory usage in
     applications using many small values.
mpq functions could perhaps check for numerator or
     denominator equal to 1, on the assumption that integers or
     denominator-only values might be expected to occur reasonably often.
count_trailing_zeros is used on more or less uniformly
     distributed numbers in a couple of places.  For some CPUs
     count_trailing_zeros is slow and it's probably worth handling
     the frequently occurring 0 to 2 trailing zeros cases specially.
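     The sort of special-casing meant, as a sketch:

          if (x & 1)
            c = 0;
          else if (x & 2)
            c = 1;
          else if (x & 4)
            c = 2;
          else
            count_trailing_zeros (c, x);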
mpf_t might like to let the exponent be undefined when
     size==0, instead of requiring it 0 as now.  It should be possible to do
     size==0 tests before paying attention to the exponent.  The advantage is
     not needing to set exp in the various places a zero result can arise,
     which avoids some tedium but is otherwise perhaps not too important.
     Currently mpz_set_f and mpf_cmp_ui depend on
     exp==0, maybe elsewhere too.
__gmp_allocate_func: Could use GCC __attribute__
     ((malloc)) on this, though don't know if it'd do much.  GCC 3.0
     allows that attribute on functions, but not function pointers (see info
     node "Attribute Syntax"), so would need a new autoconf test.  This can
     wait until there's a GCC that supports it.
mpz_add_ui contains two __GMPN_COPYs, one from
     mpn_add_1 and one from mpn_sub_1.  If those two
     routines were opened up a bit maybe that code could be shared.  When a
     copy needs to be done there's no carry to append for the add, and if the
     copy is non-empty no high zero for the sub.  The same applies to mpz_add
     when one operand is longer than the other, and to mpz_com
     since it's just -(x+1).
restrict'ed pointers: Does the C99 definition of restrict
     (one writer many readers, or whatever it is) suit the GMP style "same or
     separate" function parameters?  If so, judicious use might improve the
     code generated a bit.  Do any compilers have their own flavour of
     restrict as "completely unaliased", and is that still usable?
ABI
     option, selecting _SHORT_LIMB in gmp.h.  Naturally a new set
     of asm subroutines would be necessary.  Would need new
     mpz_set_ui etc since the current code assumes limb>=long,
     but 2-limb operand forms would find a use for long long on
     other processors too.
A one-limb modulus m could have powers "p[i] = 2^(i*BITS_PER_MP_LIMB) mod
     m" precalculated, then for the input limbs x calculating an inner product "sum
     p[i]*x[i]", and a final 3x1 limb remainder mod m.  If those powers take
     roughly N divide steps to calculate then there'd be an advantage any time
     the same m is used three or more times.  Suggested by Victor Shoup in
     connection with chinese-remainder style decompositions, but perhaps with
     other uses.