XBLAS - Extra Precise Basic Linear Algebra Subroutines
======================================================

		
1. Introduction
---------------

		
This library of routines is part of a reference implementation for the Dense
and Banded BLAS routines, along with their Extended and Mixed Precision
versions, as documented in Chapters 2 and 4 of the new BLAS Standard, which is
available from: http://www.netlib.org/blas/blast-forum/.

		
*EXTENDED PRECISION* is only used internally; the input and output arguments
remain the same as in the existing BLAS.  At present, we only allow Single,
Double, or Extra internal precision.  Extra precision is implemented as
double-double precision (128-bit total, 106-bit significand).  The routines
for the double-double precision basic arithmetic operations +, -, *, / were
developed by David Bailey.
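
To give the flavor of how double-double arithmetic works, here is a textbook
sketch of the underlying error-free transformation (illustrative C, not
XBLAS's actual source; it must be compiled without value-unsafe
floating-point optimizations):

[source,c]
-------------------------------------------------------------------
#include <stdio.h>

/* A double-double value is the unevaluated sum hi + lo of two
   doubles, giving roughly 106 significand bits. */
typedef struct { double hi, lo; } dd_t;

/* Knuth's TWO-SUM: returns fl(a + b) in hi and the exact rounding
   error of that addition in lo. */
static dd_t two_sum(double a, double b)
{
    dd_t r;
    r.hi = a + b;
    double bb = r.hi - a;
    r.lo = (a - (r.hi - bb)) + (b - bb);
    return r;
}

/* Add an ordinary double to a double-double and renormalize. */
static dd_t dd_add_d(dd_t x, double y)
{
    dd_t s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);
}

int main(void)
{
    dd_t acc = { 0.0, 0.0 };
    acc = dd_add_d(acc, 1.0);
    acc = dd_add_d(acc, 1e-30);  /* survives in lo, lost in a double */
    printf("hi = %.17g  lo = %.17g\n", acc.hi, acc.lo);
    return 0;
}
-------------------------------------------------------------------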

		
We have designed all our routines assuming that single precision arithmetic is
actually done in IEEE single precision (32 bits) and that double precision
arithmetic is actually done in IEEE double precision (64 bits). The routines
also pass our tests on an Intel machine with 80-bit floating point registers.

		
*MIXED PRECISION* permits some input/output arguments to be of different types
(mixing real and complex) or precisions (mixing single and double).
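
For example, the C bindings encode the operand types in the routine name.
The declaration below is paraphrased from that naming scheme, not copied
from the headers; consult the installed XBLAS headers for the authoritative
prototypes:

[source,c]
-------------------------------------------------------------------
#include "blas_extended.h"  /* for enum blas_conj_type; header name
                               may differ per installation */

/* Paraphrased illustration: a mixed-precision DOT that reads single
   precision x and y but accumulates and returns a double precision
   result (the _s_s suffix names the vector types). */
extern void BLAS_ddot_s_s(enum blas_conj_type conj, int n, double alpha,
                          const float *x, int incx, double beta,
                          const float *y, int incy, double *r);
-------------------------------------------------------------------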

		
The purpose of this implementation is to serve as a proof of concept, showing
that the considerable complexity of the specification is actually
implementable and testable with a manageable amount of software.  We have not
attempted to optimize performance, but our code should be as good as
straightforward but careful code written by hand.

		
2. Download
-----------

		
http://www.netlib.org/xblas/xblas.tar.gz

Current version: 1.0.247 (15-Nov-2008)

3. Content
----------

		
The BLAS Standard defines language bindings for Fortran 95, Fortran 77, and C.
Here, we have only implemented the C version and provided a method for binding
to one Fortran 77 ABI.

		
In this initial release, we provide the following 17 routines:
	
.Level 1
        * DOT (Inner product)
        * SUM (Sum)
        * AXPBY (Scaled vector accumulation)
        * WAXPBY (Scaled vector addition)

		
.Level 2
        * GEMV (General matrix vector product)
        * GBMV (Banded matrix vector product)
        * SYMV (Symmetric matrix vector product)
        * SBMV (Symmetric banded matrix vector product)
        * SPMV (Symmetric matrix vector product, packed format)
        * HEMV (Hermitian matrix vector product)
        * HBMV (Hermitian banded matrix vector product)
        * HPMV (Hermitian matrix vector product, packed format)
        * GE_SUM_MV (Summed matrix-vector product)
        * TRSV (Triangular solve)

		
.Level 3
	* GEMM (General matrix matrix product)
	* SYMM (Symmetric matrix matrix product)
	* HEMM (Hermitian matrix matrix product)
   
All have passed our systematic testing of all possible combinations of mixed
and extended precision.  We will eventually include everything in the
intersection of Chapter 2 and Chapter 4 with systematic testing.

		
3.1 Directory Structure
~~~~~~~~~~~~~~~~~~~~~~~

		
This release contains the following directories:

		
       doc/          - technical report

		
       m4/           - Directory where m4 macro files and support
                       routines are stored
       m4/dot       \
       m4/sum        } Directories for each function
       m4/...       /
       m4/test-dot  \
       m4/test-sum   } Directories for each test function
       m4/...       /

		
       src/          - Directory where C code is stored
       src/dot      \
       src/sum       } Target directories for C code
       src/...      /

		
       testing/      - Directory where C code for testing is stored
       testing/test-dot  - DOT test code and results
       testing/test-sum  - SUM test code and results
       testing/...

		
4. Installation
---------------

		
The reference XBLAS are built similarly to the reference BLAS and LAPACK.  The
current build system produces a static `libxblas.a`.

		
You need to provide a make.inc file in the source directory that defines the
compiler, optimization flags, and options for building a Fortran->C bridge.
Some examples are provided.
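
As a rough illustration only (the variable names are our guesses; start from
one of the shipped examples rather than from this sketch), a minimal make.inc
might look like:

[source,makefile]
---------------------------------------------------
# Illustrative sketch of a make.inc -- adjust to your toolchain.
CC      = gcc
CFLAGS  = -O2
FC      = gfortran
FFLAGS  = -O2
---------------------------------------------------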

		
Alternatively, you can use the configure script to attempt to produce a
make.inc that is appropriate for your system with your C (and optionally,
Fortran) compiler.  Run ./configure in the top-level directory.  This will
produce a make.inc file that you should check before proceeding.  To specify
particular compilers, do something like:
     
[source,makefile]
---------------------------------------------------
     CC=my_c_compiler FC=my_fortran_compiler ./configure
---------------------------------------------------
   
M4 is not necessary for the distributed archive; it is only necessary if you
modify the generator sources under m4/.  See README.devel for more
information.

		
The Fortran->C bridge uses details of a specific toolchain's binary interface,
in particular how the Fortran compiler mangles names.  See `src/f2c-bridge.h`
for the available options.  Most compilers can support different name mangling
schemes; be sure to use the *SAME* naming options for all your Fortran code.
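
As an illustration of what the bridge has to get right (a hand-written
sketch, not the generated code; the real scheme is selected by the macros in
`src/f2c-bridge.h`), consider a compiler that lower-cases names, appends one
underscore, and passes every argument by reference:

[source,c]
-------------------------------------------------------------------
/* "ddot_plain" is a hypothetical stand-in for a C entry point. */
double ddot_plain(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* The symbol a Fortran program calling DDOT_PLAIN(N, X, Y) will look
   for under "lower-case + trailing underscore" mangling; note that
   Fortran passes all arguments by reference. */
double ddot_plain_(const int *n, const double *x, const double *y)
{
    return ddot_plain(*n, x, y);
}
-------------------------------------------------------------------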

		
The Fortran->C bridge is included in `libxblas.a`.  Each of the bridge's object
files matches *-f2c.o, so you can extract them with ar if you need to share
one libxblas.a between multiple Fortran compilers.  Example steps to strip the
Fortran->C routines from libxblas.a and place them in a separate
libxblas-myfortran.a are as follows:

		
[source,shell]
---------------------------------------------------
     ar t libxblas.a | fgrep -- -f2c.o | xargs ar x libxblas.a
     ar ru libxblas-myfortran.a *-f2c.o
     ar d libxblas.a *-f2c.o
     rm *-f2c.o
---------------------------------------------------

		
5. Code Generation with M4 macro processor
------------------------------------------

		
In the existing BLAS, there are usually 4 routines associated with each
operation.  All input, output, and internal variables are single or double
precision and real or complex.  But under the new extended and mixed precision
rules (see Chapter 4 for details), the input, output and internal variables
may have different precisions and types. Therefore, the combination of all
these types results in many more routines associated with each operation. For
example, DOT will have 32 routines altogether, 4 ``standard'' versions (from
Chapter 2) and 28 mixed and extended precision versions (from Chapter 4).  In
addition, the 16 versions with extended precision support up to three internal
precisions that can be chosen at runtime.  We have automated the code and test
code generation as much as possible.  We use the M4 macro processor to
facilitate this task.
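
Concretely, the extended-precision versions carry a trailing argument that
selects the internal precision at runtime.  The call below is paraphrased
from the Chapter 4 C bindings (verify the header name, prototype, and enum
values against your installation):

[source,c]
-------------------------------------------------------------------
#include "blas_extended.h"  /* header name may differ per install */

void dot_with_extra_precision(int n, double alpha, const double *x,
                              double beta, const double *y, double *r)
{
    /* The last argument picks the internal precision; the standard
       names blas_prec_single, blas_prec_double, blas_prec_indigenous
       and blas_prec_extra. */
    BLAS_ddot_x(blas_no_conj, n, alpha, x, 1, beta, y, 1, r,
                blas_prec_extra);
}
-------------------------------------------------------------------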

		
The idea is to define a macro for each fundamental operation. The macro's
argument list contains the variables, accompanied by their types and
precisions. For example, for the operation c <- a + b, we define the following
macro:

		
----
ADD(c, c_type, a, a_type, b, b_type)
----

		
where each x_type can be one of:

		
	real_single
	real_double
	real_extra
	complex_single
	complex_double
	complex_extra

		
Inside the macro body, we use an ``if-test'' on c_type, a_type and b_type to
generate the appropriate code segment for ``plus''.  This is similar to
operator overloading in C++, but we do it manually.  All these if-tests are
evaluated at macro-evaluation time, and do not appear in the executable code.
Indeed, our goal was to produce efficient C code, which means minimizing
branches in inner loops.
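
For illustration, a call such as ADD(sum, real_double, sum, real_double,
prod, real_single) leaves behind nothing but ordinary C once M4 has picked
the branch; the rendering below is a hypothetical expansion, not XBLAS's
actual generated text:

[source,c]
-------------------------------------------------------------------
#include <stdio.h>

int main(void)
{
    float  prod = 0.5f;
    double sum  = 1.0;

    /* Hypothetical expansion of
       ADD(sum, real_double, sum, real_double, prod, real_single):
       the if-tests were resolved at M4 time, so no branch survives. */
    sum = sum + (double) prod;

    /* With c_type = real_extra the macro would instead emit the
       multi-statement double-double update described in Section 1. */
    printf("sum = %g\n", sum);
    return 0;
}
-------------------------------------------------------------------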

		
Other macros include SUB, MUL, DIV, DECLARE (variable declaration), ASSIGN,
etc.

		
Since these macros are shared among all the BLAS routines, we put them in a
common header file, named `cblas.m4.h`.  Each BLAS routine also has its own
macro file, such as `dot.m4`, `spmv.m4` and `gbmv.m4`, to generate the specific
functions.  All the macro files are located in the m4/ subdirectory.

		
For example, the inner loop of the M4 macro for the dot product is simply as
follows (the M4 parameters $2, $3, and $4 are types):

		
[source,c]
-------------------------------------------------------------------
        for (i = 0; i < n; ++i) {
            GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2) 
                  /* put ix-th element of vector x into x_ii */
            GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3) 
                  /* put iy-th element of vector y into y_ii */
            MUL(prod, $4, x_ii, $2, y_ii, $3) /* prod = x[i]*y[i] */
            ADD(sum, $4, sum, $4, prod, $4) /* sum = sum+prod */
            ix += incx;
            iy += incy;
        } /* endfor */
-------------------------------------------------------------------

		
The motivation for this macro-based approach is simplifying
software engineering. For example, the file `dot.m4`
of M4 macros for the dot product is 401 lines long (245 non-comment lines)
but expands into 11145 lines in 32 C subroutines implementing
different versions of DOT. Similarly, the macros for TRSV expand from
732 lines (454 non-comment lines) to 37099 lines in 32 C subroutines.
(This does not count the shared M4 macros in the file `cblas.m4.h`.)

6. Testing
----------
  
The goal of the testing code is to validate the underlying implementation.
The challenges are twofold: First, we must thoroughly test routines claiming
to use extra precision internally, where the test code is not allowed to
declare any extra precision variables or use any other extra precision
facilities not available to the code being tested. This requires great care in
generating test data.  Second, we must use M4 to automatically generate the
many versions of test code needed for the many versions of the code being
tested.

		
For each BLAS routine, we perform the following steps in the test code:

		
     1. Generate input scalars, vectors and arrays, according to the
        routine's specification, so that the result exposes the internal
        precision actually used.
     2. Call the BLAS routine.
     3. For each output, compute a ``test ratio'' of the computed
        error to the theoretical error bound, i.e.,

		
		| Computed_value - "True_value" | / Error_Bound

		
By design, the test ratio should be at most 1. A larger ratio indicates that
the computed result is either completely wrong, or not as accurate as claimed
in the specification.
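
In code, the per-output check amounts to the sketch below (the names are
ours, not the test suite's):

[source,c]
-------------------------------------------------------------------
#include <math.h>

/* Ratio of observed error to the theoretical bound; values above
   1.0 flag a failure.  PASS iff test_ratio(...) <= 1.0. */
static double test_ratio(double computed, double truth,
                         double error_bound)
{
    return fabs(computed - truth) / error_bound;
}
-------------------------------------------------------------------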

		
The following section will discuss how we generate ``good'' inputs in order to
reveal the internal precisions actually used.  For details, see the paper in
file `doc/report.ps`.

		
6.1 Testing DOT
~~~~~~~~~~~~~~~

		
DOT performs the following function:

		
		r <- beta * r_in + alpha * (SUM_{i=1,n} x_i*y_i)

		
Assume that the result `r_computed` is computed with

		
- precision `eps_int` internally,
- precision `eps_out` when the final result is rounded on output,
- underflow threshold `UN_int` internally,
- underflow threshold `UN_out` on output

		
and that additionally we compute a very accurate approximation `r_acc` with

		
- precision `eps_acc`  = 2^-106^   (double-double format)
- underflow threshold `UN_acc`  = 2^-968^

		
Then the error bound satisfies

		
	|r_computed - r_acc| <= (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc|  (*)
	                     = Scale

		
where

		
       	   S = |alpha| * (SUM_{i=1,n} |x_i|*|y_i|) + |beta|*|r_in|
           U = (|alpha|*n+2)*(UN_int + UN_acc) + UN_out
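
Put together, evaluating the bound (*) for one DOT output is a direct
transcription of the formulas above (the function and variable names here
are ours, for illustration):

[source,c]
-------------------------------------------------------------------
#include <math.h>

/* Scale = (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc|,
   with S and U as defined in the text. */
static double dot_error_bound(int n, double alpha, const double *x,
                              const double *y, double beta,
                              double r_in, double r_acc,
                              double eps_int, double eps_acc,
                              double eps_out, double un_int,
                              double un_acc, double un_out)
{
    double S = 0.0;
    for (int i = 0; i < n; ++i)
        S += fabs(x[i]) * fabs(y[i]);
    S = fabs(alpha) * S + fabs(beta) * fabs(r_in);

    double U = (fabs(alpha) * n + 2) * (un_int + un_acc) + un_out;

    return (n + 2) * (eps_int + eps_acc) * S + U
           + eps_out * fabs(r_acc);
}
-------------------------------------------------------------------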

		
Thus, we can confirm that `r_computed` has been computed as accurately as
claimed (i.e. with internal precision defined by `eps_int` and `UN_int`) by
testing whether the 

		
          test ratio = |r_computed - r_acc| / Scale 

		
is at most `1`.  Suppose that no underflow occurs, and that 

		
          eps_intX >> eps_int >= eps_acc,

		
where `eps_intX` is the internal precision actually used in some buggy version
of DOT that we want to test.  Then we can expect the test ratio to be as large
as

		
                        (n+2)*(eps_intX + eps_acc)*S + eps_out*|r_acc|
          test ratio ~  ----------------------------------------------
                        (n+2)*(eps_int  + eps_acc)*S + eps_out*|r_acc|

		
                     
If we can make `r_acc` very small, then this ratio will be roughly

          test ratio ~ eps_intX / eps_int >> 1

		
which means the bug in DOT will be detected, and in fact the test ratio
will actually tell us how much internal precision we effectively used.

		
Thus our goal is to pick test data `alpha`, `x(1:n)`, `y(1:n)`, `beta` and
`r_in` to make `|r_acc|` as small as possible, MUCH smaller than `S`, so that
the `eps_int` term dominates on the right of the inequality (*).  Otherwise
`eps_int` will be invisible in the error bound, and we cannot tell what
precision is used internally.

		
In our test generator, we choose input data `alpha`, `beta`, `r_in`, `x(1:n)` and
`y(1:n)` judiciously in order to cause as much cancellation in `r` as possible.

		
6.2 Choosing input data and computing r_acc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

		
The general approach is to choose some of the input values of `x(i)` and
`y(i)` so that the exact (partial) dot product of these values has a lot of
nonzero fraction bits, preferably at least 106. Then the remaining values of
`x(i)` and `y(i)` are chosen to cancel the previous bits as much as possible.
This latter computation seems to require high precision.

		
One possibility is to use an arbitrary precision package, such as MPFUN, but
our goal is to make this test code self-contained, using no extra precision
facilities not available to the code being tested.  Since most of our routines
can be reduced to a series of dot products, testing can be based on DOT.  We
only need a TRUSTED dot routine. Currently, we are using our own dot routine
with double-double internal precision to compute `r_acc`, which is accurate to
106 bits.  This means that any internal precision higher than double-double
cannot be detected, and may result in a tiny test ratio.  A very tiny test
ratio (such as zero) may also occur if the result happens to be computed
exactly.

		
This raises the possibility that there are ``matching'' bugs in our trusted DOT
and the DOT under test, so that bugs are missed.  To avoid this possibility we
also generate some test examples where the cancellation is done mathematically
(and so exactly) rather than depending on a computation. The idea is simple:
For example, choose `x(1:3)` and `y(1:3)` so that 

		
          x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)

		
so that `SUM_{i=1,3} x(i)*y(i) = x(2)*y(2)` exactly.
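
The toy program below (our own data, not the generator's) shows why such
inputs expose the internal precision: plain double accumulation loses
x(2)*y(2) entirely, while an implementation with genuine extra precision
retains it:

[source,c]
-------------------------------------------------------------------
#include <stdio.h>

int main(void)
{
    /* x(1)*y(1) = 1e16 and x(3)*y(3) = -1e16 cancel exactly by
       construction, so the true dot product is x(2)*y(2) = 2^-60. */
    double x[3] = { 1.0e8, 1.0,     -1.0e8 };
    double y[3] = { 1.0e8, 0x1p-60,  1.0e8 };

    double sum = 0.0;               /* plain double accumulation */
    for (int i = 0; i < 3; ++i)
        sum += x[i] * y[i];

    /* Prints 0 instead of ~8.7e-19: double precision loses the tiny
       term that an extra-precise DOT would retain. */
    printf("computed = %g, true = %g\n", sum, 0x1p-60);
    return 0;
}
-------------------------------------------------------------------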

		
6.3 Testing SPMV and GBMV
~~~~~~~~~~~~~~~~~~~~~~~~~
    
SPMV, GBMV, and many other Level 2 BLAS routines perform the following function:

		
		y <- beta * y + alpha * A * x

		
Testing it is no more difficult than testing DOT, because each component of
the computed y vector is a dot product, and satisfies the error bound (*). So
we simply use the same test generator as DOT, and compute a test ratio for
each component of the `y` vector. The only tricky part is that some entries in
each dot product may be fixed. For example, the first row of `A` and the vector
`x` can be chosen freely, but after that `x` is fixed and, if `A` is symmetric, the
first entry of each subsequent row is fixed. Our dot-product test generator
handles all these cases.
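
A row-at-a-time check in that spirit might look like the sketch below (the
names and the flat row-major layout of A are our illustrative choices):

[source,c]
-------------------------------------------------------------------
#include <math.h>
#include <stdio.h>

/* Check y_out <- beta*y_in + alpha*A*x one row at a time: component
   i is the DOT of row i of the m-by-n matrix A with x, so it obeys
   the bound (*).  y_acc holds accurate reference values; U is the
   underflow term from the text. */
void check_mv_rows(int m, int n, double alpha, const double *A,
                   int lda, const double *x, double beta,
                   const double *y_in, const double *y_out,
                   const double *y_acc, double eps_int,
                   double eps_acc, double eps_out, double U)
{
    for (int i = 0; i < m; ++i) {
        double S = 0.0;
        for (int j = 0; j < n; ++j)
            S += fabs(A[i * lda + j]) * fabs(x[j]);
        S = fabs(alpha) * S + fabs(beta) * fabs(y_in[i]);

        double bound = (n + 2) * (eps_int + eps_acc) * S
                       + U + eps_out * fabs(y_acc[i]);
        double ratio = fabs(y_out[i] - y_acc[i]) / bound;
        if (ratio > 1.0)
            printf("row %d: test ratio %g exceeds 1\n", i, ratio);
    }
}
-------------------------------------------------------------------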

		
This approach can be generalized to most other Level 2 and 3 BLAS.

		
7. Feedback
-----------

		
Please send any comments or bug reports to mailto:extended_blas@cs.berkeley.edu[].
This code was developed by

		
   * Xiaoye Li,
   * Jim Demmel, 
   * David Bailey,
   * Yozo Hida,
   * Jimmy Iskandar,
   * Anil Kapur,
   * Michael Martin,
   * Brandon Thompson,
   * Teresa Tung,
   * Daniel Yoo
   
with help from Ben Wanzo, Berkat Tung, Weihua Shen, Jason Riedy, and
Deaglan Halligan.
