Research Topics
  1. Computer Arithmetic and Hardware Implementation
  2. Computer Architecture/Organization and CPU Design
  3. Interconnection Networks and Fault-Tolerant Routing

Computer Arithmetic and Hardware Implementation

Computer arithmetic deals with mathematical functions that are suitable for hardware implementation. It is fundamental to the design of a processor's IU (integer unit) and FPU (floating-point unit).

The basic operations of computer arithmetic include addition, subtraction, multiplication, division, and square root. Square root algorithms and their hardware implementations have been receiving more attention in recent years in the design of general-purpose processors, CG (computer graphics) hardware, and embedded CPUs.

Let's take an example to see how to calculate the square root of a decimal number by hand: the square root of 30,000,000,000.

        _1__7__3__2__0__5_
    1  / 3 00 00 00 00 00               (1 is the largest number such that 1 * 1 < 3)
    1   -1                 <--- 1 * 1 = 1
    ----------------------
    27   2 00              <--- 1 + 1 = 2 (7 is the largest number such that 27 * 7 < 200)
     7  -1 89              <--- 27 * 7 = 189
    ----------------------
    343    11 00           <--- 27 + 7 = 34 (3 is the largest number such that 343 * 3 < 1100)
      3   -10 29           <--- 343 * 3 = 1029
    ----------------------
    3462      71 00        <--- 343 + 3 = 346 (2 is the largest number such that 3462 * 2 < 7100)
       2     -69 24        <--- 3462 * 2 = 6924
    ----------------------
    34640       1 76 00    <--- 3462 + 2 = 3464 (0 is the largest number such that 34640 * 0 < 17600)
        0            -0    <--- 34640 * 0 = 0
    ----------------------
    346405      1 76 00 00 <--- 34640 + 0 = 34640 (5 is the largest number such that 346405 * 5 < 1760000)
         5     -1 73 20 25 <--- 346405 * 5 = 1732025
    ----------------------
                   2 79 75 <--- remainder

That is, the square root of 30,000,000,000 is 173,205 and the remainder is 27,975.

The reason why this algorithm works is explained below. We separate the decimal number into groups of two digits each, starting from the decimal point (in our example, the left-most group, 3, can be considered as 03). The largest value of a group is 99 = 10 * 9 + 9. We write a two-digit group as AB, meaning AB = 10 * A + B, and denote the integer part of its square root by a; then a ≤ 9. Suppose we have calculated a and subtracted a^2 from AB, leaving the remainder R = AB - a^2.

Now consider ABCD - (ab)^2, where ab = 10 * a + b is the two-digit partial root. Since (ab)^2 = (10 * a + b)^2 = 100 * a^2 + 20 * a * b + b^2, we have

ABCD - (ab)^2 = (100 * AB + CD) - (100 * a^2 + 20 * a * b + b^2) = (100 * R + CD) - (10 * (a + a) + b) * b.

In other words, each step doubles the partial root (a + a), appends a trial digit b, multiplies by b, and subtracts the product from the current remainder with the next group brought down; b is the largest digit that keeps the result non-negative. In the second step of the example above, a = 1 and R = 2, so we look for the largest b with (10 * 2 + b) * b ≤ 200, which gives b = 7. This procedure is repeated until all the groups have been processed.
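
This recurrence translates directly into code. The following is a minimal C sketch of the base-100 longhand method; the function and variable names are ours, for illustration only.

    // Digit-by-digit square root in base 100, following the longhand
    // method above. Returns the integer root; *rem receives the remainder.
    unsigned long isqrt_longhand(unsigned long long d, unsigned long long *rem)
    {
       unsigned long long groups[10];   // two-digit groups, least significant first
       unsigned long long r = 0, q = 0, b;
       int n = 0, i;

       do {                             // split d into groups of two decimal digits
          groups[n++] = d % 100;
          d /= 100;
       } while (d > 0);

       for (i = n - 1; i >= 0; i--) {
          r = r * 100 + groups[i];      // bring down the next group
          b = 9;                        // largest b with (10*(q+q) + b) * b <= r
          while ((20 * q + b) * b > r)
             b--;
          r -= (20 * q + b) * b;        // subtract; r is the new remainder
          q = 10 * q + b;               // append the new root digit
       }
       *rem = r;
       return (unsigned long)q;
    }

For example, isqrt_longhand(30000000000ULL, &rem) returns 173205 with rem = 27975, matching the hand calculation above.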

We have proposed a non-restoring (on the remainder) square root algorithm for binary numbers. The following is a C version of the algorithm.

    // Non-restoring square root
    // D: 32-bit unsigned integer to be square rooted; *r receives the remainder
    unsigned squart(unsigned D, int *r)
    {
       unsigned Q = 0;    // Q:16-bit unsigned integer (root)
       int R = 0;         // R:17-bit integer (remainder)
       int i;

       for (i = 15; i >= 0; i--) {                // for each root bit
          if (R >= 0) {                           // remainder: non-negative
             R = (R << 2) | ((D >> (i + i)) & 3); // new remainder
             R = R - ((Q << 2) | 1);              // R = R - Q01
          } else {                                // remainder: negative
             R = (R << 2) | ((D >> (i + i)) & 3); // new remainder
             R = R + ((Q << 2) | 3);              // R = R + Q11
          }
          if (R >= 0) Q = (Q << 1) | 1;           // new Q bit 1
          else        Q = (Q << 1) | 0;           // new Q bit 0
       }
                    
       if (R < 0) R = R + ((Q << 1) | 1); // remainder adjusting
       *r = R;                            // return remainder
       return(Q);                         // return root
    }

Here is an 8-bit example (the same algorithm with i starting at 3): D = 01111111 (127). We get Q = 1011 (11) and R = 0110 (6). That is, 127 = 11^2 + 6.

    D = 01 11 11 11   R = 0    Q = 0000 R: non-negative
    R = 01
      - 01           (-Q01)
    R = 00 11                  Q = 0001 R: non-negative
      - 01 01        (-Q01)
    R = 11 10 11               Q = 0010 R: negative
      + 00 10 11     (+Q11)
    R = 00 01 10 11            Q = 0101 R: non-negative
      - 00 01 01 01  (-Q01)
    R = 00 00 01 10            Q = 1011 R: non-negative
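
As a quick check of the trace above, a hypothetical test harness (not part of the original code) can call squart() directly; note that the code relies on an arithmetic left shift of the negative remainder R, which common compilers provide.

    #include <stdio.h>

    int main(void)
    {
       int r;
       unsigned q = squart(127, &r);                 // leading zero groups just produce 0 bits
       printf("sqrt(127): Q = %u, R = %d\n", q, r);  // prints Q = 11, R = 6
       return 0;
    }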

We have developed a VLSI (very large scale integration) implementation of the non-restoring square root algorithm. To our knowledge, it is the simplest such implementation reported.

Unlike other square root algorithms, ours has the following features. First, the ``non-restoring'' applies to the partial remainder, not to each bit of the square root, in each iteration. Second, it requires only one traditional adder/subtractor per iteration; it does not require other hardware components such as seed generators, multipliers, or even multiplexers. Third, it generates the correct result down to the last bit position. Fourth, based on the result of the last bit, a precise remainder is obtained immediately, without any correction or addition operation. Finally, it can be implemented at a very fast clock rate because of the very simple operations in each iteration.

References

  1. ``A New Non-Restoring Square Root Algorithm and Its VLSI Implementations'', Proceedings of International Conference on Computer Design -- VLSI in Computers and Processors, Oct. 7-9, 1996, Austin, Texas, U.S.A. IEEE Computer Society Press, pp.538-544. ICCD96.pdf
  2. ``Implementation of Single Precision Floating Point Square Root on FPGAs'', IEEE Symposium on FPGAs for Custom Computing Machines, Apr.16-18, 1997, Napa, California, U.S.A. IEEE Computer Society Press, pp.226-232. FCCM97.pdf
  3. ``Parallel-Array Implementations of A Non-Restoring Square Root Algorithm'', Proceedings of International Conference on Computer Design -- VLSI in Computers and Processors, Oct. 12-15, 1997, Austin, Texas, U.S.A. IEEE Computer Society Press, pp.690-695. ICCD97.pdf
  4. ``Cost/Performance Tradeoff of n-Select Square Root Implementations'', Australian Computer Science Communications, Vol.22, No.4, IEEE Computer Society Press, 2000, pp.9-16. ACAC2000.pdf

Computer Architecture/Organization and CPU Design

Computer architecture is concerned with the structure and behavior of the various functional modules of a computer and how they interact to meet the processing needs of the user. Computer organization is concerned with the way the hardware components are connected together to form a computer system. Computer design is concerned with the development of the computer's hardware, taking a given set of specifications into account.

The CPU (central processing unit), or processor, is a key component in computer design. Techniques used to design a CPU include pipelining, superscalar execution, multithreading, and multi-core (chip multiprocessor) organization; the toy sketch below illustrates the first of these.
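
As a toy illustration of pipelining (our own sketch, not taken from the referenced papers): in an ideal 5-stage pipeline, instruction i occupies stage s during cycle i + s, so k instructions complete in k + 4 cycles instead of 5 * k.

    #include <stdio.h>

    // Print which stage each instruction occupies in every cycle of an
    // ideal 5-stage pipeline (no stalls, no hazards).
    int main(void)
    {
       const char *stage[5] = { "IF", "ID", "EX", "MEM", "WB" };
       int k = 4;                            // number of instructions
       int cycle, i;

       for (cycle = 0; cycle < k + 4; cycle++) {
          printf("cycle %d:", cycle + 1);
          for (i = 0; i < k; i++) {
             int s = cycle - i;              // stage of instruction i this cycle
             if (s >= 0 && s < 5)
                printf("  I%d:%s", i + 1, stage[s]);
          }
          printf("\n");
       }
       return 0;
    }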

We have developed several models and prototypes in this field.

References

  1. ``The Effects of STEF in Finely Parallel Multithreaded Processors'', Proceedings of the First IEEE Symposium on High-Performance Computer Architecture, January 22-25, 1995, North Carolina, U.S.A. IEEE Computer Society Press, pp.318-325. HPCA95.pdf
  2. ``Design and Implementation of a Multiple-Instruction-Stream Multiple-Execution-Pipeline Architecture'', International Conference on Parallel and Distributed Computing and System, Oct. 19-21, 1995, Washington D.C., U.S.A. pp.477-480. PDCS95.pdf
  3. ``Using Computer Architecture/Organization at the University of Aizu'', Second Workshop on Computer Architecture Education, Feb. 3, 1996, San Jose, California, U.S.A. Also see ``Using FPGA for Computer Architecture/Organization Education'', IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, June 1996, IEEE Computer Society Press, pp.31-35. TCCA96JUNE.pdf
  4. ``Aizup -- A Pipelined Processor Design and Implementation on XILINX FPGA Chip'', Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, Apr.17-19, 1996, Napa, California, U.S.A. IEEE Computer Society Press, pp.98-106. FCCM96.pdf
  5. ``Implementation of Single Precision Floating Point Square Root on FPGAs'', IEEE Symposium on FPGAs for Custom Computing Machines, Apr.16-18, 1997, Napa, California, U.S.A. IEEE Computer Society Press, pp.226-232. FCCM97.pdf
  6. ``Memory Centric Interconnection Mechanism for Message Passing in Parallel Systems'', Third International Conference on Massively Parallel Computing Systems, Broadmoor Hotel, Colorado Springs, Colorado, U.S.A., April 6-9, 1998. MPCS98.pdf
  7. ``JAViR -- Exploiting Instruction Level Parallelism for JAVA Machine by Using Virtual Registers'', The Second European Parallel and Distributed Systems Conference, Vienna, Austria, July 1-3, 1998. pp.80-86. EuroPDS98.pdf
  8. ``A Model for Predicting Utilization of Multiple Pipelines in MTMP Architecture'', International Journal of Modelling and Simulation, Volume 18, Issue 3, 1998. pp.201-207. IJMS98.pdf
  9. ``Parallelism of Java Bytecode Programs and a Java ILP Processor Architecture'', Australian Computer Science Communications, Vol.21, No.4, Springer-Verlag Singapore, 1999. pp.75-84. ACAC99.pdf
  10. ``Exploiting Java Instruction/Thread Level Parallelism with Horizontal Multithreading'', Australian Computer Science Communications, Vol.23, No.4, IEEE Computer Society Press, 2001. ACSAC01.pdf
  11. ``An Instruction Cache Architecture for Parallel Execution of Java Threads'', Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'03), Aug. 27-29, 2003, Chengdu, China. pp.180-187. PDCAT03-cache.pdf
  12. Second Workshop on Computer Architecture Education (WCAE-2), at HPCA-2, February 1996.

Interconnection Networks and Fault-Tolerant Routing

Interconnection networks are used in both parallel processing and communication systems to exchange information. Popular topologies include the ring, mesh, torus, and hypercube; a small routing sketch for the hypercube is given below.
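
As a minimal illustration (our own sketch of plain dimension-order routing, not the fault-tolerant algorithms of the papers below): in an n-dimensional hypercube, each node has an n-bit address and is linked to the n nodes whose addresses differ in exactly one bit, so correcting the differing bits one dimension at a time yields a path whose length equals the Hamming distance.

    #include <stdio.h>

    // Dimension-order (e-cube) routing in an n-dimensional hypercube:
    // flip the differing address bits from dimension 0 upward.
    void ecube_route(unsigned src, unsigned dst, int n)
    {
       unsigned node = src;
       int d;

       printf("%u", node);
       for (d = 0; d < n; d++) {
          unsigned diff = (node ^ dst) & (1u << d);
          if (diff) {
             node ^= diff;                  // hop to the neighbor in dimension d
             printf(" -> %u", node);
          }
       }
       printf("\n");
    }

    int main(void)
    {
       ecube_route(0, 5, 3);  // 3-cube: prints 0 -> 1 -> 5 (bits 0 and 2 corrected)
       return 0;
    }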

References

  1. ``Dual-Cubes: A New Interconnection Network for High-performance Computer Clusters'', Workshop on Computer Architecture, International Computer Symposium 2000, December 6-8, 2000, National Chung Cheng University, ChiaYi, Taiwan. ICS2000.pdf
  2. ``Fault-tolerant Routing and Disjoint Paths in Dual-cube: a New Interconnection Network'', Proceedings of the 2001 International Conference on Parallel and Distributed Systems (ICPADS'2001), IEEE Computer Society Press, 2001, pp.315-322. ICPADS01.pdf
  3. ``Algorithms of Routing and Matrix Multiplication on Dualcube'', Proceedings of the Second International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD'01), Nagoya, Japan, Aug., 2001, pp.422-429. SNPD01.pdf
  4. ``Efficient Collective Communications in Dual-cube'', Proceedings of the Thirteenth IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS-2001), Anaheim, USA, Aug., 2001, pp.266-271. PDCS01.pdf
  5. ``Metacube -- A New Interconnection Network for Large Scale Parallel Systems'', Australian Computer Science Communications, Vol.24, No.3, 2002, Australian Computer Society, pp.29-36. ACSAC02.pdf
  6. ``Efficient Communication in Metacube: A New Interconnection Network'', Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN 2002), Manila, Philippines, May 2002, IEEE Computer Society Press, pp.165-170. ISPAN02.pdf
  7. ``Multinode Broadcasting in Metacube'', Proceedings of the 3rd ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD'02), Madrid, Spain, June 26-28, 2002, pp.401-408. SNPD02.pdf
  8. ``Fault-tolerant Routing in Metacube'', Proceedings of the Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), Kanazawa Bunka Hall, Kanazawa, Japan, September 2002, pp.343-350. PDCAT02.pdf
  9. ``Hamiltonian Cycle Embedding for Fault Tolerance in Dual-cube'', Proceedings of the IASTED International Conference on Networks, Parallel and Distributed Processing, and Applications (NPDPA 2002), Tsukuba, Japan, October 2002, pp.1-6. NPDPA02.pdf
  10. ``Fault-Tolerant Cycle Embedding in Dual-Cube with Node Faulty'', Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'03), Aug. 27-29, 2003 Chengdu, China. pp.71-78. PDCAT03-dualcube.pdf
  11. ``Disjoint Paths in Metacube'', Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS-2003), Marina del Rey, CA, USA, Nov. 3-5, 2003, pp.43-50. PDCS03.pdf
  12. ``Efficient Collective Communications in Dual-cube'', The Journal of Supercomputing, Volume 4, Issue 1, Apr., 2004, pp.71-90. http://www.springerlink.com/
  13. ``Adaptive-Subcube Fault Tolerant Routing in Dual-Cube with Very Large Number of Faulty Nodes'', Proceedings of the ISCA 17th International Conference on Parallel and Distributed Computing Systems, San Francisco, California, U.S.A., September, 2004, pp.222-228. PDCS04.pdf
  14. ``An Efficient Algorithm for Fault Tolerant Routing Based on Adaptive Binomial-Tree Technique in Hypercubes'', Proceedings of the Fifth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'04), Dec. 8-10, 2004 Singapore. pp.196-201. PDCAT04.pdf
  15. ``Binomial-Tree Fault Tolerant Routing in Dual-Cubes with Large Number of Faulty Nodes'', Proceedings of the International Symposium on Computational and Information Sciences (CIS'04). Shanghai, China, December 16-18, 2004. pp.51-56. CIS04.pdf