Glossary

Here is a selection of terms commonly used in the world of HPC, as well as terms and product names unique to ClearSpeed.

A

Advance e620 is a double-precision, IEEE 754 compliant floating-point, PCIe based accelerator board from ClearSpeed for systems whose architecture incorporates the PCIe standard.

Advance X620 is a double-precision, IEEE 754 compliant floating-point, standard height, two-thirds length PCI-X accelerator board from ClearSpeed for systems whose architecture incorporates the PCI-X standard.

Aggregate types are types derived from other types (e.g., arrays, structures, unions or pointers).  These types may hold several data objects.  For examples, see Chapter 5 of the SDK Getting Started Guide.

AMBER (an acronym for Assisted Model Building with Energy Refinement) is a molecular dynamics simulation package that has been accelerated by ClearSpeed.

ALU (an acronym for Arithmetic Logic Unit) is a digital circuit that performs arithmetic and logical operations. The ALU is a fundamental building block of the CPU, and even the simplest microprocessors contain one for purposes such as maintaining timers.

ASIC (an acronym for application-specific integrated circuit) is an integrated circuit (IC) customized for a particular use.

B

Basic Linear Algebra Subprograms (BLAS) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations.  The CSXL library implements an accelerated version of BLAS and supports functions such as BLAS Level 3 (DGEMM.)  See CSXL.
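
As a rough illustration of the three levels, the following Python sketch gives unoptimized reference versions of one operation from each. The function names echo the double-precision BLAS routines DDOT, DGEMV and DGEMM, but the signatures are simplified for this example (the real routines also take scaling factors, strides and transpose options), and nothing here reflects how CSXL implements them.

```python
# Unoptimized reference versions of one operation per BLAS level.
# Signatures are simplified assumptions, not the real BLAS interfaces.

def ddot(x, y):
    """Level 1: vector-vector operation (dot product)."""
    return sum(xi * yi for xi, yi in zip(x, y))

def dgemv(a, x):
    """Level 2: matrix-vector operation, y = A * x."""
    return [ddot(row, x) for row in a]

def dgemm(a, b):
    """Level 3: matrix-matrix operation, C = A * B."""
    cols_b = list(zip(*b))  # columns of B
    return [[ddot(row, col) for col in cols_b] for row in a]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.0, 1.0], [1.0, 0.0]]
x = [1.0, 1.0]

print(ddot(x, x))   # 2.0
print(dgemv(a, x))  # [3.0, 7.0]
print(dgemm(a, b))  # [[2.0, 1.0], [4.0, 3.0]]
```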

Basic type is a variable which stores a single data object (e.g.: a char, an int.)

Bristol University Docking Engine (Bude) is a generic molecular docking program that is under development.

C

ClearConnect Bus (CCB) is a scalable, high-bandwidth bus that connects the MTAP processors, coprocessors and memory.

ClearConnect Bus Bridge Port (CCBR) is used to connect CSX processors to each other or to FPGAs. The CCBR effectively joins the CCB in each device. The CCBR is a DDR half-duplex connection.

ClearSpeed CSX600 is a 64-bit floating point, embedded parallel processor with 96 cores that executes up to 25 billion 64-bit floating-point operations per second (FLOPS) while drawing less than 10 watts of power on average.

ClearSpeed discrete Fourier transform (CSDFT) library provides acceleration for FFT functions.  It consists of two parts: a host library and a library for the ClearSpeed accelerator board.  These two libraries have a common API and a number of functions in common.  For more information, see the ClearSpeed Accelerated DFT Library Reference Manual.

Cn is a ClearSpeed dialect of ANSI C with extensions for data-parallel programming. The main addition to standard C is the definition of mono (scalar) and poly (parallel) data types.  For an overview comparing Cn to ANSI C, see Chapter 5 of SDK Getting Started Guide.

CSXL is the ClearSpeed math library; it provides accelerated versions of a number of standard math functions for use with the ClearSpeed accelerator boards.  The supported routines are a subset of the BLAS and LAPACK libraries.  For instructions on using the CSXL application acceleration library, see the CSXL User Guide.

Cycle Accurate Simulator (casim) is a highly accurate model which includes cycle timing behavior and can be used for detailed profiling of your application code. In contrast, the Instruction Set Simulator (isim) is usually much faster but is not cycle-accurate.

D

DDR2 SDRAM (an acronym for double data rate two synchronous dynamic random access memory) is a random access memory technology used for high speed storage of the working data of a computer.  Its primary benefit is the ability to run its bus at twice the speed of the memory cells it contains, thus enabling faster bus speeds and higher peak throughputs than earlier technologies.

DGETRF and DGETRS are the LAPACK routines that factor and solve a real double precision general system of linear equations using the LU method.
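
The factor-then-solve split can be sketched in a few lines of Python. This is a minimal illustration of LU factorization with partial pivoting, not the LAPACK code: the function names lu_factor and lu_solve are made up for this example, and the pivot list is stored as a plain row permutation rather than in LAPACK's interchange format.

```python
def lu_factor(a):
    """Return (lu, piv) with P*A = L*U, using partial pivoting.

    lu stores U on and above the diagonal and the multipliers of L
    (unit diagonal implied) below it. piv[i] is the original row
    that ended up in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    piv = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv

def lu_solve(lu, piv, b):
    """Solve A*x = b using the factors from lu_factor."""
    n = len(lu)
    x = [b[piv[i]] for i in range(n)]  # apply the row permutation
    for i in range(n):                 # forward substitution with L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):       # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x

lu, piv = lu_factor([[2.0, 1.0], [4.0, 3.0]])
print(lu_solve(lu, piv, [3.0, 7.0]))
```

Because the factorization is kept separate from the solve, the same factors can be reused for several right-hand sides.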

Direct Memory Access (DMA) is a fast way of transferring data within (and sometimes between) computers; usually characterized by peripheral devices transferring data to and from system memory without involving the central processor.  The on-chip DMA controller can be programmed to transfer data to and from the external memory interface and any other device on the ClearConnect bus.

DORGQR is the LAPACK routine that generates all or part of the orthogonal matrix Q from a QR factorization computed by DGEQRF.

DORMQR is the LAPACK routine that multiplies a matrix by the orthogonal matrix Q from a QR factorization computed by DGEQRF.

Double Data Rate (DDR) is a data transfer method in which two data entities are transferred per clock cycle. It is used by both the CSX600 memory interface and the ClearConnect bridge ports that connect CSX600 chips together.

Double precision floating point is an IEEE 754 format for encoding floating-point numbers that uses 8 bytes.
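
To make the 8-byte layout concrete, the sketch below packs a Python float into its IEEE 754 double encoding and splits it into the sign bit, the 11-bit exponent and the 52-bit fraction. The helper name double_fields is made up for this example.

```python
import struct

def double_fields(value):
    """Return (byte_count, sign, biased_exponent, fraction) for a double."""
    raw = struct.pack(">d", value)       # big-endian IEEE 754 double
    bits = int.from_bytes(raw, "big")
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52 bits
    return len(raw), sign, exponent, fraction

print(double_fields(1.0))   # (8, 0, 1023, 0)
print(double_fields(-2.0))  # (8, 1, 1024, 0)
```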

Double-precision General Matrix Multiply (DGEMM) is the BLAS level-3 routine that multiplies a real double precision matrix by a real double precision matrix. It is also an important routine in the LINPACK benchmark.

DPOTRF and DPOTRS are the LAPACK routines that factor and solve a real double precision symmetric positive definite system of linear equations using the Cholesky method.

DTRSM is the BLAS level-3 routine that solves a real double precision triangular system of equations with multiple right-hand-sides. This routine constitutes a large percentage of the computation done in the LAPACK routines that factor and solve a general system of linear equations, respectively DGETRF and DGETRS.
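
A minimal Python sketch of the underlying operation, assuming the simplest case only: a lower-triangular matrix on the left, no transposition and no scaling factor. The real DTRSM covers more variants through its side, uplo, transa, diag and alpha arguments, and the function name here is made up for this example.

```python
def solve_lower_triangular(l, b):
    """Solve L * X = B by forward substitution.

    l is an n x n lower-triangular matrix; b is n x nrhs, with one
    column per right-hand side. Returns X with the same shape as b.
    """
    n = len(l)
    nrhs = len(b[0])
    x = [row[:] for row in b]
    for col in range(nrhs):          # each right-hand side is independent
        for i in range(n):
            s = x[i][col]
            for j in range(i):
                s -= l[i][j] * x[j][col]
            x[i][col] = s / l[i][i]
    return x

l = [[2.0, 0.0], [1.0, 1.0]]
b = [[2.0, 4.0], [3.0, 5.0]]
print(solve_lower_triangular(l, b))  # [[1.0, 2.0], [2.0, 3.0]]
```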

E

Error checking and correcting (ECC) is a method of detecting and correcting errors in a computer's memory subsystem.

Embarrassingly parallel problem is a computational problem for which no particular effort is needed to divide the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks.  Some examples of embarrassingly parallel problems include:
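
As a minimal sketch of the idea, applying the same function independently to every item of an input set fits this definition, because no task needs a result from any other. The Python example below uses a thread pool from the standard library purely to show the structure; the task function simulate is a made-up stand-in, and the example makes no claim about speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Stand-in for an independent task: uses only its own input."""
    return seed * seed + 1

def run_all(seeds, workers=4):
    """Run every task with no communication between tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))  # results keep input order

print(run_all(range(5)))  # [1, 2, 5, 10, 17]
```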

F

FLoating point Operations Per Second (FLOPS) is a measure of a computer's performance, especially in fields of scientific calculations that make heavy use of floating point calculations; similar to instructions per second.

In order for FLOPS to be useful as a measure of floating-point performance, a standard benchmark must be available on all computers of interest. One example is the LINPACK benchmark.

Floating point unit (FPU) is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root.

 

G

General Matrix Multiply (GEMM) is a subroutine in the Basic Linear Algebra Subprograms (BLAS) that performs matrix multiplication (i.e.: the multiplication of two matrices.) This includes DGEMM, for double-precision.

GigaFLOPS is a measure of a computer's speed; one billion floating point operations per second (see FLOPS).
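
To make the unit concrete, the sketch below converts an operation count and an elapsed time into billions of floating point operations per second. It uses the conventional count of 2*m*n*k operations for multiplying an m x k matrix by a k x n matrix; the function names and the 0.5 second timing are hypothetical values chosen for this example, not measurements.

```python
def gemm_flop_count(m, n, k):
    """Floating point operations in an (m x k) by (k x n) product."""
    return 2 * m * n * k  # one multiply and one add per inner step

def gigaflops(flop_count, seconds):
    """Convert an operation count and elapsed time to 10**9 ops/second."""
    return flop_count / seconds / 1e9

flops = gemm_flop_count(1000, 1000, 1000)
print(gigaflops(flops, 0.5))  # 4.0, for a hypothetical 0.5 s run
```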

H

Host/debug port (HDP) is a host interface on the CSX600 processor that allows the CSX600 to communicate with, and be controlled by, the system's host processor. This port can also be used as a hardware and software debug port as it provides full access to all the internal registers on the device.

I

IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely-used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including negative zero and denormal numbers) and special values (infinities and NaNs) together with a set of floating-point operations that operate on these values.

For more information see the IEEE 754 home at grouper.ieee.org/groups/754/
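
The special values named above can be observed directly in Python, assuming (as is the case on mainstream platforms) that its float type is an IEEE 754 double. This is only a small demonstration of the behavior, not a statement about any particular hardware.

```python
import math

neg_zero = -0.0
inf = float("inf")
nan = float("nan")
denormal = 5e-324  # smallest positive denormal (subnormal) double

# Negative zero compares equal to zero but keeps its sign.
print(neg_zero == 0.0, math.copysign(1.0, neg_zero))  # True -1.0

# Infinities order beyond every finite value.
print(inf > 1e308, -inf < -1e308)                     # True True

# A NaN is not equal to anything, including itself.
print(nan == nan, math.isnan(nan))                    # False True

# Denormals fill the gap between zero and the smallest normal number.
print(denormal > 0.0, denormal / 2 == 0.0)            # True True
```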

Instruction Set Simulator (isim) is a functionally accurate simulation model of the CSX600.

J

K

L

Linear Algebra PACKage (LAPACK) is a software library for numerical computing that depends on BLAS in order to effectively exploit the caches on modern cache-based architectures.  The CSXL library implements an accelerated version of LAPACK and supports functions such as DGETRF and DGESV.  See BLAS and CSXL.

LINPACK is a linear algebra software package that makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.  LINPACK is also a benchmark derived from it that consists of solving a dense system of linear equations. The LINPACK benchmark has different versions, according to the size of the system solved. The TOP500 ranking uses a version where the chosen system size is large enough to get maximum performance.  For more information, read the ClearSpeed Accelerated LINPACK performance paper.

M

Mono execution unit in the CSX600 processes mono (scalar or non-parallel) data and handles program flow control such as branching and thread switching.

mono is the Cn multiplicity specifier keyword that specifies that the object exists in the mono domain (i.e., as a single instance).  For example, mono int a; is equivalent to int a;  See multiplicity specifier and poly.

Mono memory is memory associated with mono data; also referred to as local memory.  There is one instance of the memory accessible by all PEs.  This memory may be on chip and/or on the same PCB card as the CSX processor.

Mono variable is a variable that has one instance.  This can be a basic or aggregate type.

Multiple Instruction stream, Multiple Data stream (MIMD) is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently.

Multiplicity specifier provides the means of representing poly data by allowing the programmer to specify the domain in which the declaration will exist: mono or poly.   The default multiplicity is mono.  See mono and poly; and for a fuller explanation with examples, see Chapter 5 of SDK Getting Started Guide.

Multiply-Accumulate (MAC) is a common operation that computes the product of two numbers and adds that product to an accumulator.  A MAC-unit consists of a multiplier implemented in combinational logic followed by an adder and an accumulator register which stores the result when clocked. The output of the register is fed back to one input of the adder, so that on each clock the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers.
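
In software terms, a dot product is a chain of multiply-accumulate steps. The Python sketch below shows that pattern; the function name is made up for this example, and it models only the arithmetic, not the timing or rounding behavior of a hardware MAC unit.

```python
def mac_dot(xs, ys):
    """Dot product written as repeated multiply-accumulate steps."""
    acc = 0.0                # plays the role of the accumulator register
    for x, y in zip(xs, ys):
        acc += x * y         # one MAC: multiply, then add to accumulator
    return acc

print(mac_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```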

Multi-Threaded Array Processor (MTAP) contains an array of 96 Processing Elements (PEs) operating in SIMD.  For more information, see ClearSpeed's CSX Processor Architecture white paper.

N

O

P

PAPI (an acronym for Performance Application Programming Interface) aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.

PetaFLOP is a measure of a computer's processing speed and can be expressed as a thousand trillion floating point operations per second.

Peripheral Virtual Component Interface (PVCI) is a standard intended to simplify the interfacing of peripheral cores to on-chip buses in a system-on-a-chip, by standardizing the interface between a core's internals and its bus wrapper.  See VCI.

poly is the Cn multiplicity specifier keyword for defining poly (parallel) data types.  For example poly int b;  See multiplicity specifier and mono.

Poly execution unit is a SIMD array of 96 PEs that operate in a synchronous manner where each PE executes the same instruction on its piece of data.

Poly memory (also known as PE memory) is memory associated with poly data. Each PE has its own local block of poly memory; each instance of poly memory is only visible to the corresponding PE.  Note, mono and poly memory are two physically distinct memory spaces, with their own memory maps.

Poly variable has many instances with, typically, different data values on each poly Processing Element (PE.)  This can be a basic or aggregate type.
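
The difference between mono and poly variables can be emulated in ordinary Python. This is not Cn code: the Poly class is a made-up model that holds one value per PE, and it assumes that a mono operand is applied identically on every PE. The count of 96 PEs is taken from the MTAP entry.

```python
NUM_PES = 96  # PE count given in the MTAP entry

class Poly:
    """Emulates a poly variable: one instance of the value per PE."""

    def __init__(self, values):
        self.values = list(values)
        if len(self.values) != NUM_PES:
            raise ValueError("a poly variable needs one value per PE")

    def __add__(self, other):
        if isinstance(other, Poly):
            # poly + poly: each PE adds its own pair of values
            return Poly(a + b for a, b in zip(self.values, other.values))
        # poly + mono: the single mono value is used on every PE
        return Poly(a + other for a in self.values)

pe_index = Poly(range(NUM_PES))  # poly: a different value on each PE
offset = 10                      # mono: a single instance
result = pe_index + offset       # same instruction, different data

print(result.values[:4])  # [10, 11, 12, 13]
```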

Processing Element (PE) is the basic PE array building block. A PE has an ALU, a 32- and 64-bit FPU, a MAC, a register file, memory and I/O.

Programmable Logic Device (PLD) is an electronic component used to build digital circuits.

Programmed Input/Output (PIO) is a mechanism for transferring data between PE (poly) memory and main (mono) memory.

Q

R

S

Single Instruction, Multiple Data (SIMD) is a technique employed to achieve data level parallelism, as in a vector or array processor.  First made popular in large-scale supercomputers (as opposed to MIMD parallelization), smaller-scale SIMD operations have now become widespread in personal computer hardware.

Single precision floating point is an IEEE 754 format for encoding floating-point numbers that uses 4 bytes.

Streaming SIMD Extensions (SSE) is a SIMD (Single Instruction, Multiple Data) instruction set designed by Intel.  SSE contains 70 new instructions including both scalar and packed floating point instructions.

System-on-a-chip (SoC) refers to integrating all components of a computer or other electronic system into a single integrated circuit (chip.)

T

TeraFLOP is a measure of a computer's speed; a trillion floating point operations per second.

U

V

Virtual Component Interface (VCI) is a proposed standard interface between a core's internals and a core's bus wrapper.  The VCI is a far simpler protocol than a typical bus protocol, since it is a point-to-point transfer protocol.

W

X

Y

Z

ZGEMM is the BLAS level-3 routine that multiplies a complex double precision matrix by a complex double precision matrix.

ZGEMM3M is an implementation of ZGEMM that requires 3 real matrix multiplications and 5 real matrix additions to compute the complex matrix product; standard ZGEMM uses 4 real matrix multiplications and 2 real matrix additions. ZGEMM3M may be faster than the standard implementation under certain circumstances.
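
The operation counts can be checked with a short Python sketch. It assumes the usual "3M" formulation, in which the complex matrices are split into real and imaginary parts and one extra product of sums replaces two of the four real products. The function names are made up for this example, and this is not the ClearSpeed implementation.

```python
def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matsub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def complex_matmul_4m(ar, ai, br, bi):
    """(Ar + i*Ai)(Br + i*Bi): 4 real multiplications, 2 real additions."""
    cr = matsub(matmul(ar, br), matmul(ai, bi))
    ci = matadd(matmul(ar, bi), matmul(ai, br))
    return cr, ci

def complex_matmul_3m(ar, ai, br, bi):
    """Same product with 3 real multiplications and 5 real additions."""
    t1 = matmul(ar, br)                          # multiplication 1
    t2 = matmul(ai, bi)                          # multiplication 2
    t3 = matmul(matadd(ar, ai), matadd(br, bi))  # additions 1-2, multiplication 3
    cr = matsub(t1, t2)                          # addition 3
    ci = matsub(matsub(t3, t1), t2)              # additions 4-5
    return cr, ci

ar, ai = [[1, 2], [3, 4]], [[0, 1], [1, 0]]
br, bi = [[2, 0], [1, 1]], [[1, 1], [0, 2]]
print(complex_matmul_3m(ar, ai, br, bi) == complex_matmul_4m(ar, ai, br, bi))  # True
```

The integer inputs make the comparison exact. In floating point, the two formulations can differ slightly in rounding.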