Advance e620 is a double-precision, IEEE 754 compliant floating-point, PCIe-based accelerator board from ClearSpeed for systems whose architecture incorporates the PCIe standard.
Advance X620 is a double-precision, IEEE 754 compliant floating-point, standard-height, two-thirds-length PCI-X accelerator board from ClearSpeed for systems whose architecture incorporates the PCI-X standard.
Aggregate types are types derived from other types (e.g., arrays, structures, unions or pointers). These types may hold several data objects. For examples, see Chapter 5 of the SDK Getting Started Guide.
AMBER (an acronym for Assisted Model Building with Energy Refinement) is a molecular dynamics simulation package that has been accelerated by ClearSpeed.
ALU (an acronym for Arithmetic Logic Unit) is a digital circuit that performs arithmetic and logical operations. The ALU is a fundamental building block of the CPU, and even the simplest microprocessors contain one for purposes such as maintaining timers.
ASIC (an acronym for application-specific integrated circuit) is an integrated circuit (IC) customized for a particular use.
Basic Linear Algebra Subprograms (BLAS) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. The CSXL library implements an accelerated version of BLAS and supports functions such as the Level 3 routine DGEMM. See CSXL.
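As an informal illustration of the three BLAS levels described above, the following pure-Python sketch shows one representative operation per level. The helper names axpy, gemv and gemm are chosen here for illustration; they are simplified stand-ins and not the CSXL or reference BLAS interfaces, which also take scaling factors, strides and transpose options.

```python
# Pure-Python sketch of one representative operation per BLAS level.
# Matrices are stored as lists of rows. These helpers are illustrative
# only and do not reproduce the real BLAS calling conventions.

def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(a, x):
    """Level 2 (matrix-vector): return the product A*x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def gemm(a, b):
    """Level 3 (matrix-matrix): return the product A*B."""
    cols_b = list(zip(*b))  # columns of B
    return [[sum(aik * bkj for aik, bkj in zip(row, col)) for col in cols_b]
            for row in a]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
x = [1.0, 1.0]
y = [10.0, 20.0]

level1 = axpy(2.0, x, y)  # [12.0, 22.0]
level2 = gemv(A, x)       # [3.0, 7.0]
level3 = gemm(A, B)       # [[2.0, 1.0], [4.0, 3.0]]
```

The work grows with each level: for vectors of length n and n-by-n matrices, Level 1 performs on the order of n operations, Level 2 on the order of n^2, and Level 3 on the order of n^3.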
Basic type is a type that stores a single data object (e.g., char or int).
Bristol University Docking Engine (Bude) is a generic molecular docking program that is under development.
ClearConnect Bus (CCB) is a scalable, high-bandwidth bus that connects the MTAP processors, coprocessors and memory.
ClearConnect Bus Bridge Port (CCBR) is used to connect CSX processors to each other or to FPGAs. The CCBR effectively joins the CCB in each device. The CCBR is a DDR half-duplex connection.
ClearSpeed CSX600 is a 64-bit floating-point, embedded parallel processor with 96 cores that executes up to 25 billion 64-bit floating-point operations per second (25 GFLOPS) while drawing less than 10 watts of power on average.
ClearSpeed discrete Fourier transform (CSDFT) library provides acceleration for FFT functions. It consists of two parts: a host library and a library for the ClearSpeed accelerator board. These two libraries have a common API and a number of functions in common. For more information, see the ClearSpeed Accelerated DFT Library Reference Manual.
Cn is a ClearSpeed dialect of ANSI C with extensions for data-parallel programming. The main addition to standard C is the definition of mono (scalar) and poly (parallel) data types. For an overview comparing Cn to ANSI C, see Chapter 5 of the SDK Getting Started Guide.
CSXL is the ClearSpeed math library; it provides accelerated versions of a number of standard math functions for use with the ClearSpeed accelerator boards. The supported routines are a subset of the BLAS and LAPACK libraries. For instructions on using the CSXL application acceleration library, see the CSXL User Guide.
Cycle Accurate Simulator (casim) is a highly accurate model that includes cycle timing behavior and can be used for detailed profiling of your application code. In contrast, the Instruction Set Simulator (isim) is usually much faster but is not cycle-accurate.
DDR2 SDRAM (an acronym for double data rate two synchronous dynamic random access memory) is a random access memory technology used for high-speed storage of the working data of a computer. Its primary benefit is the ability to run its bus at twice the speed of the memory cells it contains, thus enabling faster bus speeds and higher peak throughputs than earlier technologies.
DGETRF and DGETRS are the LAPACK routines that factor and solve a real double precision general system of linear equations using the LU method.
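The factor-then-solve split described above can be sketched in a few lines of pure Python. This is a minimal illustration of LU factorization with partial pivoting followed by forward and back substitution; the names lu_factor and lu_solve are hypothetical and the code does not reproduce the LAPACK calling conventions or blocking strategy.

```python
def lu_factor(a):
    """Factor a square matrix as P*A = L*U with partial pivoting.

    Returns (lu, piv): lu holds U on and above the diagonal and the
    multipliers of the unit-lower-triangular L below it; piv[k] is the
    row that was swapped with row k at step k.
    """
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv[k] = p
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv

def lu_solve(lu, piv, b):
    """Solve A*x = b using the factors produced by lu_factor."""
    n = len(lu)
    x = b[:]
    for k in range(n):            # apply the recorded row swaps to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):            # forward substitution with unit-lower L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x

A = [[2.0, 1.0], [4.0, 3.0]]
b = [3.0, 7.0]
factors, pivots = lu_factor(A)
solution = lu_solve(factors, pivots, b)  # approximately [1.0, 1.0]
```

Separating the two steps means the factorization, which is the expensive part, can be reused to solve for several right-hand sides.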
Direct Memory Access (DMA) is a fast way of transferring data within (and sometimes between) computers, usually characterized by peripheral devices transferring data to and from system memory without involving the central processor. The on-chip DMA controller can be programmed to transfer data to and from the external memory interface and any other device on the ClearConnect bus.
DORGQR is the LAPACK routine that generates all or part of the orthogonal matrix Q from a QR factorization computed by DGEQRF.
DORMQR is the LAPACK routine that multiplies a matrix by the orthogonal matrix Q from a QR factorization computed by DGEQRF.
Double Data Rate (DDR) is a data transfer method in which two data items are transferred per clock cycle. It is used by both the CSX600 memory interface and the ClearConnect bridge ports that connect CSX600 chips together.
Double Precision is the IEEE 754 floating-point format that encodes a number in 8 bytes (64 bits).
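To make the 8-byte encoding concrete, the sketch below uses Python's standard struct module to encode a value as an IEEE 754 double and split it into its sign, exponent and fraction fields. The field widths (1, 11 and 52 bits) come from the IEEE 754 standard rather than from this glossary, and the variable names are illustrative.

```python
import struct

value = 0.1

# Encode as a big-endian IEEE 754 double; the result is always 8 bytes.
encoded = struct.pack(">d", value)
size_in_bytes = len(encoded)

# Split the 64 bits into the three IEEE 754 fields.
bits = int.from_bytes(encoded, "big")
sign = bits >> 63                  # 1 bit
exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
fraction = bits & ((1 << 52) - 1)  # 52 bits

# Decoding the same bytes gives back the original Python float.
decoded = struct.unpack(">d", encoded)[0]
```

For 0.1 the biased exponent is 1019, that is 1023 - 4, because 0.1 is approximately 1.6 x 2^-4.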
Double-precision General Matrix Multiply (DGEMM) is the BLAS level-3 routine that multiplies a real double precision matrix by a real double precision matrix. It is also an important routine in the LINPACK benchmark.
DPOTRF and DPOTRS are the LAPACK routines that factor and solve a real double precision symmetric positive definite system of linear equations using the Cholesky method.
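The Cholesky factor-and-solve pair described above can be sketched as follows. This is a minimal pure-Python illustration that factors A as L times the transpose of L and then solves by two triangular substitutions; the names cholesky_factor and cholesky_solve are hypothetical and are not the LAPACK interfaces.

```python
import math

def cholesky_factor(a):
    """Return lower-triangular L such that A = L * L^T.

    A must be symmetric positive definite; only its lower triangle is read.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l

def cholesky_solve(l, b):
    """Solve A*x = b given the Cholesky factor L of A."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):            # forward substitution: L*y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T*x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

A = [[4.0, 2.0], [2.0, 3.0]]
b = [6.0, 5.0]
L = cholesky_factor(A)
solution = cholesky_solve(L, b)  # approximately [1.0, 1.0]
```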
DTRSM is the BLAS level-3 routine that solves a real double precision triangular system of equations with multiple right-hand sides. This routine constitutes a large percentage of the computation done in DGETRF and DGETRS, the LAPACK routines that respectively factor and solve a general system of linear equations.
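A triangular solve with multiple right-hand sides can be sketched as below. This covers only one case, a lower-triangular matrix applied from the left, and the name trsm_lower is hypothetical; the real DTRSM also handles upper-triangular, transposed, right-side and scaled variants.

```python
def trsm_lower(l, b):
    """Solve L*X = B for X, where L is lower triangular.

    l is an n x n list of rows; b is an n x m list of rows whose m
    columns are the independent right-hand sides.
    """
    n = len(l)
    m = len(b[0])
    x = [row[:] for row in b]
    for i in range(n):
        for c in range(m):  # each column is solved independently
            for k in range(i):
                x[i][c] -= l[i][k] * x[k][c]
            x[i][c] /= l[i][i]
    return x

L = [[2.0, 0.0],
     [1.0, 1.0]]
B = [[2.0, 4.0],
     [3.0, 5.0]]
X = trsm_lower(L, B)  # [[1.0, 2.0], [2.0, 3.0]]
```

Because the columns of B do not depend on each other, they can be processed in parallel, which is one reason this routine suits accelerator hardware.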
Error checking and correcting (ECC) is a method of detecting and correcting errors in a computer's memory subsystem.
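The idea of detecting and correcting an error can be illustrated with a Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can correct any single flipped bit. This code is chosen here purely as a small teaching example; the glossary does not say which code ClearSpeed hardware uses, and memory systems typically apply wider codes. The function names are hypothetical.

```python
def hamming_encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming_decode(codeword):
    """Return (data, error_position); error_position is 0 if no error was seen."""
    c = codeword[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1         # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], syndrome

original = [1, 0, 1, 1]
stored = hamming_encode(original)
corrupted = stored[:]
corrupted[4] ^= 1                    # simulate a single bit flip at position 5
recovered, position = hamming_decode(corrupted)
```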
Embarrassingly parallel problem is a computational problem for which no particular effort is needed to divide the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks.
FLoating point Operations Per Second (FLOPS) is a measure of a computer's performance, especially in fields of scientific calculation that make heavy use of floating-point operations; it is similar to instructions per second. For FLOPS to be useful as a measure of floating-point performance, a standard benchmark must be available on all computers of interest. One example is the LINPACK benchmark.
Floating point unit (FPU) is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root.
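As a rough illustration of how a FLOPS figure is obtained, the sketch below divides an operation count by an elapsed time. It uses the conventional count of 2*m*n*k floating-point operations for a matrix multiply; the elapsed time of 0.5 seconds is a made-up value for illustration, not a measurement of any ClearSpeed product, and the function names are hypothetical.

```python
def gemm_flop_count(m, n, k):
    """Conventional operation count for an (m x k) by (k x n) matrix multiply.

    Each of the m*n results needs k multiplies and about k adds.
    """
    return 2 * m * n * k

def gflops(flop_count, seconds):
    """Convert an operation count and elapsed time to GigaFLOPS."""
    return flop_count / seconds / 1e9

operations = gemm_flop_count(1000, 1000, 1000)  # 2,000,000,000 operations
elapsed_seconds = 0.5                           # hypothetical measured time
rate = gflops(operations, elapsed_seconds)      # 4.0 GigaFLOPS
```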
General Matrix Multiply (GEMM) is a subroutine in the Basic Linear Algebra Subprograms (BLAS) that performs matrix multiplication (i.e., the multiplication of two matrices). This includes DGEMM, for double precision.
GigaFLOPS is a measure of a computer's speed; one billion floating point operations per second (see FLOPS).
Host/debug port (HDP) is a host interface on the CSX600 processor that allows the CSX600 to communicate with, and be controlled by, the system's host processor. This port can also be used as a hardware and software debug port as it provides full access to all the internal registers on the device.
IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including negative zero and denormal numbers) and special values (infinities and NaNs) together with a set of floating-point operations that operate on these values. For more information see the IEEE 754 home at grouper.ieee.org/groups/754/
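The special values named above can be observed directly in any language whose floating-point type follows IEEE 754. The short Python sketch below is an illustration only; it assumes the platform's float is an IEEE 754 double that does not flush denormal numbers to zero, which is the case on common desktop platforms.

```python
import math
import sys

# Negative zero compares equal to zero but keeps its sign bit.
neg_zero = -0.0
zero_compares_equal = (neg_zero == 0.0)
zero_sign = math.copysign(1.0, neg_zero)  # -1.0

# A denormal (subnormal) number is smaller than the smallest normal number.
smallest_normal = sys.float_info.min
denormal = smallest_normal / 2
is_denormal = 0.0 < denormal < smallest_normal

# Overflow in an arithmetic operation produces infinity.
overflowed = sys.float_info.max * 2
is_infinite = math.isinf(overflowed)

# NaN (not a number) is the only value that is not equal to itself.
nan = float("inf") - float("inf")
nan_is_unequal_to_itself = (nan != nan)
```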
Instruction Set Simulator (isim) is a functionally accurate simulation model of the CSX600.
Linear Algebra PACKage (LAPACK) is a software library for numerical computing that depends on BLAS in order to effectively exploit the caches on modern cache-based architectures. The CSXL library implements an accelerated version of LAPACK and supports functions such as DGETRF and DGESV. See BLAS and CSXL.
LINPACK is a linear algebra software package that makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations. LINPACK is also a benchmark derived from it that consists of solving a dense system of linear equations. The LINPACK benchmark has different versions, according to the size of the system solved. The TOP500 ranking uses a version where the chosen system size is large enough to get maximum performance. For more information, read the ClearSpeed Accelerated LINPACK performance paper.
Mono execution unit in the CSX600 processes mono (scalar or non-parallel) data and handles program flow control such as branching and thread switching.
mono is the Cn multiplicity specifier keyword that specifies that the object exists in the mono domain (i.e., as a single instance). For example, mono int a; is equivalent to int a; See multiplicity specifier and poly.
Mono memory is memory associated with mono data; also referred to as local memory. There is one instance of the memory accessible by all PEs. This memory may be on chip and/or on the same PCB card as the CSX processor.
Mono variable is a variable that has one instance. This can be a basic or aggregate type.
Multiple Instruction stream, Multiple Data stream (MIMD) is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently.
Multiplicity specifier provides the means of representing poly data by allowing the programmer to specify the domain in which the declaration will exist: mono or poly. The default multiplicity is mono. See mono and poly; for a fuller explanation with examples, see Chapter 5 of the SDK Getting Started Guide.
Multiply-Accumulate (MAC) is a common operation that computes the product of two numbers and adds that product to an accumulator. A MAC-unit consists of a multiplier implemented in combinational logic followed by an adder and an accumulator register which stores the result when clocked. The output of the register is fed back to one input of the adder, so that on each clock the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers.
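In software terms, a multiply-accumulate step is simply acc = acc + a * b, and a dot product is a chain of such steps. The sketch below models that behavior in Python; it illustrates the arithmetic only, not the hardware pipeline, and the function names are hypothetical.

```python
def mac(accumulator, a, b):
    """One multiply-accumulate step: add the product a*b to the accumulator."""
    return accumulator + a * b

def dot(xs, ys):
    """Dot product expressed as a sequence of multiply-accumulate steps."""
    acc = 0.0  # plays the role of the accumulator register
    for a, b in zip(xs, ys):
        acc = mac(acc, a, b)
    return acc

result = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 4 + 10 + 18 = 32.0
```

Matrix multiplication, and therefore routines such as DGEMM, is built from many independent dot products of this form.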
Multi-Threaded Array Processor (MTAP) contains an array of 96 Processing Elements (PEs) operating in SIMD. For more information, see ClearSpeed's CSX Processor Architecture white paper.
PAPI (an acronym for Performance Application Programming Interface) aims to provide the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors. PAPI enables software engineers to see, in near real time, the relation between software performance and processor events.
PetaFLOP is a measure of a computer's processing speed and can be expressed as a thousand trillion floating point operations per second.
Peripheral Virtual Component Interface (PVCI) is a standard intended to simplify the interfacing of peripheral cores to on-chip buses in a system-on-a-chip, by standardizing the interface between a core's internals and its bus wrapper. See VCI.
poly is the Cn multiplicity specifier keyword for defining poly (parallel) data types. For example, poly int b; declares one instance of b on each PE. See multiplicity specifier and mono.
Poly execution unit is a SIMD array of 96 PEs that operate in a synchronous manner where each PE executes the same instruction on its piece of data.
Poly memory (also known as PE memory) is memory associated with poly data. Each PE has its own local block of poly memory; each instance of poly memory is only visible to the corresponding PE. Note, mono and poly memory are two physically distinct memory spaces, with their own memory maps.
Poly variable is a variable that has many instances with, typically, different data values on each poly Processing Element (PE). This can be a basic or aggregate type.
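The difference between mono and poly variables can be modeled informally in Python: a mono variable is a single value, while a poly variable behaves like a list with one element per PE, and a single instruction is applied to every element. This is a conceptual model written for this glossary, not Cn code, and it does not reflect how the Cn compiler or the hardware actually stores data. The count of 96 PEs is taken from the entries above; the variable names are hypothetical.

```python
NUM_PES = 96  # number of Processing Elements in the poly execution unit

# A mono variable has exactly one instance.
mono_scale = 3

# A poly variable has one instance per PE; here each PE holds its own index.
poly_pe_id = list(range(NUM_PES))

# One SIMD step: every PE executes the same multiply on its own piece of
# data. The mono value is shared by all PEs.
poly_result = [mono_scale * pe_id for pe_id in poly_pe_id]

first_values = poly_result[:4]  # [0, 3, 6, 9]
last_value = poly_result[-1]    # 285, held by PE number 95
```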
Processing Element (PE) is the basic PE array building block. A PE has an ALU, a 32- and 64-bit FPU, a MAC, a register file, memory and I/O.
Programmable Logic Device (PLD) is an electronic component used to build digital circuits.
Programmed Input/Output (PIO) is a mechanism for transferring data between PE (poly) memory and main (mono) memory.
Single Instruction, Multiple Data (SIMD) is a technique employed to achieve data level parallelism, as in a vector or array processor. First made popular in large-scale supercomputers (as opposed to MIMD parallelization), smaller-scale SIMD operations have now become widespread in personal computer hardware.
Single Precision is the IEEE 754 floating-point format that encodes a number in 4 bytes (32 bits).
Streaming SIMD Extensions (SSE) is a SIMD (Single Instruction, Multiple Data) instruction set designed by Intel. SSE contains 70 new instructions including both scalar and packed floating point instructions.
System-on-a-chip (SoC) refers to integrating all components of a computer or other electronic system into a single integrated circuit (chip).
TeraFLOP is a measure of a computer's speed; a trillion floating point operations per second.
Virtual Component Interface (VCI) is a proposed standard interface between a core's internals and a core's bus wrapper. The VCI is a far simpler protocol than a typical bus protocol, since it is a point-to-point transfer protocol.
ZGEMM is the BLAS level-3 routine that multiplies a complex double precision matrix by a complex double precision matrix.
ZGEMM3M is an implementation of ZGEMM that requires 3 real matrix multiplications and 5 real matrix additions to compute the complex matrix product; ZGEMM uses 4 real matrix multiplications and 2 real matrix additions. ZGEMM3M may be faster than the standard implementation under certain circumstances.
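The operation counts quoted above follow from writing each complex matrix as a real part plus an imaginary part. The sketch below shows one well-known way to arrange the arithmetic so that it uses 3 real multiplications and 5 real additions, alongside the conventional arrangement with 4 and 2. It is a pure-Python illustration under the assumption that this is the arrangement ZGEMM3M uses; the function names are hypothetical and the code is not the CSXL or BLAS implementation.

```python
def mat_mul(a, b):
    """Real matrix product A*B for matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def complex_gemm_4m(ar, ai, br, bi):
    """(Ar + i*Ai)(Br + i*Bi) using 4 real multiplications and 2 additions."""
    real = mat_sub(mat_mul(ar, br), mat_mul(ai, bi))
    imag = mat_add(mat_mul(ar, bi), mat_mul(ai, br))
    return real, imag

def complex_gemm_3m(ar, ai, br, bi):
    """Same product using 3 real multiplications and 5 additions."""
    t1 = mat_mul(ar, br)
    t2 = mat_mul(ai, bi)
    t3 = mat_mul(mat_add(ar, ai), mat_add(br, bi))  # 2 additions
    real = mat_sub(t1, t2)                          # 1 addition
    imag = mat_sub(mat_sub(t3, t1), t2)             # 2 additions
    return real, imag

Ar = [[1.0, 2.0], [3.0, 4.0]]
Ai = [[0.0, 1.0], [1.0, 0.0]]
Br = [[1.0, 0.0], [0.0, 1.0]]
Bi = [[2.0, 0.0], [0.0, 2.0]]

result_4m = complex_gemm_4m(Ar, Ai, Br, Bi)
result_3m = complex_gemm_3m(Ar, Ai, Br, Bi)
# Both give real part [[1, 0], [1, 4]] and imaginary part [[2, 5], [7, 8]].
```

Trading one matrix multiplication for three extra additions pays off when multiplication dominates the cost, which is the usual case for large matrices. The rounding behavior of the two arrangements differs slightly, so results may not match bit for bit on general inputs.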