
Tuesday, July 16, 2013

Error: Unterminated character constant beginning at (1)

http://gcc.gnu.org/ml/fortran/2007-12/msg00212.html

Error: Unterminated character constant beginning at (1)
aaa.f:4.72:

           print *, 'Try one of "Skip", "Test", "Verbosity" or "Cleanup"
                                                                       1
Warning: Line truncated at (1)

Fixed-form Fortran source lines are only 72 characters long. What happens
to longer lines is implementation dependent, but many programmers and
compilers assume that everything after column 72 is ignored (gfortran
does this). Here the closing quote of the string falls past column 72, so
it is truncated away and the character constant appears unterminated.

Solution: Fix the program by splitting the line (continuing it with a
character in column 6 of the next line), or compile with
-ffixed-line-length-n, where n is the desired line length, or with
-ffixed-line-length-none to remove the limit entirely.




Sunday, October 28, 2012

Cache Mapping

Cache to RAM Ratio

A processor might have 512 KB of Cache and 512 MB of RAM.

There may be 1000 times more RAM than cache.

The cache algorithms have to carefully select the roughly 0.1% of memory that is most likely to be accessed.

A cache line contains two fields:

-- the data copied from RAM
-- the address of the block, called the tag field

Mapping:

The memory system has to quickly determine if a given address is in the cache.

Three popular methods of mapping addresses to cache locations:

-- Fully Associative
Search the entire cache for an address.
-- Direct
Each address has a specific place in the cache.
-- Set Associative
Each address can be in any of a small set of cache locations.
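As a sketch of the direct case, an address can be split into a tag, a line index, and a byte offset with plain integer arithmetic. The geometry below (64-byte lines, 8192 lines, i.e. a 512 KB cache) is illustrative, not from the post:

```python
# Hypothetical direct-mapped cache geometry (illustrative sizes).
BLOCK_SIZE = 64    # bytes per cache line
NUM_LINES = 8192   # 512 KB cache / 64-byte lines

def direct_map(addr):
    """Split a byte address into (tag, line index, byte offset)."""
    offset = addr % BLOCK_SIZE
    block = addr // BLOCK_SIZE
    index = block % NUM_LINES   # each address has exactly one possible slot
    tag = block // NUM_LINES    # stored with the data so hits can be verified
    return tag, index, offset
```

Because the index is fixed by the address, a direct-mapped lookup needs only one tag comparison, which is why it is the simplest scheme to implement.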

Searching Problem

Finding a tag in the cache is a searching problem, so the standard search complexities apply:

Linear Search O(n)
Binary Search O(log2 (n))
Hashing O(1)
Parallel Search O(n/p)

Associative Mapping
The data from any location in RAM can be stored in any location in cache.

When the processor wants an address, all tag fields in the cache are checked to determine if the data is already in the cache.

Each tag line requires circuitry to compare the desired address with the tag field.

All tag fields are checked in parallel.
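A minimal sketch of the lookup logic follows. Python iterates over the lines sequentially; in hardware each line has its own comparator, so all of these tag checks happen in parallel:

```python
# Fully associative lookup: a block can sit in any line, so every
# tag must be compared against the requested block address.
cache = {"lines": []}   # each line: {"tag": block_address, "data": ...}

def fa_lookup(cache, block_address):
    for line in cache["lines"]:          # hardware does this in parallel
        if line["tag"] == block_address: # comparator match
            return line["data"]          # hit
    return None                          # miss

cache["lines"].append({"tag": 0x1A2B, "data": b"payload"})
```

The per-line comparator is exactly the circuitry cost mentioned above, which is why fully associative caches are kept small.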

Set Associative Mapping

Set associative mapping is a mixture of direct and associative mapping.

The cache lines are grouped into sets.
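A sketch of the combined scheme, assuming a hypothetical 4-way cache with 2048 sets (sizes are illustrative): the block address picks the set direct-mapped style, then only the few ways inside that set are searched associatively.

```python
# Set associative lookup: direct mapping chooses the set,
# associative search runs within it.
NUM_SETS = 2048   # assumed geometry, not from the post
WAYS = 4          # lines per set

def set_lookup(sets, block_address):
    s = block_address % NUM_SETS       # direct-mapped choice of set
    for way in sets[s]:                # small associative search (<= WAYS tags)
        if way["tag"] == block_address:
            return way["data"]
    return None

sets = [[] for _ in range(NUM_SETS)]
sets[5].append({"tag": 5, "data": "hit"})
```

Only WAYS comparators are needed per lookup instead of one per cache line, which is the compromise that makes set associativity popular.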

Replacement policy

When a cache miss occurs, data is copied into some location in cache.

With Set Associative or Fully Associative mapping, the system must decide where to put the data and which values will be replaced.

Cache performance is greatly affected by properly choosing data that is unlikely to be referenced again.

Replacement Options
First In First Out (FIFO)
Least Recently Used (LRU)
Pseudo LRU
Random
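LRU is the easiest of these to demonstrate in software. A minimal sketch for a fully associative cache of three lines, using an ordered dictionary to track recency (names are illustrative):

```python
from collections import OrderedDict

# Minimal LRU replacement sketch: on a miss with a full cache,
# evict the line that was accessed longest ago.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # insertion order tracks recency

    def access(self, tag):
        """Return True on a hit, False on a miss (filling the line)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: mark most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = None
        return False

c = LRUCache(3)
```

Real hardware uses Pseudo LRU precisely because tracking exact recency like this is too expensive per cache set.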

Comparison of Mapping: Fully Associative

Associative mapping performs the best, but is complex to implement. Each tag line requires circuitry to compare the desired address with the tag field.

Some special-purpose caches, such as the virtual memory Translation Lookaside Buffer (TLB), are fully associative.

Comparison of Mapping: Direct

Direct mapping has the lowest performance, but is the easiest to implement. It is often used for the instruction cache.

Sequential addresses fill a cache line and then go to the next cache line.

CUDA Shared Memory broadcast

When multiple addresses map to the same memory bank:

-- Accesses are serialized
-- Hardware splits the request into as many separate conflict-free requests as necessary
-- Exception: if all threads access the same address, the value is broadcast
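The splitting rule above can be sketched numerically. Assuming the common CUDA layout of 32 banks with one 4-byte word per bank (true on many but not all GPU generations), this toy function counts how many serialized requests a warp's word addresses would cost:

```python
# Sketch: classify a warp's shared-memory word addresses by bank,
# assuming 32 banks, one 4-byte word per bank.
NUM_BANKS = 32

def degree_of_serialization(word_addrs):
    """Number of serialized requests needed to service the warp."""
    if len(set(word_addrs)) == 1:
        return 1                     # all threads read one address: broadcast
    banks = {}
    for a in word_addrs:
        banks.setdefault(a % NUM_BANKS, set()).add(a)
    # each distinct address within a bank needs its own request
    return max(len(addrs) for addrs in banks.values())
```

Stride-1 access gives one conflict-free request, stride-2 doubles the cost, and 32 threads reading the same word still cost one request thanks to the broadcast.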

However, recent large improvements in CUBLAS and CUFFT performance were achieved by avoiding shared memory in favor of registers -- so try to use registers whenever possible.

If all threads read from the same shared memory address then a broadcast mechanism is automatically invoked and serialization is avoided. Shared memory broadcasts are an excellent and high-performance way to get data to many threads simultaneously.

It is worthwhile trying to exploit this feature whenever you use shared memory.

----Rob Farber
www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731

CUDA, Supercomputing for the masses: Part 5 

Sunday, October 14, 2012

Inside Nehalem: Intel’s Future Processor and System

http://www.realworldtech.com/nehalem/7/

L1D Cache?

Inclusive caches are forced by design to replicate data, which implies certain relationships between the sizes of the various levels of the cache. In the case of Nehalem, each core contains 64KB of data in the L1 caches and 256KB in the L2 cache (there may or may not be data that is in both the L1 and L2 caches).


This means that 1-1.25MB of the 8MB L3 cache in Nehalem is filled with data that is also in other caches. What this means is that inclusive caches should only really be used where there is a fairly substantial size difference between the two levels. Nehalem has about an 8X difference between the sum of the four L2 caches and the L3, while Barcelona’s L3 cache is the same size as the total of the L2 caches.
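The figures quoted above follow directly from the per-core sizes: four cores each duplicating up to 64 KB of L1 data plus 256 KB of L2 into the inclusive L3.

```python
# Arithmetic behind the Nehalem inclusive-cache figures (sizes in KB).
CORES = 4
L1_DATA = 64     # KB of data in L1 per core
L2 = 256         # KB per core
L3 = 8 * 1024    # KB, shared

# Worst-case data replicated in the L3 (in MB), and the L3 : total-L2 ratio.
replicated_max = CORES * (L1_DATA + L2) / 1024   # -> 1.25 MB
l3_to_l2_ratio = L3 / (CORES * L2)               # -> 8x
```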

Nehalem’s cache hierarchy has also been made more flexible by increasing support for unaligned accesses.

As a result, an unaligned SSE load or store will always have the same latency as an aligned memory access, so there is no particular reason to use aligned SSE memory accesses. 

Saturday, October 13, 2012

TLB Translation Lookaside Buffer

A TLB has a fixed number of slots that contain page table entries, which map virtual addresses to physical addresses. The virtual memory is the space seen from a process. This space is segmented in pages of a prefixed size. The page table (generally loaded in memory) keeps track of where the virtual pages are loaded in the physical memory. The TLB is a cache of the page table; that is, only a subset of its contents are stored.

Each TLB entry maps a virtual page number to the physical frame that currently holds it.
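A minimal sketch of the translation path, with a plain dict standing in for the full in-memory page table and another for the TLB (names, page size, and mappings are illustrative):

```python
# TLB sketch: a small cache of page-table entries consulted before
# the (slow) page-table walk.
PAGE_SIZE = 4096

def translate(vaddr, tlb, page_table):
    """Translate a virtual address to a physical address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                  # TLB hit: no page-table walk needed
        frame = tlb[vpn]
    else:                           # TLB miss: walk the page table
        frame = page_table[vpn]
        tlb[vpn] = frame            # cache the translation for next time
    return frame * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}   # VPN -> physical frame (illustrative)
tlb = {}
paddr = translate(4100, tlb, page_table)   # miss, then cached in the TLB
```

A real TLB would also evict entries when full (using one of the replacement policies above) and would be checked by fully associative hardware rather than a hash lookup, but the hit/miss flow is the same.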