Presentations—Gelato ICE | San Jose | April 2006

Monday, April 24

 

Welcome

Mark K. Smith, Gelato Central Operations

Welcome, introduction, and overview of Gelato Federation activities.

Presentation (pdf, 3 MB)

 

Keynote—Itanium: Its Rationale and Potential from an HP Labs Perspective

William S. Worley, Secure64 & Itanium Solutions Alliance

The Intel/HP Itanium architecture definition effort started with the results of an HP Labs research program, called PA Wide Word internally, conducted from January 1990 to December 1993. Concepts and conclusions formulated during this research program established technical principles for a fundamental advance in processor architecture and led to the Intel/HP partnership. Less noticed in published accounts is the fact that many capabilities Intel and HP jointly innovated in the Itanium architecture were specifically designed to enable construction of secure systems.

Non-security objectives have led modern general-purpose operating systems to continue to rely upon a more than 40-year-old, CPU-only, hardware protection model. This limited hardware protection model simply is incapable of supporting the levels of remote-attack security required in today's massively complex systems, in today's online world. As a result, we find vulnerable servers surrounded by vulnerable external protective appliances. All require periodic patching and re-testing. It's not clear the good guys are winning.

Intel's Itanium 2 systems now offer the means for building "inherently secure" systems. Inherently secure means that the software controlling the hardware platform has specific, strong security properties. Without an inherently secure foundation, the current trends of virtualizing servers and consolidating network protective appliances magnify, rather than mitigate, security risks. Secure64's inherently secure hardware platform control software fully utilizes the capabilities of the Itanium architecture to provide such a foundation. This offers substantial benefits for information systems and infrastructures, and can establish Itanium hardware platforms as the winners both for secure consolidation and for secure virtualization.

Presentation (pdf, 1.5 MB)

 

Oracle: An Enterprise Itanium Use Case Study

Brian Hirano, Oracle

Oracle's 4-way Itanium 2 TPC-C benchmarks, announced in November of 2002, were the culmination of a two-year project involving engineers from Intel, HP, and Oracle. Since that time, multiple groups in Oracle and Intel have continued to work closely on multiple versions of Oracle and Linux-based Itanium platforms to ensure performance and stability for enterprise solutions. This talk discusses the initial performance work and the evolution of Oracle's and Intel's focuses, and presents some of the current areas Oracle and Intel are jointly investigating.

Presentation (pdf, 300 KB)

 

Mathematical Modeling to Formally Prove Correctness

John R. Harrison, Intel

Formal verification attempts to establish the correctness of a computer artifact (hardware, software, microcode, protocol, etc.) by rigorous modeling and mathematical proof, rather than merely by testing or simulation. Formal verification in the hardware industry is widely practiced, and increasingly seen as necessary. We can perhaps identify at least three reasons:

  • Hardware is designed in a more modular way than most software, with refinement an important design method. Constraints of interconnect layering and timing mean that one cannot really design "spaghetti hardware."
  • More proofs in the hardware domain can be largely automated, reducing the need for intensive interaction by a human expert with the mechanical theorem-proving system.
  • The potential consequences of a hardware error are greater, since such errors often cannot be patched or worked around, and may in extremis necessitate a hardware replacement.

It is not surprising that a considerable amount of effort has been devoted to the floating-point domain. Floating-point algorithms have proven themselves difficult to get right. Yet in marked contrast to some other targets for formal verification, it is not hard to come up with widely accepted formal specifications of how floating-point operations should behave. In fact, many operations are specified almost completely by the IEEE Standard governing binary floating-point arithmetic. However, in some other respects, floating-point operations present a difficult challenge for formal verification. We will describe some of our work in formally verifying algorithms for operations such as division, square root, and transcendental functions for the Intel Itanium architecture.
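The IEEE specification for division is compact enough to state in a few lines: the result must be a representable value nearest to the exact quotient. As a minimal sketch (a sample-based check, not the formal proof the talk describes), the following Python compares a double-precision quotient with the exact rational quotient and confirms that neither neighboring double is closer. The helper name `is_correctly_rounded` is hypothetical, and the check assumes the default round-to-nearest mode and finite, nonzero results.

```python
import math
from fractions import Fraction

def is_correctly_rounded(a: float, b: float) -> bool:
    """Check that a / b is a double nearest to the exact quotient.

    Hypothetical helper for illustration; assumes round-to-nearest
    and a finite, nonzero result.
    """
    q = a / b                            # IEEE 754 double division
    exact = Fraction(a) / Fraction(b)    # exact rational quotient
    err = abs(Fraction(q) - exact)
    # Neither neighboring double may be strictly closer than q.
    below = math.nextafter(q, -math.inf)
    above = math.nextafter(q, math.inf)
    return (err <= abs(Fraction(below) - exact)
            and err <= abs(Fraction(above) - exact))

samples = [(1.0, 3.0), (2.0, 7.0), (1e300, 3.0), (5.0, 1e-10), (-7.5, 0.3)]
results = [is_correctly_rounded(a, b) for a, b in samples]
```

Testing a handful of inputs this way can only sample the specification; a formal proof has to cover every input, which is where the difficulty the abstract mentions comes from.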

Presentation (pdf, 100 KB)

 

Preparing for the First Beam at the LHC

Lawrence Pinsky, University of Houston

The Large Hadron Collider (LHC) at CERN, the European Laboratory for Particle Research in Geneva, Switzerland, expects its first beam next year. This is the culmination of a construction project lasting more than a decade, including the development of the supporting software and computing models. ALICE is one of the four major detectors being prepared for physics at the LHC, and the University of Houston is a member of the US contingent of institutions involved in that experiment. Along with the Ohio Supercomputer Center and the facility at NERSC (LBL), the University of Houston Itanium cluster has been participating in the increasingly severe sequence of "data challenges" that are being wrapped up now in preparation for the actual turn-on of the LHC. The ALICE computing model is necessarily dependent upon a grid-based model that will include many different platforms, Itanium among them. The data challenges have provided a good venue to compare the relative attributes of the various platforms in running the kind of simulations and analysis codes that are relevant to particle physics applications. An overview of these results will be presented along with a summary of the overall ALICE computing plans and the status of its deployment.

Presentation (pdf, 2.4 MB)

 

Computing Optimal Equilibrium Strategies for Network Economies

Alejandro Jofré, University of Chile

Models for regulating, planning, and operating network-based industries such as energy, transportation, and telecommunications are key tools from which these industries can benefit today. These models correspond to large stochastic optimization/equilibrium problems, which are very difficult to solve. In this talk, we will show three new distributed algorithms/strategies to compute a solution and their implementation on Itanium 2 clusters. This family of models and solutions is currently used by several companies and institutions participating in these industries.

Presentation (pdf, 3 MB)

 

Basic Itanium Architecture

Cameron McNairy, Intel

The Itanium architecture and the paradigm of explicitly parallel instruction computing (EPIC) are often poorly understood. This presentation will cover important aspects of the EPIC paradigm, including software pipelining, the register stack engine, predication, parallel instruction groups, data and control speculation, and many other mysteries of the Itanium application and system architectures.

Presentation (pdf, 560 KB)

 

Columbia Application Tuning Case Studies

Johnny Chang, National Aeronautics and Space Administration

This talk will present several case studies of application performance enhancements on the SGI Altix platform. The enhancements include both explicit (dplace) and implicit (cpubind/cpuset_pin) process-pinning, eliminating memory contention in OpenMP applications, eliminating unaligned memory accesses, and system profiling. These enhancements enabled 2- to 20-fold improvements in application performance.

Presentation (pdf, 15.7 MB)

 

Kernel Optimization for Enterprise Workloads

Kenneth Chen, Intel

Linux has been receiving a great deal of attention in the past few years. Its popularity is being propelled by wide adoption of Linux for enterprise computing. Major software vendors have been supporting their products on Linux for many years. As the enterprise software solution stack builds up every day, it is crucial that Linux kernel development takes this opportunity to ensure that the kernel provides the necessary infrastructure for enterprise applications to excel. This means developing enterprise-focused OS features, improving performance by extending scalability, and improving many other areas.

Adding to the excitement, the Intel Itanium 2 processor is built with many innovative features that push the performance envelope. Featuring massive caches and CPU execution resources, EPIC technology (Explicitly Parallel Instruction Computing) provides a variety of optimization opportunities. In this talk, we will highlight kernel optimization work done on Linux-ia64, ranging from several critical low-level assembly routines to generic kernel components. We will present how the Linux-ia64 kernel utilizes Itanium architecture features to extend scalability and performance for enterprise workloads.

Presentation (pdf, 400 KB)

 

Mathematical Libraries and the Implementation of Parallel Solvers for Engineering

Hugo Daniel Scolnik, University of Buenos Aires

Our research is focused on developing a highly efficient parallelizable solver of huge systems of linear equations that arise from finite element discretizations of complex nonlinear engineering problems. Those problems are nonlinear, require many linearizations, and hence several days of CPU time on Itanium platforms. Another important application is the reconstruction of tomographic images.

This work includes a comparison of mathematical libraries such as MKL (Linux) and MLIB (HP) in terms of their performance on numerical problems, using sequential and parallel implementations. The new solver uses BLAS routines at levels 1, 2, and 3, excluding complex data types. The conclusions of our study present the results obtained with several problems.

Presentation (pdf, 1.1 MB)

 

HP Caliper: An Update to the Linux IPF Performance Tool

Curt Wohlgemuth, HP

Steve Williams, HP

HP Caliper is a sophisticated general-purpose performance analysis tool that takes advantage of the Itanium processor's advanced performance monitoring unit to provide detailed and accurate performance measurements at the application and system level with minimal perturbation to the system's behavior.

Besides an overview of HP Caliper, we will discuss new features, including system-wide profiling and a new graphical user interface based on the rich client platform of Eclipse.

Presentation (pdf, 880 KB)

 

The ISP RAS Effort to Improve GCC for Itanium

Arutyun I. Avetisyan, Institute for System Programming, Russian Academy of Sciences

Ongoing work at ISP RAS on improving GCC for Itanium processors will be presented. Discussion will cover a past project with HP on improving GCC instruction scheduling and the current effort on implementing a new VLIW-targeted instruction scheduler. Future plans for improving GCC for Itanium and potential collaboration projects will also be presented, including plans for a GCC meeting in Moscow this summer.

Presentation (pdf, 992 KB)

 

Suggested Improvements in Itanium and Software

Clemens C. J. Roothaan, Gelato Honorary Member

In general, the Itanium is a major step forward in computer design. Nevertheless, there are still gaps in the instruction repertoire, and the specifications of some instructions could be expanded or modified. There are also some mandates by C++ concerning corner cases, which cannot be justified by any mathematical reasoning whatsoever; there is even one IEEE mandate that cannot pass muster. A detailed list of shortcomings and possible remedies will be presented for your consideration.

Spreadsheet (xls, 173 KB)

 

An Evaluation of High Performance Octave on Itanium

Ashok Krishnamurthy, Ohio Supercomputer Center

GNU Octave is a MATLAB-style interactive application for performing numerical computations. The Octave language is mostly compatible with MATLAB. MATLAB (and Octave) are being used as an executable specification language to develop synthetic compact applications for the DARPA HPCS program. This work has identified a clear need for a MATLAB-style interpreter that can handle large address spaces, run on multiple processors, and leverage high-performance interconnects.

The Ohio Supercomputer Center (OSC), Ohio State University, and Indiana University have been collaborating on research and software technologies for parallel Octave. We have constructed a version of parallel Octave for the Itanium 2 cluster at OSC. This interpreter has a 64-bit address space for large matrix support and uses the high-bandwidth Myrinet interconnect. This talk will review the software architecture, performance and scalability of parallel Octave on the OSC Itanium 2 cluster.

Presentation (pdf, 453 KB)

 

VTune Update

Paul M. Cohen, Intel

This talk will cover what's new for tuning Intel Itanium 2-based applications, including native Eclipse IDE and NUMA-aware support for data collection.

Presentation (pdf, 501 KB)

 

GCC IP Issues

Dan Berlin, Google

This talk will cover a variety of intellectual property issues that come up while working on GCC, including:

  • Copyright: Assignments of copyright, and how we deal with issues of contributions of code from other open source/commercial projects.
  • Patents: How we deal with them in GCC, and what we require of companies that are going to contribute to GCC.
  • General other issues related to intellectual property and GCC.

Presentation (pdf, 213 KB)

 

Open64: An Alternative Backend for GCC

Shin-Ming Liu, HP

While GCC's Tree-based SSA optimization has been making good progress, the Itanium processor may benefit more in the near future from alternate high-performance optimizations. The Open64 compiler is the basis of the Open Research Compiler (ORC), which Intel has been promoting for Itanium-specific optimizations over the past couple of years. This effort aims to present Open64 as an alternative backend for GCC/G++ on the Itanium/Linux platform. In addition to Itanium, this alternative backend supports the EM64T/IA32 target as well as several other embedded processors. In alignment with this effort, HP is coordinating the update of the GCC/G++ front-end and driving the quality on the Itanium/Linux platform. In this talk, the short- and long-term perspectives of this alternative backend will be presented.

Presentation (pdf, 485 KB)

 

Evolution of PCI IO: A Linux IO Geek's Perspective on HW

Grant Grundler, HP

PCI has been around since 1993 and has seen substantial changes since its conception. New features and functionality have been introduced with each generation (e.g. 64-bit, 3.3v, MSI, MSI-X, Split transactions, etc). PCI-e is the latest generation and is *not* HW compatible with previous generations. This gave HW vendors the "opportunity" (forced them really) to re-implement and take advantage of some of the features PCI-e offers.

This talk will explain a few PCI features and broken HW implementations, and will cover the reasons why PCI-e is an improvement over previous PCI-X implementations.

Presentation (pdf, 1.1 MB)

 

Superpages / VM Work

Ian Wienand, University of New South Wales

This talk will present a short overview of Gelato@UNSW's latest work on issues relating to Itanium Linux virtual memory. Our work revolves around both taking advantage of unique properties of the Itanium MMU and some more "radical" ideas for overhauling parts of the Linux VM layer. Topics touched on will include using the long-format VHPT, strategies for providing dynamic superpages, and approaches for greater abstraction within the Linux VM implementation.

Presentation (pdf, 65 KB)

 

An Update on Xen on Itanium

Alex Williamson, HP

Xen is rapidly becoming the de facto standard for open-source virtualization, with capabilities and performance matching or exceeding leading industry products. Paravirtualization techniques, efficient inter-domain virtual I/O mechanisms, clever migration, and support for multiple architectures (including VT and Pacifica hardware) have contributed to a broad base of developers and piqued industry interest. Xen/ia64 is the first non-x86 architecture supported by Xen. It is still a work-in-progress, but the core hypervisor component utilizes code and/or experience from Xen, Linux/ia64, and the HP vBlades research project. Many interesting strategies are employed to ensure correctness, optimize performance, and leverage the many rapidly developing layers of tools provided by Xen.

We will provide a brief overview of virtualization in general, Xen specifically, and the current status of Xen/ia64. Then, we will spend the remaining time discussing some interesting details about the inner workings of Xen on Itanium.

Presentation (pdf, 1.8 MB)

 

An Update on the Current State of Open|SpeedShop

Jack Carter, SGI

Open|SpeedShop is SGI's next generation Linux performance analysis tool. Based on the concepts of SGI's IRIX SpeedShop, Open|SpeedShop is designed to be modular and easily extendable. It supports the concept of plugins, which allow users to create their own performance experiments. Another key feature of the performance tool is its usability. Its user interface is designed for scientists in general, not just computer scientists. Open|SpeedShop currently supports four user interfaces: a GUI, an interactive command line, a batch command file, and a pure Python module. The Open|SpeedShop baseline functionality includes support for single system image (SSI) machines and for clusters (i.e. multiple OS kernels).

Current experiments are exclusive and inclusive user time, program counter (PC) sampling, MPI call tracing, input/output tracing, floating point exception tracing, and CPU hardware performance counter experiments. Open|SpeedShop enables FORTRAN (77, 90, and 95), C, and C++ programmers to use an advanced performance analysis tool within the open-source environment. The infrastructure and base components are released as open source under the GPL and LGPL licenses. Open|SpeedShop is being co-funded by the Department of Energy (DOE).

Presentation (pdf, 962 KB)

 

Aliasing in GCC

Dan Berlin, Google

This talk will cover aliasing in GCC, including:

  • An overview of the algorithms used to generate aliasing information.
  • An overview of how the aliasing information is represented in GCC's IR.
  • The improvements made in recent GCC versions to both of the above.

Presentation (pdf, 239 KB)

 

Superblock Update

Robert Kidd, University of Illinois at Urbana-Champaign

Superblock scheduling is a common technique to increase the level of ILP in generated code. By performing tail duplication, a Superblock-forming compiler creates a longer extended basic block, simplifying the task of moving instructions across basic block boundaries. More significantly, the control flow into the duplicated tail is dramatically simplified. This allows the compiler to draw much tighter bounds on the conditions that exist when the block is executed and allows the code in the block to be specialized for those conditions. This combination of radical control flow transformation followed by specializing optimizations, termed structural compilation, has been shown in the OpenIMPACT compiler to be particularly useful in developing ILP when compiling for the Itanium processor.

As a first step toward developing structural compilation techniques in GCC, we implemented Superblock formation at the Tree-SSA level. By performing structural transformations early, we give the compiler's high level optimizers an opportunity to specialize the transformed program, thereby cultivating higher levels of ILP. The early results of this modification are mixed, with some benchmarks improving and others slowing. I will present the effects of this structural transformation on later optimizations and thoughts on the changes that will be necessary to allow optimizations to benefit from this transformation.
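The tail duplication step described in this abstract can be sketched on a toy control flow graph. The following Python is an illustration under stated assumptions, not the GCC or OpenIMPACT implementation: the function name `tail_duplicate`, the dictionary representation of the graph, and the block names are all hypothetical, and the trace is assumed to be acyclic. Blocks on the trace that can be entered from outside the trace are copied, and the side entrances are redirected to the copies, leaving a single-entry trace.

```python
def tail_duplicate(cfg, trace):
    """Return a copy of cfg in which `trace` has no side entrances.

    cfg maps a block name to its list of successor names; trace is a
    path through cfg (assumed acyclic). Hypothetical toy model.
    """
    new_cfg = {b: list(s) for b, s in cfg.items()}

    def side_preds(j):
        # Predecessors of trace[j] other than the previous trace block.
        return [p for p, succs in cfg.items()
                if trace[j] in succs and p != trace[j - 1]]

    first = next((j for j in range(1, len(trace)) if side_preds(j)), None)
    if first is None:
        return new_cfg  # the trace is already a superblock

    # Copy the tail of the trace, starting at the first side entrance.
    clone = {b: b + "'" for b in trace[first:]}
    for b in trace[first:]:
        new_cfg[clone[b]] = [clone.get(s, s) for s in cfg[b]]

    # Redirect every side entrance into the copied tail.
    for j in range(first, len(trace)):
        for p in side_preds(j):
            new_cfg[p] = [clone[trace[j]] if s == trace[j] else s
                          for s in new_cfg[p]]
    return new_cfg

# Diamond-shaped graph: A branches to B or C, and both rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
superblock_cfg = tail_duplicate(cfg, ["A", "B", "D", "E"])
```

After the transformation, block D is reached only from B, so a compiler could specialize D for the conditions that hold on the A, B path, while C flows into the separate copy D'.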

Presentation (pdf, 165 KB)

 

An Interblock VLIW-Targeted Instruction Scheduler for GCC

Andrey Belevantsev, Institute for System Programming, Russian Academy of Sciences

Modern VLIW architectures (e.g. Itanium) require instruction level parallelism (ILP) to be explicitly exposed by a compiler. An instruction scheduler is a key compiler component for utilizing ILP. The current GCC scheduler has a number of pitfalls in approaching this goal, including: the oldest interblock scheduling algorithm, non-optimal region formation, a traditional two-pass execution scheme, and lack of transformations for eliminating false dependencies.

This presentation will cover an ongoing approach for implementing a new aggressive instruction scheduler for GCC. The scheduling algorithm is based on a selective scheduling approach. It is mainly targeted for VLIW-like platforms, but the framework being implemented is general enough and it can be used for other targets in the future. The key features of the approach are as follows: works with DAG regions, supports code motion with adding bookkeeping insns, supports register renaming and forward substitution, and integrates with software pipelining. We will discuss the algorithm and its adaptation to GCC, implementation issues, and the current state of the project.

Presentation (pdf, 169 KB)

 

Parallel Programming with GCC

Diego Novillo, Red Hat

Multiprocessor systems are becoming increasingly popular, but taking advantage of their parallel capabilities is not always straightforward. Software developed for these systems must explicitly make use of concurrency.

In this talk, I will describe two recent additions to the GNU Compiler Collection (GCC) for developing software that can take advantage of parallelism: vectorization and OpenMP. Vectorization is a compiler feature that takes advantage of the multimedia capabilities of modern CPUs by offloading the execution of some inner loops into separate co-processors. OpenMP is a standard specification of compiler directives for C, C++, and FORTRAN. It provides new directives to specify parallelism, synchronization, and data sharing. This talk will describe both features in detail, provide usage examples, and give tips to take full advantage of these features when developing your applications.

Presentation (pdf, 395 KB)

Tuesday, April 25

 

Keynote—Trends in Computer System Design

Jerry Huck, HP

This presentation will examine the issues and tradeoffs in high-performance commercial system design. The current family of chipsets and system enclosures from HP will be used to examine how system requirements influence design choices. These requirements include performance, power, reliability, availability, serviceability, and manageability.

Presentation (pdf, 1.5 MB)

 

Enterprise Graphics on IPF

Hansong Zhang, SGI

Large shared memory architectures enable friendly programming models and allow efficient processing and visualization of large data sets produced in the areas of computer-aided design (CAD), science and engineering simulations, and new high-resolution sensor technology. In this talk, we'll look at the SGI Altix multiprocessor systems as an example of large shared memory architectures. We'll then showcase applications that have a large memory footprint in genome matching and visualization of CAD and high-resolution sensor data.

Presentation (pdf, 3.5 MB)

 

Valgrind

Julian Seward, OpenWorks

Valgrind is a GPL'd suite of simulation-based debugging and profiling tools for Linux. Around a common core a number of tools have been built, two of which are Memcheck, a memory error detector, and Cachegrind, a low-level cache profiler. The system is structured as a common core, which provides CPU virtualization, debug info management, and error management, and handles other simulation nasties, particularly signals, threads, and syscalls. The rich set of services provided by the core makes it relatively easy to build sophisticated dynamic analysis tools. The project Web site is http://www.valgrind.org.

Valgrind currently runs on {x86,amd64,ppc32,ppc64}-linux. A key component is dynamic-translation based CPU virtualization. This converts blocks of code into an architecture-neutral intermediate representation, hands them to the currently active tool for instrumentation, and then re-synthesizes runnable code from them. In this talk, I will take a look at the challenges of porting this and other important Valgrind components to Itanium.

Presentation (pdf, 260 KB)

 

LTO: A Brief Introduction

Mark Mitchell, CodeSourcery

Many compilers have obtained significant performance wins by using "link-time optimization," i.e. by performing optimizations that cross the boundaries of a single program unit. For example, if the argument to a function is a constant in one module and the function is defined in another module, the result of the function call may be constant as well. But compilation of either module independently cannot determine that fact.
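The constant-argument case can be made concrete with a toy model. The sketch below is not GCC's design and says nothing about the serialization approach the talk proposes; the module dictionaries, the tuple-based expression format, and the `fold` function are hypothetical constructs for illustration. It folds a call to a constant only when the callee's body is visible, which is the situation a link-time view provides.

```python
# Each toy "module" maps a function name to (parameter names, body).
# Expressions are ints, parameter names, ("mul" | "add", lhs, rhs),
# or ("call", function name, argument). All names are hypothetical.
module_a = {"main": ((), ("call", "scale", 21))}
module_b = {"scale": (("x",), ("mul", "x", 2))}

def fold(expr, functions, env=None):
    """Fold expr to an int where possible; otherwise return it as-is."""
    env = env or {}
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env.get(expr, expr)
    op = expr[0]
    if op == "call":
        _, name, arg = expr
        arg = fold(arg, functions, env)
        if name in functions and isinstance(arg, int):
            (param,), body = functions[name]   # single-parameter callees only
            return fold(body, functions, {param: arg})
        return ("call", name, arg)  # callee body not visible: cannot fold
    _, lhs, rhs = expr
    lhs, rhs = fold(lhs, functions, env), fold(rhs, functions, env)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return lhs * rhs if op == "mul" else lhs + rhs
    return (op, lhs, rhs)

main_body = module_a["main"][1]
separate = fold(main_body, module_a)                 # module A compiled alone
whole = fold(main_body, {**module_a, **module_b})    # whole-program view
```

Compiled alone, module A leaves the call in place; with both modules visible, the call folds to the constant 42.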

The GNU compiler collection (GCC) does not presently implement link-time optimization, although it does provide a limited form of inter-module optimization, as implemented by Geoff Keating. Working with partners at AMD, HP, and IBM, we have developed a proposal for implementing link-time optimization in GCC based on serializing GCC's existing data structures. Thus, our proposal is conservative in that it leverages GCC's existing data structures and requires only minimal changes to GCC's core optimizers. A significant advantage of our approach is that the serialized data structures will be available to other consumers, such as program analyzers and IDEs. Finally, our approach would facilitate the implementation of the most significant missing feature in G++: the "export" keyword.

Presentation (pdf, 365 KB)

 

Local and Remote Memory: Memory in a NUMA System

Christoph Lameter, SGI

Memory becomes difficult to handle in a NUMA system because storage is available at various "distances" from the running process. A higher distance means longer latency or less bandwidth, and therefore implies slower access to memory. Performance in a NUMA system depends on assigning available memory to processes in such a way that memory access speed is optimized. The kernel has various mechanisms to automatically or manually control NUMA memory placement.

The page allocator attempts to locate memory that is near the node where a process is executing. However, if the data is to be later used by processes running on other nodes, then memory would not be allocated in the best way. The kernel allows manual control of memory allocation per process via memory allocation policies. Similar issues occur in the SLAB allocator. The SLAB allocator was revised last year in order to ensure that allocations occur in an optimal way and that allocations are controllable in the same way as the page allocator.

The kernel itself must be aware of where its own data structures will be placed and ensure that data to be used by certain processors is on memory nodes local to these processors. Improvements in this area enhanced placement of core kernel structures and also allow device drivers to place their data local to hardware devices. Finally, the kernel now has the ability to migrate the physical location of pages to improve performance after a process has been reassigned to a processor on another node.
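The effect of placement and page migration can be illustrated with a small cost model. This is a sketch under assumptions, not kernel code: the two-node distance table uses made-up values in the style of the node distances that `numactl --hardware` reports (10 for local memory, larger for remote), and the functions `access_cost` and `migrate` are hypothetical.

```python
# Hypothetical two-node distance table (10 = local, 20 = remote).
# The values are illustrative only.
DISTANCE = [
    [10, 20],
    [20, 10],
]

def access_cost(cpu_node, page_nodes):
    """Sum of distances from the CPU's node to the node holding each page."""
    return sum(DISTANCE[cpu_node][n] for n in page_nodes)

def migrate(page_nodes, target_node):
    """Model page migration: every page moves to target_node."""
    return [target_node for _ in page_nodes]

pages = [0] * 4                       # allocated locally while running on node 0
cost_local = access_cost(0, pages)    # process still runs on node 0
cost_moved = access_cost(1, pages)    # process reassigned to node 1
cost_after = access_cost(1, migrate(pages, 1))  # pages follow the process
```

In this model the cost doubles when the process moves away from its pages and returns to the local figure once the pages are migrated, which is the motivation for the migration facility the abstract describes.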

Presentation (pdf, 124 KB)

 

Experiences on the Itanium-Based Grid Test Bed at UPRM

Wilson Rivera, University of Puerto Rico Mayaguez

The Parallel and Distributed Computing Laboratory (PDCLab) at the University of Puerto Rico, Mayaguez, has deployed an experimental grid test bed to perform research in the area of grid computing. The PDCLab grid test bed was deployed using components that allow flexible re-configuration, management, and programmability. The test bed was built upon heterogeneous components including an Itanium-based cluster. This presentation discusses the hardware and software configurations of the grid test bed, the rationale used to choose each of the grid components, and the research issues being investigated.

Presentation (pdf, 1.4 MB)

 

LLVM: A Brief Introduction

Chris Lattner, Apple

This talk will provide a brief introduction to LLVM (http://llvm.org), focusing on LLVM's robust interprocedural link-time optimization, runtime optimization, and just-in-time code generation support. Work is currently underway to integrate LLVM's mid-level and interprocedural optimization capabilities into the GNU Compiler Collection (GCC). Design, implementation, and status of GCC integration will be discussed.

Presentation (pdf, 262 KB)

 

In Search of Collaboration

Ping-Hui Kao, HP

In advancing Linux on Itanium, there are many technical areas crying out for collaboration. HP is, in particular, interested in collaborating with research institutes, universities, and vendors in three areas: scalability, virtualization, and GCC. We believe these three areas are critical to the success of Itanium. In this presentation, we will present the needs in the three areas from HP's viewpoint. We will have short discussions on your needs as well. The goal is to stimulate off-line discussions concerning potential collaborations. In some cases, it could lead to funding from HP. Please join us to start more collaborations.

Presentation (pdf, 117 KB)

Wednesday, April 26

 

Keynote—The Road Ahead: Intel Itanium Architecture and Software

Don Soltis, Intel

James Reinders, Intel

Join us to hear about the road ahead for Itanium processors from hardware and software experts working at Intel. Itanium processor-based systems are winning in traditional RISC markets such as scalable enterprise, high-performance computing (HPC) and mainframe replacement. These markets require robust throughput, scalar and floating-point (FP) performance of the processor, as well as its memory and I/O system. Future Itanium processor designs will feature increased core count, higher operating frequencies, increased memory bandwidth, and lower memory latency. Itanium processor-based systems span from a few to thousands of processors.

Itanium processors are well suited for these varied environments because of high reliability, agile configurability, and strong software support. Enhanced reliability results from specialized soft error resistant circuits, integrated checking and error recovery algorithms along with extensive error checking and correction of array elements and datapaths. Processor reliability is increasingly critical due to virtualization applications because virtual processors may exist on each physical processor and an unrecoverable soft error of a physical processor would affect many virtual processors. Intel designs excel at addressing this challenge. Itanium processor designs also benefit greatly from the unmatched manufacturing capabilities and silicon processing experience of Intel, as well as a strong software ecosystem and excellent software development products.

Presentation (pdf, 1.5 MB)

 

Highlights of the Upcoming October Gelato Conference

Jon Lau, National Grid Office

This presentation will highlight the next Gelato ICE: Itanium Conference & Expo to be held October 1-4 in Singapore.

Presentation (pdf, 1.5 MB)

 

A Dynamic Instrumentation-Based System for Building Program Analysis Tools for the IPF Platform

Jasper Kamperman, Intel

We will present a dynamic instrumentation-based system called Pin for building a variety of program analysis tools for the IPF platform. In this talk, we will introduce the basic concepts of dynamic instrumentation and provide details of the inner working of this system. We will also talk about various optimizations that happen in this system to ensure that programs running under control of Pin perform reasonably well. Some specific features of IPF, which create challenges for building a system like Pin, will be explored. We will provide several real world examples of how this system has been used for building program analysis tools. We will also talk about various applications of this system in building tools for architecture research and performance analysis.

Presentation (pdf, 1.4 MB)

 

Itanium Virtualization and vNUMA

Matthew Chapman, University of New South Wales

In recent years, virtualization has become a hot technology, being widely deployed for applications such as server consolidation. However, Itanium, like x86, was not originally designed with virtualization as a goal. In this presentation, I will talk about the challenges of virtualizing the Itanium architecture. I will present the various possible approaches, including para-virtualization, pre-virtualization (an automated technique we have developed), and hardware-assisted virtualization in the form of Intel Virtualization Technology.

I will also provide an overview of vNUMA, a novel application of these virtualization techniques. vNUMA provides a virtual ccNUMA-like environment on a cluster, by transparently implementing shared memory underneath the operating system. Thus, a single instance of an existing operating system such as Linux can run across multiple nodes of a cluster. While the general principles are applicable to any architecture, the initial version has been built for Itanium systems.

Presentation (pdf, 124 KB)

 

Hardware Overview

Jeff Donsbach, HP

This presentation will be an overview of the current Itanium product lines offered in the marketplace and a quick summary of the integral system specifications that set these systems apart.

Presentation (pdf, 427 KB)

 

Blktrace: An Overview

Alan Brunelle, HP

"You can't count what you can't measure" is an old software engineering truism that inspires one to develop means to accurately and efficiently measure the various subsystems within Linux in order to make concrete performance improvements to the Linux kernel itself. Given that measuring how Linux manages I/O is a key component towards understanding overall system performance, Jens Axboe has recently been working on a new capability within Linux called Blktrace, which allows one to efficiently capture block I/O subsystem events for later analysis.

This presentation will start by providing an overview of Blktrace through a discussion about its kernel implementation and an overview of the utilities provided to capture traces. We will then show how it is currently being used to measure the LVM/DM subsystem as part of an effort to understand Linux IO performance from top-to-bottom.

Presentation (pdf, 495 KB)

 

Bioinformatics in Biomining

Nicholas Loira, University of Chile

Andres Aravena, University of Chile

Given the explosive growth of genomic databases in recent times, the development of efficient searching tools becomes more relevant every day. In particular, the design of biochemical elements used for bioidentification experiments requires search algorithms incorporating specific biological constraints. These experiments are designed to identify the biological diversity of metagenomic or environmental samples and are useful in ecology, environmental studies, and infection diagnosis, among others. We are focusing on text search algorithms for short words (under 60 symbols) where a small number of substitutions are allowed. The databases used are on the order of gigabytes. We have developed an efficient solution for this problem, which can take advantage of the Itanium 2 architecture. In this work, we will present a comparative study of the performance of this algorithm on several architectures. This work is being developed at the Laboratory of Bioinformatics and Mathematics of Genome, Center for Mathematical Modeling, University of Chile.
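To make the search problem concrete, here is a minimal baseline sketch of matching a short word against a text while allowing a bounded number of substitutions. It is a naive sliding-window scan, not the authors' algorithm (which the abstract does not describe); the function name `find_with_substitutions` and the sample sequence are assumptions for illustration only.

```python
def find_with_substitutions(text, word, max_subs):
    """Return (position, mismatches) for every window of text that
    differs from word in at most max_subs positions.

    Naive O(len(text) * len(word)) baseline; only substitutions are
    allowed, so every candidate window has the same length as word.
    """
    hits = []
    m = len(word)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], word):
            if a != b:
                mismatches += 1
                if mismatches > max_subs:
                    break  # too many substitutions, abandon this window
        else:
            # Inner loop finished without break: window is within budget.
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    print(find_with_substitutions("ACGTACGAACGT", "ACGT", 1))
```

On gigabyte-scale databases a scan like this would serve only as a correctness reference against which faster methods can be checked.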

Presentation (pdf, 260 KB)

 

MCA: Machine Check Architecture

Cameron McNairy, Intel

The Itanium Machine Check Architecture (MCA) is at the center of the Itanium reliability, availability, and serviceability (RAS) approach. Itanium's MCA defines methods and requirements that tie together the processor, processor abstraction layer (PAL), system abstraction layer (SAL), operating system (OS), and application. This presentation will cover the various components and their roles, and then turn the focus to the MCA foundations: the PAL and the processor that it abstracts.

Presentation (pdf, 520 KB)

 

Itanium Firmware (EFI)

Jeff Donsbach, HP

This session provides an overview of the extensible firmware interface (EFI), which is used to manage system boot, install, diagnostics, and firmware properties.

Presentation (pdf, 674 KB)

 

Scaling Linux to 512 Processors and Beyond

John Hawkes, SGI

SGI's Altix family of servers currently supports up to 512 Intel Itanium 2 processors and four terabytes of cache-coherent shared main memory, and newer platforms will substantially increase those limits. Some high-performance computing workloads benefit from executing on maximum hardware configurations and in a single system image environment. In the past few years as hardware capacity has increased, SGI and the Linux community in general have pushed kernel scalability to keep up. This presentation discusses the technical challenges of scaling to hundreds, even thousands, of processors and many terabytes of memory, what has been done to overcome those challenges, and what work remains.

Presentation (pdf, 150 KB)

 

Decimal Floating-Point

John Crawford, Intel

The IEEE 754 Floating-Point Standard is up for revision. A major new addition is a decimal data type and computation rules. This talk will define and motivate the need for decimal (vs binary) floating-point (FP), and demonstrate that a binary-integer based implementation is effective for high-performance software emulation, as well as being amenable to sharing hardware with binary FP units for maximum efficiency and leverage.
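The motivation for a decimal type can be seen in a few lines. The sketch below uses Python's standard `decimal` module to contrast binary and decimal arithmetic, then shows the general idea of holding a decimal value as an ordinary integer coefficient plus a power-of-ten exponent. The helper `add_scaled` is hypothetical and ignores rounding, precision limits, and the encodings defined by the standard, so it should be read as an illustration rather than as the implementation discussed in the talk.

```python
from decimal import Decimal

# 0.1 and 0.2 have no exact binary representation, so the binary sum
# differs from 0.3; the decimal sum is exact.
binary_equal = (0.1 + 0.2 == 0.3)                                   # False
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True


def add_scaled(a, b):
    """Add two decimals given as (integer coefficient, exponent) pairs,
    where each pair stands for coefficient * 10**exponent.

    Hypothetical sketch: aligns both operands to the smaller exponent
    and adds the integer coefficients, with no rounding step.
    """
    (ca, ea), (cb, eb) = a, b
    e = min(ea, eb)
    return (ca * 10 ** (ea - e) + cb * 10 ** (eb - e), e)


if __name__ == "__main__":
    print(binary_equal, decimal_equal)
    print(add_scaled((15, -1), (25, -2)))  # 1.5 + 0.25 as (coefficient, exponent)
```

Keeping the coefficient as a plain integer means the addition itself is ordinary integer arithmetic; only the exponent alignment involves powers of ten.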

Presentation (pdf, 118 KB)

 

An Overview of Common Interconnects for Commodity Clusters

Doug Johnson, Ohio Supercomputer Center

There is a wide variety of interconnects for commodity clusters, and choosing the appropriate network when constructing a new cluster can be daunting. This presentation gives a hardware overview of the more common interconnects available, their performance, and a comparison of the software available for the hardware.

Presentation (pdf, 300 KB)

 

A Systematic Approach to Tuning Software

Sverre Jarp, European Organization for Nuclear Research

In this talk, we will look at performance optimization and bottleneck identification. In order to optimize an application, one needs to understand the "phase space" defined by the hardware and the external software. One also needs to understand the application itself: the algorithms used and the overall impact on the hardware platform. Furthermore, one needs to know which hardware/software tools are available for performance work.

This talk will therefore try to define a systematic and detailed approach in this field:

  • Definition of the hardware/compiler phase space:
    • CPU specifications (frequency, microarchitectural features), multi-core designs, cache sizes, bus speeds, chip sets, I/O rates, etc.
    • Compilers (versions, features, flags, etc.) and their interaction with the application software, algorithms, programming style, etc.
  • Review of performance tools (hardware/software)
  • Illustration of measurements inside our phase-space with a few applications, ideally in three forms:
    • Software kernels (testing only one feature at a time)
    • Well-known physics benchmark jobs (typically with emphasis on one physics feature, such as tracking in detector geometries, etc.)
    • Full-blown applications (e.g. a physics simulation framework, etc.)

The talk will hopefully provide some answers, and also give the audience enough "ammunition" to get started on their own.

Presentation (pdf, 595 KB)

 

OpenMP: Past, Present, and Future

Timothy Mattson, Intel

As the industry moves to multi-core processors, multi-threaded software will be essential. OpenMP is the industry standard API for writing multi-threaded software. It is focused on the needs of applications programmers and attempts to make it relatively simple to write parallel software. In this talk, we will discuss the history of OpenMP, some of the more innovative ways it's being used today, and OpenMP innovations you can expect to see in the future.

Presentation (pdf, 531 KB)

 

NFS Performance

Peter Chubb, University of New South Wales

There have been many complaints about NFS performance on the Linux kernel mailing lists when it is compared with performance on IRIX or Solaris. Is it *really* so bad? And, what can be done to fix the problem? Over the Southern Summer, Gelato@UNSW has been trying to find out. We currently have tools to capture traces from real systems, anonymize them (so that real users don't mind if we grab information), and replay at a higher rate. In doing so, we have discovered, firstly, that there are problems; secondly, that there is a degree of regularity in most traces that can be exploited to improve NFS performance generally. This is a work-in-progress talk; we expect to have more results by the time of this conference.

Presentation (pdf, 1.3 MB)

 

Scalability Mini-Track Wrap Up

Lee Schermerhorn, HP

In each of the scalability presentations, we will try to leave time for questions and answers. However, we expect/hope that attendees will have additional scalability questions, issues, or topics not directly related to the presentations. The scalability wrap up session will provide an opportunity to discuss general scalability topics and areas for further investigation and collaboration to measure and improve the scalability of Linux on Itanium platforms. To this end, we encourage attendees to share any scalability or general performance concerns, war stories ("wins" are good, too!), unsolved mysteries, work in progress, etc., including a couple of slides/graphs if you think that would be helpful to illustrate the issue.

Notes (pdf, 20 KB)

 

Numerical Computation Tools for Itanium

Matthieu Delahaye, Gelato Central Operations

Shailesh Patel, Gelato Central Operations

If you are developing new algorithms (signal processing, voice encoding, etc.), analyzing and visualizing data, or simply performing scientific and numerical computations like matrix operations, several tools are available today to help you. These applications usually manipulate large amounts of data and perform CPU-intensive operations, which makes the Itanium processor a suitable platform. We will explore the various solutions (MATLAB, Octave, and Scilab) and the functionality they offer, then present the results of our informal benchmarking/speed comparison tests and discuss the planned evolution.

Presentation (pdf, 2.9 MB)

 

Completing a Successful Migration

Jeff Donsbach, HP

This session covers tips on evaluating, locating, and resolving migration problems before they happen. Information on additional resources for Linux solution migrations will also be presented.

Presentation (pdf, 689 KB)

 

The Itanium Vector Math Library (VML)

Clemens C. J. Roothaan, Gelato Honorary Member

The VML project was conceived in 1990 in parallel with the development of the Itanium. To match the ambitious design of Itanium, all mathematical and computational procedures relevant for the functions at hand were re-examined. Powerful new methods were developed to determine (1) Chebyshev expansions of a function by a straightforward transformation of its Maclaurin expansion, and (2) a Remez expansion from a Chebyshev expansion by a quadratically convergent iterative process.
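For readers unfamiliar with Chebyshev expansions, the sketch below builds one for the exponential on [-1, 1] using the textbook approach of interpolating at Chebyshev nodes, and evaluates it over a vector of arguments with the Clenshaw recurrence. This is a generic illustration: it is neither the Maclaurin-based transformation nor the Remez refinement described above, and it is not VML code. All function names are assumptions.

```python
import math


def chebyshev_coefficients(f, degree):
    """Coefficients c_0..c_degree of the Chebyshev interpolant of f on [-1, 1]."""
    n = degree + 1
    # Interpolation nodes: the n roots of the Chebyshev polynomial T_n.
    angles = [math.pi * (k + 0.5) / n for k in range(n)]
    values = [f(math.cos(t)) for t in angles]
    coeffs = [
        2.0 / n * sum(v * math.cos(j * t) for v, t in zip(values, angles))
        for j in range(n)
    ]
    coeffs[0] /= 2.0  # the constant term carries half the weight
    return coeffs


def clenshaw(coeffs, x):
    """Evaluate sum_j coeffs[j] * T_j(x) with the Clenshaw recurrence."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * x * b1 - b2 + c, b1
    return x * b1 - b2 + coeffs[0]


def vector_exp(xs, coeffs):
    """Vector in, vector out: approximate exp for each argument in [-1, 1]."""
    return [clenshaw(coeffs, x) for x in xs]


if __name__ == "__main__":
    coeffs = chebyshev_coefficients(math.exp, 12)
    xs = [-1.0, -0.5, 0.0, 0.3, 1.0]
    worst = max(abs(a - math.exp(x)) for a, x in zip(vector_exp(xs, coeffs), xs))
    print(worst)
```

A production routine would also reduce arbitrary arguments to the approximation interval and handle special inputs; both are omitted here.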

The functions implemented are (1) reciprocal, division, square root, and reciprocal square root; (2) exponentials, logarithms, and the power function; (3) trigonometric functions and their inverses; (4) hyperbolic functions and their inverses. Each VML code is a subroutine that yields a vector of results from a vector of arguments. For corner cases due to pathological input arguments, the VML codes deliver the expected results directly, thereby avoiding error detection by hardware and the subsequent elaborate and costly error processing. Version 1 of VML, comprising 56 functions, is available in the public domain. For 55 of these functions, the floating-point performance of the inner loop is 100% saturated; the one exception is the single-precision logarithm, which has one floating-point vacancy in its 12-cycle inner loop.

The design and implementation strategies of the VML programming model will be shown in detail by a representative example. An earlier presentation entitled "Exploiting the Power of Itanium" on 7/31/2003 in Amsterdam is available at http://www.sara.nl/news/2003/20030813/lecture_roothaan_eng.html.

Spreadsheets (zip, 238 KB)

 

Update on the Perfmon2 Interface

Stéphane Eranian, HP

In this short presentation, we will update the audience about the progress of the perfmon2 interface. What are the latest features on Itanium and other architectures? We will cover the user level tools and Montecito support, and will report on the progress on getting our implementation accepted in the mainline kernel for all major platforms.

Presentation (pdf, 405 KB)