Presentations—Gelato ICE | San Jose | April 2007

Monday, April 16, 2007

 

Opening Session

Mark K. Smith, Gelato Central Operations

Presentation (pdf, 1.2 MB)

 

Enterprise Linux Workloads on Itanium-Based Servers

Jean Bozman, IDC

Linux usage models continue to evolve in the enterprise space as the Linux operating system matures and becomes more scalable. Over time, more of the enterprise software and middleware long associated with the data center have become available for enterprise Linux server deployments. This session will discuss workloads that run on Linux on Itanium-based servers, including IDC worldwide server data based on customer-based studies of hundreds of IT sites. Top categories of workloads for Linux enterprise servers will be discussed, including database; decision-support; line-of-business applications (ERP, CRM, HR); scientific/technical; and collaborative workloads--in addition to the IT infrastructure and Web infrastructure workloads for which Linux servers are widely known. Technical and business drivers for this usage model will be discussed. Key takeaways will include reasons why Linux and Itanium-based servers are deployed in the enterprise, along with the challenges and opportunities associated with those deployments.

Presentation (pdf, 692 KB)

 

Gelato Federation Lifetime Achievement Award to Clemens Roothaan

Bill S. Worley, Secure64

Clemens Roothaan has made significant contributions to scientific computing throughout his lifetime, culminating in important contributions to the Itanium architecture and Itanium software. We honored Clemens with the Gelato Lifetime Achievement Award during this presentation by Bill Worley. Also included is a file summarizing Clemens's vector math library performance and accuracy tabulations.

Speech (pdf, 23 KB)

HP VML Precision (pdf, 11 KB)

 

Introduction to Virtualization on Itanium Architecture

Peter Chubb, University of New South Wales

César De Rose, Pontifical Catholic University of Rio Grande do Sul

This brief introduction to the ICE Virtualization Track will pose the key questions that, we hope, will be answered by the remaining talks in the track.

No presentation available currently

 

Introduction to GCC Improvements

Shin-Ming Liu, HP

No presentation available currently

 

ISP-RAS Projects on Improving GCC for Intel Itanium Architecture

Arutyun I. Avetisyan, Institute for System Programming, Russian Academy of Science

Andrey Belevantsev, Institute for System Programming, Russian Academy of Science

This talk will describe the work of ISP-RAS on improving GCC for Intel Itanium processors. We will focus on the project of implementing a new aggressive VLIW-targeted instruction scheduler. The basic functionality is implemented and available on the "sel-sche" branch in the GCC repository. We will describe the current work on performance tuning of the scheduler. Other ongoing works and future plans on improving GCC will also be discussed.

Presentation (pdf, 233 KB)

 

LRZ's Recent Altix 4700 Installation

Iris Christadler, Leibniz Computing Centre

The Leibniz Computing Centre (Leibniz-Rechenzentrum, LRZ) is located in Munich, Germany, and is one of three national supercomputing centers in Germany. Its typical users are researchers from universities all over Germany, from all areas of research.

LRZ's flagship, a 4,096-processor Intel Itanium Madison 9M SGI Altix 4700 with a peak performance of 26.2 TFlop/s, was installed in 2006 and will be upgraded to 9,728 Intel Montecito cores beginning in March 2007. The system runs SLES 10 and uses the PBS Pro batch scheduling system.

Although the system is divided into 16 shared memory partitions with 512 cores each, users may run multi-partition jobs and access up to 3,825 cores at once in everyday use. The application performance of the machine is around 10% of the peak performance; for a Lattice Boltzmann simulation, a sustained performance of 10 TFlop/s was observed for the first time.
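
The quoted peak figure can be sanity-checked with simple arithmetic. The sketch below assumes a 1.6 GHz clock and 4 floating-point operations per cycle per processor (two fused multiply-adds); neither number appears in the abstract, so both should be read as assumptions rather than as LRZ's own specification.

```python
# Back-of-the-envelope check of the peak and typical figures quoted above.
# ASSUMPTIONS (not stated in the abstract): 1.6 GHz clock, 4 flops per cycle.
PROCESSORS = 4096
CLOCK_HZ = 1.6e9        # assumed clock rate
FLOPS_PER_CYCLE = 4     # assumed: two fused multiply-adds per cycle

peak_tflops = PROCESSORS * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
typical_tflops = 0.10 * peak_tflops  # "around 10% of the peak performance"

print(f"peak    ~ {peak_tflops:.1f} TFlop/s")
print(f"typical ~ {typical_tflops:.1f} TFlop/s")
```

Under these assumptions the result agrees with the 26.2 TFlop/s figure in the abstract.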

This talk will focus on the experience gained when moving from the former pseudo-vector architecture Hitachi SR8000 to the SGI Altix.

Presentation (pdf, 13 MB)

 

Kernel-Based Virtual Machine (KVM) for IPF

Fenghua Yu, Intel

Kernel-based Virtual Machine (KVM) is a host-based VM that was quickly adopted into the Linux 2.6.20 kernel. Currently, KVM supports only Intel IA-32 VT-x and AMD SVM processor virtualization technology.

We are adding support for Intel Virtualization Technology for Itanium (VT-i) to KVM. We will discuss the technologies behind KVM on IPF, including the basic KVM work flow, virtualization interruption handling, instruction emulation, MMU virtualization, and guest firmware.

Presentation (pdf, 1 MB)

 

S7 Case Study: Porting Fidelity's GT.M, A High-Performance DBMS, to Itanium

Pankaj Kulkarni, S7 Solutions

GT.M is a high-throughput transaction processing database application development system widely used in banking and finance. It includes an optimized compiled implementation of ISO standard M. GT.M also provides a full complement of M tools for creating, link-loading, and debugging source code. This case study describes S7's experience with porting approximately 2 million lines of C and assembly to HP-UX and Linux on IA-64, covering: hand-written assembly conversions (instruction bundling, scheduling, speculation, etc.), dynamic machine code generation, Itanium Runtime Architecture migrations, and 32-bit to 64-bit porting issues.

Presentation (pdf, 5.6 MB)

 

Virtualizing the Performance Monitoring Unit

Stéphane Eranian, HP

The performance monitoring unit (PMU) is a crucial piece of CPU hardware used to analyze system and application performance problems. Linux on IA-64 implements the perfmon2 monitoring interface which exposes the PMU to user applications. Several open-source and commercial tools are available on top of this interface.

In the last few months, virtualization has come to the forefront as a promising new technology. On Itanium, as on other architectures, it is possible to implement virtualization purely in software or using hardware support with VT-i, introduced by the Dual-Core Intel Itanium 2 processor.

In this presentation, we will describe why it is important to maintain PMU access in virtualized environments for both guest operating systems and the virtual machine monitor. We will cover the requirements for both paravirtualized and fully virtualized environments.

Presentation (pdf, 166 KB)

 

CERN Snippets Dissected

Jose Dana, European Organization for Nuclear Research

CERN and the High Energy Physics community have written millions of lines of C++ code in order to unravel the mysteries of the universe. At CERN openlab, we have extracted some of the most popular methods (inside a class or a set of related classes) and put a test harness around each of them. This allows us, in seconds, to understand how well a given compiler optimizes the code at hand. The analysis is done either via timing or code inspection, which on an in-order processor is quite accurate.

This talk will cover the general methodology and the latest results obtained with GCC 4.3.0. The results will be compared to earlier GCC versions as well as to other compilers available on the Itanium-based platform.

Presentation (pdf, 134 KB)

 

64-Bit Migration to Linux on Itanium Architecture

Jonathan Ward, HP

The objective of this talk is to bring forth the benefits and challenges involved in the 64-bit migration path, in general as well as specific to Linux on Itanium. This presentation will provide insights into the tools and techniques available for aiding 64-bit migration, while also sharing some tips with the programming community on how to avoid common pitfalls.
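
As an illustration of the kind of pitfall such a migration can expose (the abstract does not list specific pitfalls, so this choice is an assumption), consider code that stores a pointer in a 32-bit integer. The sketch below uses Python's `ctypes` fixed-width types to mimic C integer widths; the address value is made up.

```python
import ctypes

# Hypothetical 64-bit address, made up for illustration.
address = 0x6000_0000_0001_2340

# Storing a pointer in a 32-bit integer keeps only the low 32 bits.
as_int32 = ctypes.c_int32(address).value
# A 64-bit (pointer-sized) integer preserves the full value.
as_int64 = ctypes.c_int64(address).value

print("32-bit copy:", hex(as_int32))
print("64-bit copy:", hex(as_int64))
print("survives the 32-bit copy:", as_int32 == address)
```

In C, the usual remedy is to hold such values in a pointer-sized type such as `intptr_t` or `uintptr_t` rather than `int`.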

Presentation (pdf, 270 KB)

 

uProfiler: A Concurrent Profiler for a Concurrent C++

Peter Buhr, University of Waterloo

Justyna Gidzinski, University of Waterloo

Establishing correctness and attaining efficiency with maximal parallelism in a concurrent program is difficult. To aid in the development and understanding of concurrent programs, we are building the uProfiler to display information about the dynamic behavior of concurrent uC++ programs on multi-core and multi-processor Itanium-based systems. This talk presents an overview of the current uProfiler metrics and then focuses on two specific tools that have been recently enhanced. First, the Execution State Transition metric presents the detailed scheduling and blocking behavior of all tasks in the system. Second, the Routine Call Graph metric displays time and hardware event counts broken down by task and function using either statistical or exact monitoring. Both of these metrics are highly scalable for long-running programs at high resolution of monitoring and provide innovative user interfaces to help developers find performance problems. The uProfiler has performance and features that are comparable to other Linux/Itanium profilers, such as HP Caliper and Intel VTune.

Presentation (pdf, 313 KB)

 

Update on the Gelato GCC Build Farm

Matthieu Delahaye, Gelato Central Operations

This talk will present the Gelato GCC Build Farm: what it is and is not doing, and how to access results or submit new benchmarks. An open discussion on future feature additions will follow.

No presentation available currently

 

Xen/ia64 Progress and Status

Joseph Szczypek, HP

The Xen/ia64 project has made tremendous progress in the past year, transitioning from a research project into a viable virtualization solution for IA-64. Xen/ia64 now includes most of the features and functionality found in Xen/x86. This includes support for unmodified OSes, using Intel VT-i extensions, as well as paravirtualized support of Linux guests.

This talk will present new features available in Xen/ia64 and some of the significant milestones the project has achieved to make this happen. This talk will also present observations gained while running AIM7 workloads with Xen/ia64, as well as usability and stability observations.

Presentation (pdf, 300 KB)

 

Optimizing Software with Intel Compilers

Eric W. Moore, Intel

This presentation will cover "required" compiler knowledge for accelerating software on Intel Itanium architecture.

Presentation (pdf, 293 KB)

 

Update on Prefetching Work

Zdenek Dvorak, SUSE

This talk will describe the automatic array prefetching pass of GCC. The emphasis of the presentation will be on the design and impact of the new features of the loop array prefetching that should appear in 4.3, including:
- cross-loop reuse analysis and cache modeling;
- non-temporal store generation and usage of non-temporal prefetch instructions; and
- loop transformations to expose more cache reuse.

Presentation (pdf, 250 KB)

 

Xen and the Art of Portability: Porting Xen to the SGI Altix

Jes Sorensen, SGI

What happens when one wants to port Xen to a non-IA-32 architecture, notably when the architecture doesn't look like a PC with a different processor?

Large NUMA systems have very sparse physical memory layouts, no physical memory within the lower 4GB window, and no guarantee of always having a chunk of physical memory in a specific location. To perform well on these systems, Xen needs to support NUMA-aware resource allocation, kernel relocation, and so on.

This talk will look at where Xen is today on non-IA-32 architectures, in particular focusing on IA-64, and cover some of the tales from the trenches. What can we learn from the work that went into bringing Linux onto more architectures than any other operating system and what was forgotten? What are the types of porting problems with which we are dealing? And, as Xen on non-IA-32 is maturing, how long will it be before it is "production ready," when speaking in buzzword compliant terms?

Presentation (pdf, 345 KB)

 

Determining Chebyshev and Remez Expansions of Transcendental Functions

Clemens C. J. Roothaan, Gelato Honorary Member

It is well known that Chebyshev expansions are more efficient than the Taylor expansion for calculating transcendental functions. For a practical Chebyshev calculation, the accuracy that can be guaranteed depends on two parameters, namely the size of the largest argument and the highest power of the expansion. Hence, it is desirable to have at our disposal for a given function a variety of Chebyshev expansions to accommodate different calculational requirements. The traditional method for determining explicit Chebyshev expansions entails a separate calculation for each pair of parameters, at higher than target precision. The new method proposed here yields the Chebyshev expansions for an entire rectangular grid of parameters in a single process, at target precision. Starting from these "standard" Chebyshev expansions, we also defined and calculated a set of "constrained" Chebyshev expansions, which yield exact results for x=0. A full implementation of both sets for IEEE double precision elementary functions has been carried out on Excel spreadsheets. Finally, we also determined analogous "standard" and "constrained" Remez expansions, which are of course the ultimate true "minimax" expansions.
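
The opening claim, that a Chebyshev expansion beats a Taylor expansion of the same degree, can be illustrated with a small experiment. The sketch below is not the method proposed in the talk: it uses textbook interpolation at Chebyshev nodes for exp(x) on [-1, 1], with the degree (5) and the function chosen arbitrarily for illustration.

```python
import math

def chebyshev_coefficients(func, degree):
    """Coefficients c_0..c_degree of the Chebyshev interpolant of func on [-1, 1]."""
    n = degree + 1
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    values = [func(x) for x in nodes]
    coeffs = []
    for j in range(n):
        s = sum(values[k] * math.cos(math.pi * j * (k + 0.5) / n) for k in range(n))
        coeffs.append(2.0 * s / n)
    coeffs[0] /= 2.0  # the T_0 term carries half weight
    return coeffs

def clenshaw(coeffs, x):
    """Evaluate sum_j coeffs[j] * T_j(x) using the Clenshaw recurrence."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = c + 2.0 * x * b1 - b2, b1
    return coeffs[0] + x * b1 - b2

def taylor_exp(x, degree):
    """Taylor polynomial of exp about 0, truncated at the given degree."""
    return sum(x ** k / math.factorial(k) for k in range(degree + 1))

DEGREE = 5  # arbitrary choice for illustration
coeffs = chebyshev_coefficients(math.exp, DEGREE)
grid = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
cheb_err = max(abs(clenshaw(coeffs, x) - math.exp(x)) for x in grid)
taylor_err = max(abs(taylor_exp(x, DEGREE) - math.exp(x)) for x in grid)

print(f"max error, Chebyshev degree {DEGREE}: {cheb_err:.2e}")
print(f"max error, Taylor    degree {DEGREE}: {taylor_err:.2e}")
```

The Taylor polynomial is most accurate near 0 and worst at the interval ends, while the Chebyshev interpolant spreads its error across the whole interval, which is why its maximum error is smaller for the same degree.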

No presentation available currently

 

Locating Optimization Opportunities with VTune

Gary Carleton, Intel

An introduction to software performance analysis with the Intel VTune Performance Analyzer will be presented.

No presentation available currently

 

LinuxOnLinux: A UML Work-Alike for IA-64

Peter Chubb, University of New South Wales

One of the ways to make a hypervisor is to take an existing operating system, gut it of everything non-relevant, and then replace the system call ABI with something more appropriate. This is what Xen/ia64 has done, as have some other hypervisors.

But why not just use Linux as is? Linux already provides mechanisms for trapping illegal operations, for changing memory mappings, and for doing I/O. User-Mode-Linux has been a successful system for using Linux as a hypervisor on IA-32... But what of IA-64?

LinuxOnLinux is a user-space hypervisor that runs a slightly modified Linux kernel and an unmodified user-space. This talk will describe how it works, and the changes we have to make to Linux to make it perform adequately.


Presentation (pdf, 495 KB)

 

Update on Alias Analysis Work

Diego Alejandro Novillo, Red Hat/GCC

In this talk, I will describe the status of alias analysis in GCC. The talk will cover the new representation changes designed to reduce memory footprint, improvements in points-to analysis and the ongoing work to interface high-level alias information with RTL (Register Transfer Language, an intermediate representation).

Presentation (pdf, 563 KB)

 

Reading and Interpreting Stall Counters

Murali Vijayasundaram, HP

This presentation will cover the fundamentals of reading and interpreting stall counters on the Intel Itanium architecture.

Presentation (pdf, 75 KB)

 

The Significance of Multi-Core: The Intel Perspective

Xinmin Tian, Intel

Multi-cores such as the Dual-Core Intel Itanium processors facilitate efficient thread-level parallel execution of ordinary programs, wherein the different threads of execution are mapped onto different logical or physical cores across one or more processors. In this context, several techniques have been proposed for parallelization of programs. In this talk, we will present:
1. An overview of multithreading features in Dual-Core Intel Itanium processors.
2. Revamping the high-performance optimizer for multi-core architectures.
3. Thread-level speculation: Intel's findings based on SPEC CPU2006, accounting for real-life constraints.
4. Unleashing the power of Intel's multi-core processors using Intel compilers and threading tools.

No presentation available currently

 

Multi-Core Programming & Research: The Future of Itanium Architecture

Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

The computer industry is at a stage of divergence in processor design. In the next few years, we will see heterogeneous multi-cores, homogeneous multi-cores, SPE-style accelerators, GPGPU-style accelerators, ASIC-style accelerators, and FPGA-style accelerators. In this talk, I will give an overview of the economic forces and technical considerations that will likely drive the direction of research in many-core programming systems in the next decade. The compiler and software development tools should be equipped with much more advanced bottom-up analysis capabilities and programmer assertions than what they have today in order to have a deep, comprehensive understanding of the real execution constraints of the input program. Such understanding is then used to drive automatic or interactive parallel code generation tools for the diverse set of machine-level programming models required by hardware platforms. I will give an incomplete survey of the field and present early indications that such a model for parallel software development is both achievable and desirable.

No presentation available currently

Tuesday, April 17, 2007

 

Itanium Processor Vision and Roadmap

James D. Fister, Intel

The growth of the Itanium processor continues at a steady pace, and the importance of open-source software is also increasing. This session will discuss the vision and direction for the High Availability platform segment, the role of Itanium processors in the platform, the opportunity for open-source software, and the direction the industry should take in developing hardware and software for the marketplace.

Presentation (pdf, 1.5 MB)

 

Update on Superblock Work

Robert Kidd, University of Illinois at Urbana-Champaign

As demonstrated in the IMPACT compiler, performing Superblock formation prior to high-level optimization improves the compiler's ability to specialize code. At the same time, Superblock formation creates large straight line code segments that improve the chances of finding independent instructions, increasing the available instruction-level parallelism (ILP). In IMPACT, this optimization strategy is referred to as structural compilation.

This talk discusses modifications to GCC to move toward a structural compilation model. The first part of this work was to move Superblock formation to GCC's Tree-SSA representation. I will review that work and present the most recent status. I will propose future directions for this work based on experience with IMPACT.

Presentation (pdf, 1.6 MB)

 

An Overview of OpenVZ Virtualization Technology

Kir Kolyshkin, OpenVZ

This talk will present the architecture, implementation, and challenges involved in developing the complete OS virtualization implementation on Linux for Itanium architecture. We will cover the isolation, resource control, and virtualization of various kernel structures as well as the live migration of Virtual Environments between physical servers. We'll also look at some of the most important advantages of OpenVZ for Itanium architecture users, including near-native performance (no overhead) and real-time resource reallocation, and how those capabilities are leveraged in the data center.

Presentation (pdf, 550 KB)

 

HPCPI/Xtools Performance Analysis Toolset

David C. P. LaFrance-Linden, HP

HPCPI and Xtools form a separable and cooperative toolset to gain insight into the performance characteristics of applications, systems and clusters.

HPCPI is a sample-based profiler. With some features in common with VTune, Caliper, and Oprofile, HPCPI provides three unique features: (1) it can monitor an arbitrary set of hardware events, automatically placing events in multiplexed groups; (2) the sample collection has been pushed into the lowest levels of the interrupt mechanism, taking less than 20% of the time of other profilers; and (3) it concurrently collects samples while providing event counts to Xtools, allowing both tools to be used simultaneously.

Xtools displays performance-related metrics of clusters and individual nodes. Xclus shows a variety of CPU, memory and I/O utilizations, allowing users to identify imbalances and hotspots. xperf shows a variety of metrics of a particular node, such as instructions per cycle, execution vs. stall breakdowns, and cache penalties, each in history graphs per CPU. This allows users to understand the performance characteristics of applications. xperf can also interact with HPCPI, useful for understanding "where is it doing it" in addition to the "what is it doing."

Presentation (pdf, 252 KB)

Video 1 (avi, 10 MB)

Video 2 (avi, 23 MB)

 

ISP-RAS Activities in Open-Source Software

Arutyun I. Avetisyan, Institute for System Programming, Russian Academy of Science

ISP RAS together with the Russian Federal Agency for Science and Innovations established the Linux Verification Center. The flagship activity of the Center in 2005-2006 was the Open Linux Verification (OLVER) project, which was targeted at formalizing the LSB Core standard and developing a corresponding test suite. In the context of this project, the Center closely collaborated with the Free Standards Group (now the Linux Foundation). This collaboration has transformed into a partnership for joint development of the new LSB infrastructure, activities for strengthening the LSB standard, and improving LSB test coverage. The Itanium-based platform is used at the Center as one of the primary targets for our deliverables.

Furthermore, the ISP RAS compiler team works on improving the GNU Compiler Collection (GCC) for Itanium architecture. An open-source project for defect detection (security vulnerabilities, memory leaks, and other errors) in C/C++ source code will also be presented.

Presentation (pdf, 295 KB)

 

GCC and Osprey Update

Shin-Ming Liu, HP

The Itanium processor is moving toward a multi-core, multi-threaded design, while most existing applications are written for single CPU systems. To extract the best performance out of Linux on the Itanium-based platform, a higher performance compiler tuned for the Itanium processor is essential. The GCC compiler is currently undergoing a major enhancement to deliver this performance, as the compiler of choice for all Linux developers. To meet the performance needs of Linux on the Itanium-based platform in the immediate future, the Osprey Project is a very viable solution. The Osprey Project weds the need for high-performance applications with GCC.

The Osprey Project gathers contributions from multiple universities. It is based on the Open64 compiler with enhancements made by Pathscale and ORC. The compiler is available for Internet download in both source and binary form from www.open64.net. The official Open64 website "www.open64.net" also supports compiler research and development contributions by research institutes and the open-source community. In this talk, we are going to share the latest progress on this project.

No presentation available currently

 

HP Integrity Virtual Machines Technology Overview

Todd Kjos, HP

HP Integrity Virtual Machines (Integrity VM) is a full virtualization solution providing virtual machines with shared CPU and I/O resources. A virtual machine (VM) runs its own operating system instance, and separate VMs on the same physical server can run different operating systems and versions including HP-UX, Linux, and Windows, with other OSs under investigation. This talk will give an overview of Integrity VM, including the virtualization and sharing of CPU and I/O resources, intra-VM security, management of a virtualized system, as well as storage options. The presentation will show how Integrity VM allows customers to increase their server utilization by running more applications on a server, while maintaining application fault and security isolation.

No presentation available currently

 

Extending OpenMP Applications to Clusters Using Intel Cluster OpenMP

Lawrence F. Meadows, Intel

Intel has developed a product called Cluster OpenMP that allows slightly modified OpenMP programs to run on non-shared-memory clusters of 64-bit Intel Architecture processors. This talk will introduce Cluster OpenMP and the modifications required to an OpenMP program, discuss the tools available for porting, debugging, and performance analysis, and present some performance data for Itanium-based clusters.

Presentation (pdf, 465 KB)

 

Service-Oriented Programming: Going Beyond SOA & Bringing Moore's Law to Software Development

Sandy Zylka, NextAxiom

NextAxiom provides a highly differentiated software development platform that unifies several disruptive computing paradigms: service-oriented architecture (SOA), multi-core hardware advancements, and grid virtualization. With the NextAxiom platform, application components are developed using a service-oriented programming (SOP) technique which brings Moore's Law to the cost/performance of business and integration software. With SOP, software components are developed semantically as services, which are themselves strictly built on top of other services. These extraverted components integrate from the inside-out and are automatically multi-threaded and virtualized at a molecular level across multiple cores, processors, and servers at runtime. Intel Itanium 2-based systems provide an ideal architecture for the multi-core performance scaling that is automated by the NextAxiom Service Runtime Environment.

No presentation available currently

 

CPU2006 on IPF

Gerolf F. Hoflehner, Intel

In August 2006, SPEC released CPU2006, its latest CPU benchmark suite. This talk will present a performance characterization of the new applications and—for comparison—of the retired CPU2000 suite on dual-core Itanium 2 systems for different compilers. The benchmarks were compiled with the Intel V9.1 Itanium compiler and GCC 4.1 at various optimization levels. Using PMU counter data, a large set of processor parameters such as data and instruction cache misses, branch prediction, and data and instruction TLB misses were analyzed. As an outcome, the talk will give insight into the performance bottlenecks and the impact of compiler optimizations on benchmark performance for CPU2006.


Presentation (pdf, 100 KB)

 

Compiling Debian Using GCC 4.2 and Osprey

Martin Michlmayr, Debian

This talk will describe how Debian has been used as a real-world test suite for the GCC 4.2 and Osprey compilers. Debian is one of the largest Linux distributions and consists of over 6000 packages containing code that needs to be compiled. Through this testing effort, around 20 bugs in GCC and 60 bugs in Osprey have been identified and reported, many of which were related to Itanium architecture support in the compiler.

Presentation (pdf, 57 KB)

 

BladeSymphony with Virtage

Paul Figliozzi, Hitachi

Mr. Figliozzi will provide a basic overview of Hitachi's BladeSymphony 1000 Enterprise blade server system with the Virtage embedded virtualization feature. Mr. Figliozzi will specifically emphasize the interaction of the BladeSymphony multi-blade SMP feature, Itanium Montecito processor blades, and Virtage logical partitioning collectively applied to solving real world business challenges. A grid computing application scenario will be discussed, where the aforementioned technologies were applied toward optimizing performance and balancing workloads in building and operating a parallel distributed source code compilation engine.

Presentation (pdf, 6.7 MB)

 

Deploying Linux/IA-64 in the Telecom Market

Khalid Aziz, HP

As telecommunications equipment manufacturers move to open source for building their solutions, they are increasingly looking at Linux as the foundation for the solution. Linux on IA-64 presents a highly reliable, high-performance platform for Telco applications. This talk will discuss the characteristics of Linux deployments in telecommunication applications and explore a specific deployment in more detail. We will discuss industry efforts to extend Linux functionality to address telecommunications needs and how HP helped a customer build a Telco solution using Linux running on HP Integrity servers.

Presentation (pdf, 115 KB)

 

Interprocedural Optimization Framework

Jan Hubicka, SUSE

The interprocedural optimization framework has been reorganized to work on SSA form for GCC 4.3. This enables significantly better and cheaper function analysis at the interprocedural level. Simple improvements such as optimizing during inlining and improving inlining heuristics lead to noticeable speedup in C++ test cases having large abstraction penalties, which are very common in newer scientific code. In the talk, we will review the changes so far and discuss future planned improvements.

Presentation (pdf, 1.3 MB)

 

Detecting and Solving Linux Application Memory Problems

Tom Archambault, Etnus

Memory problems such as leaks, array bounds violations, and memory-related race conditions can lurk even in mature applications. This talk will discuss some of the challenges of memory problems and show how Etnus' memory debugging technologies can help software development organizations build more stable software on the Linux platform. This talk will show participants how they can easily identify existing memory problems, present strategies for analyzing the root cause, and offer suggestions on how memory testing can be introduced into quality assurance procedures. Etnus provides the TotalView debugger specifically for Linux environments, including those running on IA-64 processors. Advanced visual analysis, collaboration features such as dynamic HTML-based reports, and powerful scripting languages give distributed teams just what they need to work effectively on memory problems.

No presentation available currently

Basic Intel Itanium Architecture

Cameron McNairy, Intel

Eric W. Moore, Intel

The Itanium architecture and the paradigm of explicit parallel instruction computing (EPIC) are often poorly understood. This presentation will cover important aspects of the EPIC paradigm, including software pipelining, the register stack engine, predication, parallel instruction groups, data and control speculation, and many other mysteries of the Itanium-based application and system architectures.

Presentation (pdf, 333 KB)

 

Discussion: Synchronization and Memory Ordering on Intel Itanium Architecture

Cameron McNairy, Intel

This session will be a discussion of synchronization and memory ordering for Itanium architecture.

No presentation available currently

 

Update on LTO

Kenneth Zadeck, NaturalBridge, Inc.

The GCC community has started development of link-time optimization (LTO) within the framework of the existing compiler structure. This talk will describe the objectives, the initial design, and the progress on the implementation.

Presentation (pdf, 131 KB)

 

Machine Check Architecture

Russ Anderson, SGI

The Itanium processor has a machine check architecture (MCA) to provide error reporting and recovery from hardware errors detected by the processor or chipset. This presentation will give an overview of the MCA foundation and focus on Linux kernel recovery from memory and cache uncorrectable errors encountered by applications.

Presentation (pdf, 468 KB)

 

Work in Progress: NFS Research at UNSW

Peter Chubb, University of New South Wales

In San Jose last year, I presented some very preliminary traces of real NFS traffic, showing the kinds of performance problems that can occur in real-world situations. I also implored people for access to more traces.

Since then, we've been working mostly to improve the ability to capture traffic. Instead of the old NFSdump program, which had problems with TCP traffic streams and also with anything that caused packet fragmentation, we've put a plugin into Wireshark. We're working on a new, more robust, replay engine. And we're also exploring client-side scalability.

I can't say now where we'll be up to in April; you'll have to come to the talk to find out!

Presentation (pdf, 190 KB)

 

Caliper: HP's Performance Tool

Murali Vijayasundaram, HP

Stephen Williams, HP

This talk will present some of the advanced new features introduced in the latest release of HP Caliper, HP's premier performance analysis and tuning tool for Itanium-based systems.

HP Caliper runs on multi-vendor Linux/Itanium-based systems (as well as HP's HP-UX Integrity systems) and can be used through a full-featured and intuitive (Eclipse-based) GUI as well as a command-line interface. Utilizing the Itanium processor's PMU (Performance Monitoring Unit) hardware, HP Caliper performs lightweight, non-intrusive measurement of a large set of performance metrics on optimized, production applications written in C/C++/FORTRAN/Java/assembly.

Besides the overview of the tool, this talk will cover the new call stack profiler, which captures and graphically displays complete program call chains accounting for much of the program run time, and a new traps/faults/interrupts profiler, which uses advanced capabilities of the Montecito PMU to display the impact of behind-the-scenes traps (e.g., unaligned memory references) on the application run time. Caliper's graphical user interface will also be used to demonstrate the use of the above measurements in applications.

Presentation (pdf, 585 KB)

 

On-The-Fly TLB Generation to Realize Variable Page Size Support

Christoph Lameter, SGI

Variable page size support has been a challenge for Linux on IA-64. The short VHPT format is used, which allows one page size for each region supported by the processor. In many cases, we are restricted to a 16KB page size. However, IA-64 has programmable exception handlers that can manufacture and install custom TLB entries for regions that do not use TLB lookups via the VHPT hardware walker.

It turns out that there is something similar in use for Region 6. The TLB generator there is used for 1-1 mappings of the kernel data area. One TLB entry is generated for each 16MB of memory. One could install a small "TLB interpreter" in the fault handler that checks unused address bits to generate TLB entries in a flexible way. We show such a scheme to allow TLB generation with a custom page size of up to 1GB.

These variable page size mappings are then used to provide a more efficient mapping for the memmap structure. The memmap structure is placed in the kernel virtual memory area, which uses a 16KB page size. Increasing the page size to 16MB allows the use of a single TLB entry for each node's memmap and improves overall performance.
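
As a rough illustration of the saving described above, the following Python sketch counts how many TLB entries are needed to cover a memmap at each page size. The 14 MB memmap size is a hypothetical figure chosen for the example, not a number from the talk, and `tlb_entries_needed` is an illustrative helper rather than anything in the kernel.

```python
KB = 1024
MB = 1024 * KB


def tlb_entries_needed(region_bytes: int, page_bytes: int) -> int:
    """Return how many pages (one TLB entry each) are needed to cover a region."""
    # Integer ceiling division: round up to a whole number of pages.
    return -(-region_bytes // page_bytes)


# Hypothetical per-node memmap size, chosen only for illustration.
memmap_bytes = 14 * MB

for page_bytes in (16 * KB, 16 * MB):
    entries = tlb_entries_needed(memmap_bytes, page_bytes)
    print(f"page size {page_bytes // KB} KB: {entries} TLB entries")
```

Under these assumptions the 16KB mapping needs 896 entries, while the 16MB mapping needs one.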

Presentation (pdf, 322 KB)

 

Numerical Experiences with High-Speed Linear Solvers on Itanium 2-Based Computers

Hugo Daniel Scolnik, University of Buenos Aires

Our research is focused on developing highly efficient parallelizable solvers for huge systems of linear equations that arise from finite element discretization of complex nonlinear engineering problems. Those problems are nonlinear and require many linearizations, and hence several days of CPU time on Itanium-based platforms. We developed a solver and tested it with different systems from different problem domains. We also developed new preconditioners to apply to these systems. In this presentation we will show the results obtained with the solver and the new preconditioners, as well as a comparison with traditional techniques.

Presentation (pdf, 420 KB)

 

Optimized Itanium Binaries Using GCC and Related Tools

Diego Alejandro Novillo, Redhat/GCC

This presentation will provide an overview of the different flags controlling optimization in GCC and some hints on how to use them to get the best performance out of your application. In particular, it will present some of the optimization features that will benefit Itanium-based applications and a set of GNU tools that can be used to analyze and tune the quality of the generated code.

Presentation (pdf, 341 KB)

 

Hyper-Threading on the Dual-Core Itanium 2 Processor

Rohit Bhatia, Intel

Hyper-threading on the Dual-Core Intel Itanium 2 processor provides what appears as two logical processors for each core. As a result, the benefits of hyper-threading are available to an application automatically. However, there are things that the application and operating system can do to optimize for hyper-threading. This presentation introduces hyper-threading and then transitions into what software can and should do to best realize performance.

Presentation (pdf, 887 KB)

 

Linux Kernel Development for Itanium Architecture

Andrew Morton, Linux 2.6 Kernel Maintainer

In this keynote, we will discuss the Linux kernel development processes from the point of view of the Itanium architecture community.

The kernel effort is of course dominated by x86 and x86_64 desktop and server influences, and this can at times compromise support for less popular and more specialized architectures. We will examine the forces at play here, potential areas of conflict, and the means by which these are being resolved.

We will also look at ways in which the Itanium kernel development and user community can more effectively work with and influence the wider kernel development team.

Presentation (pdf, 124 KB)

Wednesday, April 18, 2007

 

Itanium, A Popular Platform for Large Systems

Wim Coekaerts, Oracle

Why is Itanium so heavily used in Linux benchmarks? Itanium doesn't seem to have a big market share or reputation, yet most benchmarks done with Linux use Itanium-based systems. Why is that the case? This talk will look at what makes it such a popular platform for large systems.

No presentation available currently

 

Looking Ahead to Gelato ICE Singapore 2007

Jon Lau, National Grid Office

This presentation will highlight the next Gelato ICE: Itanium® Conference & Expo to be held September 30-October 3 in Singapore.

No presentation available currently

 

Parallel Programming Concepts

Gary Carleton, Intel

This presentation will refresh attendees' memories on parallel programming concepts, advantages, and pitfalls.

No presentation available currently

 

Future Challenges to Linux Scalability

Andrew Morton, Linux 2.6 Kernel Maintainer

In this presentation, we will examine the progress which Linux has made in recent years in its support for large systems. Some of the remaining problem areas will be identified and we will examine the work which is occurring to address them.

We will also discuss steps which the Itanium kernel development and user community can take to help accelerate this part of the kernel's evolution.

Presentation (pdf, 99 KB)

 

Solaris/SPARC Applications Running on Linux/Itanium-Based Systems

Ian Robinson, Transitive

Transitive has developed an application translation solution that allows Solaris/SPARC applications to run on Linux/Itanium-based systems without modification to source code or binaries. This session will cover accelerating application migration to Itanium-based platforms based on our experiences at enterprise customer sites. This session will also include a technical overview of Transitive's unique QuickTransit architecture, which allows migrated applications to run at near-native performance while avoiding the cost and delays of porting projects. Incidentally, this is the same core technology that is used every day by more than 4 million Apple Mac owners when they run PowerPC applications (such as Microsoft Office) on the new x86-based Macs.

When deployed in conjunction with virtualization solutions (such as Xen), QuickTransit makes it possible for customers to consolidate the workloads from multiple legacy SPARC servers onto a single powerful Itanium-based system, thereby enjoying considerable cost savings and eliminating legacy hardware risk.

Presentation (pdf, 2.5 MB)

 

Understanding Linux NUMA Memory Allocation Policies

Lee Schermerhorn, HP

The performance of applications on a NUMA platform can be greatly affected by the location of memory pages relative to the processors accessing that memory. Placement of memory pages on a NUMA platform running Linux is controlled by "memory policies." The default policy is quite easy to understand: a page will be allocated on the node to which the CPU requesting the allocation is attached. Other, explicit memory policies are a bit more complex--perhaps even counter-intuitive. Application programmers and users are sometimes surprised by the effects of a policy that they specified.

This presentation will describe the semantics of the Linux APIs (programming interfaces) and CLI (command line interface) available to explicitly control memory placement. It will attempt to provide an understanding of how the mechanisms work and why.
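
The default policy described above can be sketched as a toy model in Python. The CPU-to-node table and the function names are hypothetical, and the model deliberately ignores what happens when the local node has no free memory; it is not the kernel's implementation.

```python
# Hypothetical topology: which NUMA node each CPU is attached to.
CPU_TO_NODE = {0: 0, 1: 0, 2: 1, 3: 1}


def default_policy_node(requesting_cpu: int) -> int:
    """Toy model of the default policy: allocate on the requesting CPU's node."""
    return CPU_TO_NODE[requesting_cpu]


def allocate_pages(requests):
    """Map each (cpu, page) allocation request to the node that would hold the page."""
    return {page: default_policy_node(cpu) for cpu, page in requests}


# Pages A, B, and C are first requested from CPUs 0, 2, and 3 respectively.
placement = allocate_pages([(0, "A"), (2, "B"), (3, "C")])
print(placement)  # {'A': 0, 'B': 1, 'C': 1}
```

In this model a page ends up wherever the requesting thread happened to be running, which is one reason explicit policies can be needed when a different thread later does most of the accessing.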

Presentation (pdf, 208 KB)

 

Development of a Compiler for Modulo Scheduled Itanium Codes

John H. Detrich, Independent Contractor

In large scale scientific and engineering computer applications, one frequently encounters critically important tasks where modulo scheduled execution can achieve unprecedented efficiency. Itanium architecture was specifically designed to facilitate the construction of modulo scheduled codes. An important practical implementation in this domain, targeting the mass-production of elementary functions, is the vector math library (VML) released by HP in the public domain. Drawing on our VML experience, we will report on our efforts to construct a compiler to implement modulo scheduled codes. We will specify the appropriate strategies, tools, and nature of the source language, and present a representative example in detail, with a description of the stages in the compilation.
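
To give a sense of why modulo scheduling pays off, the sketch below applies the usual textbook cycle count for a software-pipelined loop: with an initiation interval of `ii` cycles and `stages` pipeline stages, `n` iterations take `(n + stages - 1) * ii` cycles. The function names and the sample numbers are assumptions made for illustration and are not taken from the talk or from VML.

```python
def pipelined_cycles(n: int, ii: int, stages: int) -> int:
    """Cycles for n iterations when a new iteration starts every ii cycles.

    The extra (stages - 1) kernel passes fill and drain the pipeline.
    """
    return (n + stages - 1) * ii


def sequential_cycles(n: int, ii: int, stages: int) -> int:
    """Cycles when each iteration runs to completion before the next one starts.

    Assumes a single iteration takes stages * ii cycles.
    """
    return n * stages * ii


# Hypothetical loop: 100 iterations, initiation interval 2, 3 stages.
print(pipelined_cycles(100, 2, 3))   # 204
print(sequential_cycles(100, 2, 3))  # 600
```

For long loops the pipelined count approaches `n * ii`, so the speedup over the sequential schedule approaches the number of stages.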

No presentation available currently

 

OpenMP

Eric W. Moore, Intel

This presentation will introduce OpenMP (Open Multi-Processing), a simple and flexible interface for developing shared-memory parallel applications.

Presentation (pdf, 587 KB)

 

Genuinely Secure Systems

Bill S. Worley, Secure64

Secure64 Software Corporation has developed a micro-OS called SourceT. SourceT was designed from the ground up to provide the foundation for systems with a strong set of security properties. Such systems are called "Genuinely Secure." The Intel Itanium architecture for the first time provides capabilities that enable a genuinely secure system. The talk will define and discuss the properties of a genuinely secure system, the innovative uses of unique Itanium processor capabilities, and the synergy with other emerging trusted system technologies such as VMMs, VT, VT-d, LT, and TPMs. The talk will also explain why a genuinely secure system cannot be built on non-Itanium-based platforms.

Presentation (pdf, 744 KB)

 

Hardware Profile-Guided Automatic Page Placement for ccNUMA Systems

Jaydeep Marathe, North Carolina State University

In contemporary ccNUMA systems, accesses to local physical memory incur significantly lower latencies than accesses to remote memory. Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly impact the overall wall-clock execution time.

In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates pages near processors that most frequently access those pages. Our scheme uses performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation. Furthermore, it requires no special compiler, operating system, or network interconnect support.

We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. Our experiments show that our method can efficiently improve page placement, leading to average wall-clock execution time savings of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% over the overall original program wall-clock time.
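
The placement rule stated above (bind each page to the node that accesses it most often) can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs, not the authors' implementation: the trace format, the sample data, and the `page_affinity` name are all hypothetical.

```python
from collections import Counter, defaultdict


def page_affinity(trace):
    """Bind each page to the node that accessed it most often in the trace.

    trace is an iterable of (node, page) samples from an approximate access trace.
    """
    counts = defaultdict(Counter)
    for node, page in trace:
        counts[page][node] += 1
    # most_common(1) yields the (node, count) pair with the highest count.
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}


# Hypothetical sampled trace of (node, page address) accesses.
samples = [(0, 0x1000), (1, 0x1000), (1, 0x1000), (0, 0x2000)]
print(page_affinity(samples))  # {4096: 1, 8192: 0}
```

A real profiler would also have to decide how to break ties and how to treat pages with too few samples; the sketch leaves both open.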

Presentation (pdf, 744 KB)

 

Kexec/Kdump on IPF

Nanhai Zou, Intel

Kexec is a Linux kernel mechanism to boot to another kernel from a running kernel. Kdump uses Kexec to quickly boot to a dump-capture kernel whenever a dump of an entire system is needed (for example, when the system crashes). This presentation will introduce how Kexec/Kdump is supported in Linux IA-64.

Presentation (pdf, 751 KB)

 

Intel Thread Checker

Gary Carleton, Intel

Matthieu Delahaye, Gelato Central Operations

Discover and learn how to speed up your development process by detecting unprotected shared memory accesses, even when no actual race condition occurs.

No presentation available currently

 

Process Scheduling on 1024 Processors

Christoph Lameter, SGI

Where the boundary lies for scaling monolithic operating systems such as Linux has been, and remains, an unanswered question. Recently, we had to deal with systems at 1024 nodes, 1024 IA-64 processors, and a couple of terabytes of memory. The Linux scheduler did become an issue during the testing of these systems.

This presentation will discuss the basics of the Linux scheduler and how it was designed to work on SMP systems, and then will explain what was done for scheduling in very large systems. Changes were necessary because data placement issues caused long latencies, global scans over all processors could potentially livelock the system, and the scheduler stopped load balancing in some scenarios with processes pinned to processors.

Presentation (pdf, 299 KB)

 

The Itanium PAL (Processor Abstraction Layer)

Tony Luck, Intel

Like all good software structures, the Itanium software stack follows the "lasagne" model with several layers of software abstracting away implementation details of lower levels while providing higher levels of functionality at the upper levels.

The processor abstraction layer (PAL) sits at the bottom of this stack. This talk explains how the Linux kernel interacts with the PAL.

Presentation (pdf, 208 KB)

 

OpenUH: Exploring Language Constructs and Their Implementations

Barbara Chapman, University of Houston

Several years ago, we began to explore using the Open64 compiler infrastructure to support a variety of research and teaching interests related to parallel computing. Since then, we have used the compiler to experiment with real-world technical programs and have enhanced its features to help us translate and explore the potential for optimizing OpenMP programs to run on a variety of different platforms. In this presentation, we will discuss our experiences using this compiler suite to create OpenUH, our branch of Open64, and our plans for future work.

Presentation (pdf, 1.26 MB)

 

Intel Thread Profiler

Gary Carleton, Intel

Matthieu Delahaye, Gelato Central Operations

Learn how to understand the overall performance of your application, how to identify which part of your code should be parallelized, and how to reduce the impact of the overhead induced by thread creation, locking, or barriers.

No presentation available currently

 

Parallel Programming Tools for Multi-Core, SMPs, and Computational Clusters

Alexander Moskovsky, Program Systems Institute, Russian Academy of Sciences

PSI RAS conducts research in the field of tools and languages that are designed to simplify parallel programming for multi-cores, SMPs, and computational clusters. We employ the C++ language as a basis; language extensions have been developed and, more recently, a C++ template library. In both cases the parallelism constructs are implicit, which allows writing compact programs and enables additional features like fault tolerance and portability. Dynamic load balancing and control of parallelism granule size enable runtime program adaptation to the hardware in use. As by-products, the approach also allows visualized traces of execution and synthesis of Web services for computational tasks.

No presentation available currently

 

Distribution Panel: Redhat, Debian & openSUSE

Prarit Bhargava, Redhat

Nathan Conger, Novell

Dann Frazier, HP

Presentation of Redhat (pdf, 641 KB)

 

Intel's Commitment to Linux

Sunil Saxena, Intel

This talk will provide you with a high-level roadmap for the IPF platform and what to expect from the platform in the future. You will learn how Intel is driving processor features to move Linux into high-end, mission-critical systems; driving performance to make Linux the server OS of choice; and working with the Itanium Software Alliance to either drive or enable Linux at the high end. You will learn about Intel's contribution to Linux in the areas of data center tools, RAS features for Linux, solution stack availability, and performance optimization. We will provide some of the leading performance data that is publicly available for IPF. We will also cover virtualization solutions for IPF Linux, as virtualization has become a requirement in data centers. We will end this talk with a call to action on the remaining gaps in Linux for IPF.

Presentation (pdf, 2.65 MB)

 

Update on Open|SpeedShop: An Open-Source Performance Toolset for Linux Clusters

Martin Schulz, Lawrence Livermore National Laboratory

Open|SpeedShop is an open-source performance tool project targeting Linux workstations and large scale clusters, including Itanium-based systems. It provides many essential performance analysis steps in a single, unified framework and is designed for both novice and advanced users. Basic functionality can be used without a large learning curve while more advanced functionality is optionally available throughout the tool. Users can access Open|SpeedShop through a comprehensive GUI, command line and batch language, or Python module.

Open|SpeedShop is extensible through (potentially commercialized) plugins that provide additional data collection or visualization. This can be used by advanced users to customize their performance analysis environments or can serve tool builders with an easy way to build and deploy new tools.

In this talk, I will introduce Open|SpeedShop, present its main features and provide an update on the project's progress. In addition, I will give a brief tutorial showing how users can deploy it for their own performance analysis projects and I will discuss how everybody can contribute to this open-source effort.

Presentation (pdf, 3.09 MB)

 

Itanium 2 and Montecito Microarchitecture

Cameron McNairy, Intel

The Itanium architecture requires the compiler to schedule instructions well for performance. The Itanium 2 and Dual-Core Itanium 2 9000 series processors provide many opportunities for code generators to obtain performance. This talk will explain some of the key aspects of these processors in an effort to enable code generators to schedule instructions well and achieve high performance. Topics will include instruction fetch and dispersal, execution, and the memory subsystem, along with key dos and don'ts.

No presentation available currently

 

Setting Up a Quiet Itanium-Based Home Office for Fun and Profit

David Mosberger, Gelato Honorary Member

In this brief presentation, David will share some of the trials and tribulations he went through in planning and implementing a home office that is cool, quiet, power-efficient, yet provides convenient access to several desk-side and server-type computers. One computer in particular was a challenge since it sounded like a vacuum cleaner, even when powered off! To learn about the solution to this and other problems and what they have to do with a fully automated satellite-dish heater, please attend this talk.

Presentation (pdf, 537 KB)