Executable and Resource-Predicting Application Models for interventional X-Ray

Madhura Korikkar
0827592

Master Thesis
Eindhoven University of Technology
Department of Mathematics and Computer Science
System Architecture and Networking
and
Manipal Institute of Technology
Department of Information and Communication Technology
Network Engineering

Graduation Supervisor:
prof.dr. J. J. Lukkien
Chair, System Architecture and Networking
j.j.lukkien@tue.nl

Supervisor at MIT, Manipal:
prof.dr. Radhika M. Pai
radhika.pai@manipal.edu

Tutor:
ir. A. Visser
System Architect
Philips Healthcare, interventional X-ray
alex.visser@philips.com
Abstract

With the advent of improving technology, advanced solutions are provided to healthcare professionals. Through combining human insights and clinical expertise, patient outcomes are improved while lowering the burden on the healthcare system. Interventional X-Ray is responsible for the development of X-Ray systems used for diagnosis and interventional treatment of cardiac and vascular diseases. Philips Healthcare is the market leader in this field.

In cardiovascular minimally invasive interventions, physicians require low-latency X-Ray imaging applications, as their actions must be directly visible on the screen. This requires sufficient performance of the image-processing system while executing a plurality of functions. Because dedicated hardware lacks flexibility, the functionality of such systems is increasingly based on commercial technology like PC hardware, operating systems (OS) and state-of-the-art software languages. Therefore, the development of this type of system is performed in multidisciplinary teams where hardware, mechanics and software come together.

A clear trend within interventional X-Ray applications is a complete integration of diagnostic features within X-Ray interventions, to support earlier feedback in the clinical work-flow. Interventional image processing and other diagnostic applications are combined in one system design and executed simultaneously in various combinations.

A consequence of the sketched trend is that multiple applications are executed in parallel and that the PCs used for these systems have to offer sufficient performance under various use-cases. The latter has to be verified for each X-Ray system that is assembled during production. However, adequate prediction and verification of the components based on modeling beforehand saves time and investment. The models can also be given to hardware vendors for use in the selection process.

In addition to this, models can be used to decide the optimum combination of applications on a specific hardware platform.

The goal of the work is therefore to make models of applications that admit prediction of their execution behaviour on a given platform. These models will be both descriptive and executable. More precisely, the work includes the following tasks:

- Define the technical context (assumptions, requirements, and existing work) for this prediction and verification

- Define the particular requirements for the workstations within the X-Ray system and relevant parameters

- Collect sufficient measurement data of the resource usage of the applications to develop descriptive models. These models predict the application behaviour on a given platform with sufficient precision

- Develop executable models, based on micro-benchmarks, to stress test PC workstations for their intended use, without the actual use of the medical software application. This stress test shall verify the actual performance for a specific instance of a platform. Under-performing PCs can then be excluded from the actual assembly.
Acknowledgements

This thesis is written as part of the master project that I have performed under the chair of System Architecture and Networking at Eindhoven University of Technology, The Netherlands, in cooperation with Philips Healthcare, Best, The Netherlands.

Upon completion of my master thesis, I would like to extend my heartfelt gratitude to everyone who has contributed to the successful completion of this work. My venture with this project has been extremely motivating and interesting. I have been able to learn, explore and broaden my horizon through this project. Therefore, first and foremost, I would like to express my utmost gratitude to my graduation supervisor, prof.dr. Johan J. Lukkien, who gave me the magnificent opportunity of working at Philips Healthcare. He has been a terrific guide and motivator. Discussions and interactions with him have helped me think and explore things on a deeper level. I thank him for his encouragement, guidance and recurring feedback, and for helping me shape my thesis into what it is today. I would also like to express my deepest and sincere gratitude to my tutor and supervisor at Philips Healthcare, Alex Visser, who has guided me, helped me to think independently and provided his valuable time and feedback. I also thank him for coming up with the idea of the Sounding Board meetings at Philips Healthcare. These meetings helped me develop my insight and understand my work better. He inspires me on both a professional and a personal level. I would also like to thank my mentor at TU/e, dr. Rudolf H. Mak, for his valuable suggestions and feedback throughout the execution of the project. He gave new insights into this project during the Sounding Board meetings, which helped me improve my work.

I extend my gratitude to prof.dr. Mark van den Brand, the program director for the Manipal-TU/e dual degree program, for being a part of my assessment committee in my project and most importantly for selecting me, along with prof.dr. Manohara M. Pai and prof.dr. Radhika Pai, to do my masters in this prestigious institute. I express my heartfelt gratitude to them for giving me this opportunity. I would also like to thank my supervisor from Manipal Institute of Technology, prof.dr. Radhika Pai, for her valuable evaluation and for supporting me on a personal level. I would also like to express my sincere respect and gratitude to Pascal Wolkotte, Mike Creemers and Bart Van Rijnsoever for being a part of the Sounding Board committee at Philips Healthcare and providing useful suggestions, feedback and support in many circumstances. I also thank Philips Healthcare for giving me a new perspective on my career, and all my colleagues there who helped me with my thesis directly or indirectly.

I would like to thank my friends who stood by me at all times. Finally, I would like to thank my father, Subrahmanya, my mother Lakshmi and my brother, Sumanth for their unconditional encouragement and support. They are the butter to my bread and the breath to my life.
Nomenclature

AVX  Advanced Vector Extensions
BIOS Basic Input Output System
CPU  Central Processing Unit
DIMM Dual In Line Memory Module
DMI  Direct Media Interface
DRAM Dynamic Random Access Memory
ESI  Enclosure Services Interface
ICH  I/O Controller Hub
IMC  Integrated Memory Controller
IP   Image Processing
LSI  Large Scale Integration
NUMA Non Uniform Memory Access
PCH  Platform Controller Hub
PCIe Peripheral Component Interconnect Express
PCM  Performance Counter Monitor
QPI  Quick Path Interconnect
RUM  Resource Usage Measurement
RUMM Resource Usage Modeling Measure
SAS  Serial Attached SCSI
SATA Serial Advanced Technology Attachment
SCSI Small Computer System Interface
TDP  Thermal Design Power
USB  Universal Serial Bus
Chapter 1

Introduction

1.1 Introduction

In today’s era of rapid technological progress, the world has witnessed advances in every aspect of life. Medicine and patient health care is one such field: it has saved millions of lives and continues to evolve with the technology. Today it is possible to treat several ailments of the heart, brain and vascular tissue using angiography generated by equipment that works in concert with efficient and powerful PCs. The clinical use of X-Ray imaging for angiography is undergoing a gradual paradigm shift from surgery to intervention. Interventional treatments use X-Ray images during the procedure, which demands eye-hand co-ordination of the physician and imposes real-time requirements, including low latency, on the image processing. A clear trend is therefore the integration of diagnostic and interventional X-Ray imaging to provide earlier feedback in the clinical work-flow and possibly provide guidance to the intervention[1]. The Allura X-Ray system is one such sophisticated system from Philips Healthcare, designed and developed for both diagnostic and interventional purposes. It is therefore essential for this system to perform with high reliability. This necessitates extensive and elaborate tests during production, assembly and release of the product. In spite of the series of tests conducted, some systems are known to fail due to hardware failures. One goal of this project is to develop executable models that can be run on a system to predict failures and determine their nature. Once these models are created and shipped to the manufacturer of the PC, they yield, from the business perspective, a significant cost reduction by avoiding the cost of shipping PCs back to the manufacturer.

The Allura X-Ray system uses different PCs for different functionalities. Although each of these PCs is of great use in its own way, combining two or more PCs into one PC saves considerable capital. With this as motivation, another goal of this project is to develop descriptive models that can be used to predict the performance of an application, or of several combined applications, on a PC with known hardware properties. This can be a starting point for combining two PCs into one, or for predicting the performance of a PC before it is even available from the vendor, based on its architectural properties alone.

The outcome of this project is a set of descriptive and executable models that predict the performance of a PC. This report gives a detailed description of the design and development of these models, and of the observations and inferences made during the execution of this project.

1.2 Understanding the Allura X-Ray System

The development of performance models is based on a sound understanding of the Allura X-Ray system. As groundwork, this section discusses the basics of the Allura X-Ray system.

1.2.1 Biomedical X-Ray imaging

Biomedical X-Ray imaging has proven to be an indispensable component of many medical diagnostic and treatment techniques. X-Rays have been used for biomedical imaging ever since Roentgen X-rayed his wife’s hand in 1895 after he discovered them[2]. The speed with which the technology was adopted in medicine is phenomenal by today’s standards. An important reason for this rapid acceptance is probably that, for the first time, an internal organ could be evaluated objectively (without opening the body or surgery) in order to act in a meaningful way. Another reason is probably that the technology for making X-Rays and recording X-Ray images was available because it was already in widespread use for other purposes.

An X-Ray image is created by irradiating the desired part of the human body and capturing the remaining radiation at the opposite side of the body part in an X-Ray detector. Since their discovery, X-Rays have been used to detect bone fractures and gunshot wounds, and their use has since grown to include X-Ray radiography, computed tomography, fluoroscopy and radiotherapy. X-Rays can also be used for minimally invasive interventional treatments of the heart and brain. Minimally invasive procedures are performed through tiny incisions instead of the one large opening used in conventional open surgery. Because the incisions are small, patients tend to have quicker recovery times and less discomfort than with conventional surgery.

With minimally invasive interventions, cardiologists diagnose and treat coronary artery disease using a catheter inserted into the groin and threaded through the arterial vessel tree to reach the heart. Radiologists also use these systems to treat aneurysms and thromboses by inserting catheters into the veins. Treatment may take the form of balloon angioplasty (compressing the plaque against the wall of the vessel) or stenting (inserting a small wire tube). A radio-opaque contrast fluid is injected to show the structure of the lumen of the blood vessel through which the catheter passes and to pinpoint blockages or narrowed arteries that need treatment. The contrast fluid blocks the X-Rays and creates contrast between the blood vessels and their surroundings. This principle is used in angiography. Figure 1.1 shows an example of an X-Ray angiography image.

![Figure 1.1](image.png)

Figure 1.1: (a,b) A narrowed coronary artery of the heart. (c) A stent is inserted through a balloon catheterization. (d) The balloon is inflated, expanding the stent and widening the artery. (e) The stent holds the artery open (Albers, 2010).

1.2.2 The Allura X-Ray System

Allura X-Ray systems are interventional systems from Philips Healthcare honed to meet precise cardiovascular, electro-physiology, radiology and neuro-radiology needs. The system has the flexibility to handle a wide range of mainstream diagnostic and interventional procedures with exceptional clarity and deep insight. Figure 1.2 shows an Allura X-Ray system used in a hospital for an interventional treatment. The Allura system has a C-arm that rotates in a curved trajectory around the patient. One end of this arm generates X-Rays using a generator and the other end captures the images using a detector. Physicians perform minimally invasive interventional procedures while viewing the X-Ray images of the patient on the monitor.
Data flow

The functionalities of the system are explained in brief using a data flow representation [4]. Figure 1.3 shows a conceptual dataflow diagram for the Allura X-Ray system.

![Dataflow diagram for Allura X-Ray system](image)

Figure 1.3: Dataflow diagram for Allura X-Ray system (Albers, 2010)

The blocks shown in Figure 1.3 are briefly described below.

- **X-Ray detector:** This is the module that captures the X-Ray images formed once the radiation has passed through the patient. The C-arm carries both the generator and the detector.

- **Image Acquisition:** The detector sends the data to the image acquisition module, where an X-Ray image is constructed. This module preprocesses the acquired images before they are sent to the image processing module for further processing.

- **Image Processing:** The images are acquired using two modes of X-Ray generation: a fluoroscopy run, in which an extremely low dose of radiation is used, and an exposure run, in which a higher dose of radiation is used. The image processing module provides functionalities such as noise reduction, contrast enhancement, image analysis, feature enhancement, motion compensation, and image subtraction on the acquired images. The processed images are also stored on disk for future retrieval and analysis. It is important to obtain images of extremely high quality using a minimum dose of radiation. Image processing techniques ensure sufficient image quality, clarity and resolution.

- **Display Processing:** Display processing prepares the images for display on several monitors, together with menus and other manipulation functions. It also compensates for differences in monitor contrast and brightness.

- **Display:** This module consists of either multiple small monitors or a single large high-resolution monitor supported by the Flex-vision PC.

- **Machine Control:** This module enables the physicians to rotate the C-arm and move the table.

- **User and Service interface:** This module enables the physicians to configure the displays and manipulate images shown on the displays. The physicians can switch images from one monitor to another for better viewing. They can, in the case of Flex-vision, choose from a set of defined templates. The user and service interface enables physicians to manipulate images by zooming, panning, and moving images from one viewing port to another. This module also provides the interface for servicing the X-Ray systems. Through the service interfaces, service engineers are able to configure, test, and fix the product.

- **User Touch Screen:** This module is a touch screen that provides access to the User and Service interface.

- **Storage:** The images acquired during fluoroscopy and exposure runs are stored on disk for future retrieval and can be used for diagnostic purposes. A Picture Archiving and Communication System (PACS) is used to store images in hospitals.
Clinical setting

Figure 1.4 illustrates the clinical work-flow environment used for an Allura system. The figure shows the control room and the exam room. The exam room is where the Allura system is placed. X-Rays are generated and detected, and the resulting images are displayed on the monitors for physicians to use during interventions. The control room also accommodates monitors for diagnoses and patient information management, for physicians and technicians respectively. The technical room (not shown in the figure) is where the tier of PCs is housed in the M-Cabinet, along with the cooling system.

Allura Xper Product Series

Interventional Allura X-Ray systems include monoplane and biplane systems. A biplane system provides twice as much information as a monoplane system, because it provides both frontal and lateral images, whereas a monoplane system provides only frontal images.
1.3 Scientific Background

As discussed, it is important for physicians to see X-Ray images with very low latency on the monitors when performing cardiovascular minimally invasive interventions. The Image Processing (IP) part of the system must perform extremely well to meet this requirement. The Allura X-Ray system combines powerful PCs to execute a plurality of functions with guaranteed performance. The performance of the system depends on the hardware components, the OS and the software used to develop the system. It is therefore critically important to be able to test the applications to be run on the PCs and to predict in advance, before shipping to the customer, whether the system will run successfully. Adequate models make such testing possible.

The PCs used in the X-Ray system run an array of tasks and functionalities in parallel. However, different PCs with exactly the same hardware specification appear to show different levels of resource utilization for the same application. Some of these systems provide the desired level of performance and the rest fail. Regular tests do not reveal this difference. The aim is to distinguish between these PCs before the production phase, using a test set derived from the application. This amounts to developing an executable model of the application that can be handed to the vendor (or manufacturer) of the PC.

If we vary the hardware specification, applications also behave differently in terms of resource usage. A further aim is to predict the execution of an application depending on the hardware model. This is useful, for example, for selecting new hardware and for predicting the effect of combining applications on a single hardware platform. To that end, descriptive models are developed in which application and hardware properties are parameters. These models can be used to combine two different PCs into one PC, to determine the optimum configuration for an application, and to predict whether the application can run on a PC that is yet to be developed at the vendor’s end.

To determine the resource usage characteristics of an application on a PC with a particular hardware model, performance properties are measured while the application is running. The measurements are taken using MeasureLoadCLI, a tool developed at Philips Healthcare. The executable models are created using the Resource Usage Modeling Measure Tool (RUMM Tool)\textsuperscript{1} designed by Ashenafi Gebreweld at Philips Healthcare.

1.4 Problem Definition

The problem definition of this thesis revolves around two different parts discussed in Section 1.3. They are:

1. To develop executable models using resource usage characteristics of an application run on a PC to be able to predict the nature of the PC (good/bad\textsuperscript{2}) before the production phase

2. To develop descriptive models to predict the execution of application(s), depending on the hardware model of the PC

\textsuperscript{1} The RUMM tool is used to generate executable models using micro-benchmark codes derived by creating a profile of application usage and the hardware model chosen.

\textsuperscript{2} A good PC is a PC that meets the performance requirements of the Allura X-Ray system, and a bad PC is a PC that fails to provide the desired level of performance. Such PCs fail due to one or more problems observed during testing, such as system crashes, corrupt images, network card failures and memory problems. Note that the models are not intended to predict easy-to-find failures (e.g. a defective video card); instead, they try to find systems that have no consistent failure pattern and seem to work at first sight (hard-to-find failures).
1.5 Contribution

As discussed in Section 1.4, there are two distinct goals in this project. The research questions that are addressed and the solutions to the questions are listed below.

1. What are the hardware metrics that are responsible for the failure of a PC used in the Allura Xper X-Ray system?

   • Intensive data analysis: Analysis of several hardware metrics, for example, CPU usage, cache misses, cache hits, to observe unusual behaviour in the measurements captured from the PCs.

   • Creation of executable models: Creation of executable models that, when run on a PC, predict its performance without the need to run the actual application and an extensive suite of tests.

2. What are the possible configurations of a hardware model of a PC and the applications run on that PC?

   • Hardware metric analysis: Analysis of hardware metrics is done to observe hardware resource metric usage for a particular application with distinct characteristics. For example, we use an application that stresses memory buses.

   • Creation of descriptive models: Creation of descriptive models that formulate the performance of the applications on a PC. This serves as a basis to predict the performance of the same application(s) on a different PC (different in hardware specification, for example, change in memory configuration).

1.6 Thesis Outline

The remainder of this thesis is organised as follows. In Chapter 2, the problem definition is elaborated to give deeper insight into the goals of this project. In Chapter 3, the design and development of descriptive models are discussed, including the experiments conducted and a reflection on the measurements collected. In Chapter 4, the design and creation of executable models to predict the nature of the PC are discussed in detail, including the findings from the analysis of the measurements collected. Finally, Chapter 5 concludes the thesis and presents future work.
Chapter 2

Problem Description

2.1 Introduction

From Chapter 1 it is known that we are interested in determining and predicting the performance of the Allura X-Ray system. We intend to create executable and descriptive models for performance prediction and evaluation. Therefore, it is important to understand the nature of these models before delving into greater depth. This chapter discusses the types of models and the approach used, and formalises the problem definition using mathematical notation.

2.2 Solution Approach

A performance model of the architectural instance of a PC can be used to evaluate the performance of the PC in terms of metrics such as bandwidth utilisation and the observed throughput of the interconnection technology. In this section, we look for an approach that evaluates the performance of applications on a PC.

Performance evaluation of the applications running on any PC can be done using the following approaches.

1. Y-Chart Approach: This is a methodology that provides designers with quantitative data obtained by analysing the performance of architectures for a given set of applications. See Figure 2.1. This is described as follows:

   - Architectural Instance: This can also be called the Hardware Model or Platform Model of a PC. It specifies the physical, structural and behavioural properties and the performance constraints of the platform used, for example, memory size, bus bandwidth.

   - Application Model: This describes the functional behaviour of an application in an architecture-independent manner. It is used to study the target application and obtain estimations of its performance needs, for example, to identify the most expensive computation task. The application parameters are used as input to the application models. The model correctly expresses the functional behaviour, but is independent from the key parameters of architecture constraints, such as, resource utilization or bandwidth limitations.

   - Mapping: This is used to map application models on the architectural instance. The mapping helps obtain performance numbers that are useful for performance evaluation.

Figure 2.1: Y-Chart Approach

The light bulbs shown in Figure 2.1 indicate feedback loops for performance optimization by changing the application model, the architectural instance and the mapping. However, this thesis is concerned with performance evaluation only, so the parts marked with light bulbs are not relevant in this context. Although this method provides the performance numbers, the tight coupling of the application models to the application demands a new application model for every change in the application. This could be time-consuming and expensive. Therefore, this method is not used for performance evaluation.

2. Resource Usage Modeling Approach: This is a methodology that models the architecture and the application together by analysing the resource usage parameters of the application run on the architectural instances of PCs. The methodology uses a set of tools to measure the resource usage parameters of the application, and the measurements are used to create models that predict the execution behaviour of the application on another architectural instance (platform). This is exactly what we intend to do. This method eliminates the need to create new application models for modified applications. Therefore, the resource usage modeling approach is chosen for the performance evaluation of the PCs. See Figure 2.2. The star sign in Figure 2.2 indicates that whenever an application is modified, only the resource usage measurements change; the tight coupling between the application behaviour and the models seen in the Y-Chart approach is resolved. This makes the models easy to maintain and easy to evolve with the applications.

Resource usage modeling can be done using one or both of the modeling techniques described.

(a) Execution modeling: An executable model is a piece of executable code that mimics the resource usage of an application (such code is also called a micro-benchmark). Executable models are platform dependent.

(b) Descriptive modeling: A descriptive model is a mathematical formula that describes the performance metrics of choice. Hardware properties (for example, memory size, speeds, number of processors) take the form of parameters in such a formula.

In this thesis, we intend to use one or both of the above approaches. They can be used to validate each other.
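As a minimal sketch of how the two kinds of model could validate each other, the Python fragment below pairs a descriptive formula with a tiny executable micro-benchmark. The formula is the standard theoretical peak DRAM bandwidth (channels times transfer rate times bus width). The function names, the 64-bit bus default and the buffer size are assumptions made for illustration; they are not taken from the RUMM tool or from the Allura application.

```python
import time


def peak_memory_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                              bus_bytes: int = 8) -> float:
    """Descriptive model: theoretical peak DRAM bandwidth in GB/s.

    Hardware properties appear as parameters of the formula.
    bus_bytes=8 assumes a 64-bit memory channel (an assumption).
    """
    return channels * mega_transfers_per_s * 1_000_000 * bus_bytes / 1e9


def measured_copy_bandwidth_gbs(buffer_mib: int = 64, repeats: int = 5) -> float:
    """Executable model: a micro-benchmark that stresses memory by copying.

    Returns the best observed copy throughput in GB/s. Each copy reads
    and writes the buffer once, so 2 * size bytes are moved per repeat.
    """
    size = buffer_mib * 1024 * 1024
    src = bytearray(size)
    dst = bytearray(size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                          # bulk memory copy
        elapsed = time.perf_counter() - start
        best = min(best, max(elapsed, 1e-9))  # guard against a zero reading
    return 2 * size / best / 1e9


if __name__ == "__main__":
    # Hypothetical platform: 4 channels at 1600 MT/s.
    predicted = peak_memory_bandwidth_gbs(channels=4, mega_transfers_per_s=1600)
    measured = measured_copy_bandwidth_gbs()
    print(f"predicted peak: {predicted:.1f} GB/s, measured copy: {measured:.1f} GB/s")
```

A single-threaded copy from Python is not expected to reach the theoretical peak, so the descriptive value serves here only as an upper bound against which the measured value can be checked.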

![Figure 2.2: Resource Usage Modeling Approach](image)
2.2.1 System Overview

The Allura X-Ray system meets the performance requirements by using a set of computationally efficient PCs. The system has several types of PCs used for user input, image processing, image detection and image display. For brevity, not all PCs are described in this report. The PCs that are of interest to this thesis are:

1. Host PC: This is the heart of the entire system and provides an interface for the user to perform tasks, perform safety checks and conduct tests. This is the only PC that has a keyboard, mouse and monitor.

2. Image Processing (IP) PC: This is the PC used for extensive image processing. Images are sampled at a suitable sampling rate and processed to store them for diagnostic purposes. This can be considered as the brain of the system.

The IP PCs are more susceptible to performance issues. Tests conducted in the factory reveal that IP PCs fail during performance testing. Also, IP PCs have a well-defined (soft real-time) and predictable behaviour, and should therefore show less variance in the results. Therefore, we study IP PCs.

2.3 Formalizing the Approach

The sequence of steps followed to achieve the goals discussed in Chapter 1 is represented using a formalized approach in this section. This approach is used to introduce the technology and ensure precise semantic meaning of the specifications.

2.3.1 Part I

Part I of the problem definition is: To develop executable models using resource usage characteristics of an application run on a PC to be able to predict the nature of the PC (good/bad) before the production phase.

The set of machines is denoted by $M$. Each machine $m \in M$ is associated with a vector of physical properties describing the details of its construction (e.g. type and make of its graphics cards, NICs). For the purpose of this work, we focus on the amounts and types of resources which we denote as attributes of $m$. Table 2.1 lists the attributes of interest of $m$.

The set of programs is denoted by $P$. Each program $p \in P$ run on $m$ is chosen such that it represents the Allura application.

Table 2.1: Attributes of $m$

<table>
<thead>
<tr>
<th>Notation of attribute of $m$</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$m.CPUmodel$</td>
<td>The model of CPU in $m$</td>
</tr>
<tr>
<td>$m.CPUfreq$</td>
<td>CPU frequency in $m$</td>
</tr>
<tr>
<td>$m.Core$</td>
<td>Number of logical cores in $m$</td>
</tr>
<tr>
<td>$m.Socket$</td>
<td>Number of sockets in $m$</td>
</tr>
<tr>
<td>$m.Thread$</td>
<td>Number of threads per core in $m$</td>
</tr>
<tr>
<td>$m.L_0$</td>
<td>Size of $L_0$ cache in $m$</td>
</tr>
<tr>
<td>$m.L_1i$</td>
<td>Size of $L_1$ instruction cache in each core in $m$</td>
</tr>
<tr>
<td>$m.L_1d$</td>
<td>Size of $L_1$ data cache in each core in $m$</td>
</tr>
<tr>
<td>$m.L_2$</td>
<td>Size of $L_2$ cache in each core in $m$</td>
</tr>
<tr>
<td>$m.L_3$</td>
<td>Size of $L_3$ cache in $m$</td>
</tr>
<tr>
<td>$m.Mem$</td>
<td>The total DRAM memory in $m$</td>
</tr>
<tr>
<td>$m.MemChannel$</td>
<td>Number of channels from each CPU to DRAM in $m$</td>
</tr>
<tr>
<td>$m.DIMM$</td>
<td>Number of DIMMs per channel</td>
</tr>
<tr>
<td>$m.MemFreq$</td>
<td>Memory Frequency in $m$</td>
</tr>
<tr>
<td>$m.TotalMemBw$</td>
<td>Total memory bandwidth per CPU in $m$</td>
</tr>
<tr>
<td>$m.QPIlink$</td>
<td>Number of QPI links between the CPUs in $m$</td>
</tr>
<tr>
<td>$m.TotalQPIBw$</td>
<td>Total QPI bandwidth between the CPUs in $m$</td>
</tr>
<tr>
<td>$m.Hub$</td>
<td>Number of IO hubs in $m$</td>
</tr>
<tr>
<td>$m.PCIeversion$</td>
<td>Type (version) of PCIe used in $m$</td>
</tr>
<tr>
<td>$m.DiskSAS$</td>
<td>Size of the SAS disk in $m$</td>
</tr>
<tr>
<td>$m.DiskSATA$</td>
<td>Size of the SATA disk in $m$</td>
</tr>
<tr>
<td>$m.GraphicCard$</td>
<td>The number of graphic cards in $m$</td>
</tr>
<tr>
<td>$m.GpuMem$</td>
<td>Total amount of memory in the GPU in $m$</td>
</tr>
<tr>
<td>$m.PCIeGraphicBw$</td>
<td>PCIe bandwidth to and from the Graphic card in $m$</td>
</tr>
<tr>
<td>$m.GrabberCard$</td>
<td>Number of grabber cards in $m$</td>
</tr>
<tr>
<td>$m.PCIeGrabberBw$</td>
<td>PCIe bandwidth to and from the Grabber card in $m$</td>
</tr>
<tr>
<td>$m.InfiniBandCard$</td>
<td>Number of Infini-Band cards in $m$</td>
</tr>
<tr>
<td>$m.PCIeICBw$</td>
<td>PCIe bandwidth to and from the Infini-Band card in $m$</td>
</tr>
<tr>
<td>$m.TurboBoostversion$</td>
<td>The Turbo Boost technology version number in $m$</td>
</tr>
<tr>
<td>$m.maxTDP$</td>
<td>Maximum Thermal Design Power (TDP) in $m$</td>
</tr>
<tr>
<td>$m.lithography$</td>
<td>Lithography in $m$</td>
</tr>
</tbody>
</table>

The notation used is as follows:

- $m_1, m_2, \ldots \in M$ denote machines belonging to $M$
- $p_1, p_2, \ldots \in P$ denote programs belonging to $P$
- $x_1, x_2, \ldots \in X$ denote executable models belonging to $X \subseteq P$
- $p_1 \triangleright m_1$ denotes execution of $p_1$ on $m_1$

- $E$ is a function: $E : P \times M \to X$ that is used to create executable models

- $x_1 = E(p_1 \triangleright m_1)$ means that $x_1$ is the model derived from $p_1 \triangleright m_1$

- $k_1, \ldots, k_n \in K$ denote metrics belonging to $K$. Metrics are functions to real numbers (For example, socket usage, socket $L_3$ cache miss, disk read, disk writes are metrics in the context of this thesis). $k_1(p_1 \triangleright m_1)$ gives a metric $k_1$ on this execution. Similarly, $k_2(p_1 \triangleright m_1)$ gives a metric $k_2$ on this execution and so on.

- $C$ denotes a set of classifications of machines into good/bad. $C = \{\text{good, bad}\}$

- $k_{ref_1}, \ldots, k_{ref_n} \in K_{ref}$ denote threshold (reference) values of metrics for a machine based on its nature (good/bad) belonging to $K_{ref}$

A sequence of steps is followed to derive useful metric profiles of machines, compare the profiles with the classification made earlier, create executable models and verify their predictive quality. The steps are illustrated in Figure 2.3, Figure 2.4 and Figure 2.5 using the mathematical notation presented above.

From Figure 2.3, it is seen that a program $p_1$, which represents the application, is selected. It is run on the machines $m_1$ and $m_2$, which are classified by their nature (good/bad, as known from use and tests). This classification serves as the basis for developing an executable model and verifying its predictive quality. The MeasureLoadCLI tool is run in parallel with $p_1$ to record the Resource Usage Measurement (RUM) of the hardware parameters of interest.

Based on the RUM collected, a profile $k_1(p_1 \triangleright m_1), \ldots, k_n(p_1 \triangleright m_1)$ is created for $m_1$ and a profile, $k_1(p_1 \triangleright m_2), \ldots, k_n(p_1 \triangleright m_2)$ is created for $m_2$. A profile is a set of metrics.

The profile of $m_1$ is compared with the nature of $m_1$ (good/bad) to check whether $p_1$ is capable of producing a useful profile. This is done by comparing individual metrics of the profile created for $m_1$, $k_1(p_1 \triangleright m_1), \ldots, k_n(p_1 \triangleright m_1)$ with the threshold values of the metrics of $m_1$ determined based on its nature as shown in Equation 2.1:

$$k_{ref_1} - k_1(p_1 \triangleright m_1) \cap \ldots \cap k_{ref_n} - k_n(p_1 \triangleright m_1)$$

(2.1)

The qualifying condition for $p_1$ to produce a useful profile is given by Equation 2.2. It holds if the weighted sum of the absolute differences between the reference (threshold) metrics and the profile of the machine does not exceed the selected offset $\delta$.

$$w_1 |k_{ref_1} - k_1(p_1 \triangleright m_1)| + \ldots + w_n |k_{ref_n} - k_n(p_1 \triangleright m_1)| \leq \delta \tag{2.2}$$

- $w_1, \ldots, w_n$ are the weights assigned to the metrics
- $\delta$ is the error offset chosen
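
The check in Equation 2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in the thesis: the function names and all numeric values are hypothetical, and the two example metrics are chosen only to mirror the kind of metrics named earlier.

```python
from typing import Sequence


def weighted_deviation(k_ref: Sequence[float],
                       k_measured: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Left-hand side of Equation 2.2: sum over i of w_i * |k_ref_i - k_i|."""
    if not (len(k_ref) == len(k_measured) == len(weights)):
        raise ValueError("k_ref, k_measured and weights must have equal length")
    return sum(w * abs(r - k) for w, r, k in zip(weights, k_ref, k_measured))


def profile_is_useful(k_ref: Sequence[float],
                      k_measured: Sequence[float],
                      weights: Sequence[float],
                      delta: float) -> bool:
    """True if the weighted deviation does not exceed the error offset delta."""
    return weighted_deviation(k_ref, k_measured, weights) <= delta


# Hypothetical numbers for illustration only (not measured values).
k_ref = [50.0, 1000.0]       # reference values, e.g. socket usage and L3 cache misses
k_measured = [48.0, 1100.0]  # profile k_1, k_2 of p_1 executed on m_1
weights = [1.0, 0.01]        # w_1, w_2
print(profile_is_useful(k_ref, k_measured, weights, delta=5.0))
```

The weights let metrics with very different magnitudes contribute comparably to the sum; how they are chosen is left open here.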

This step is extended to develop an executable model, $x_1$, as seen in Figure 2.4:

$$x_1 = E(p_1 \triangleright m_1) \tag{2.3}$$

In order to verify the predictive quality of $x_1$, carry out the following steps.

\[
E(x_1 \triangleright m_2) = E(p_1 \triangleright m_2) \tag{2.4}
\]
\[
k_1(x_1 \triangleright m_1) = k_1(x_1 \triangleright m_2) \tag{2.5}
\]
\[
\vdots
\]
\[
k_n(x_1 \triangleright m_1) = k_n(x_1 \triangleright m_2)
\]

Equation 2.4 is used to verify the predictive quality of $x_1$ by executing it on $m_2$. The equality relation holds true if the execution of $x_1$ on $m_2$ matches the execution of $p_1$ on $m_2$, i.e., if Equation 2.6 holds. In Equation 2.6, each metric obtained by executing $x_1$ on $m_2$ is compared with the corresponding metric obtained by executing $p_1$ on $m_2$.

\[
k_1(x_1 \triangleright m_2) = k_1(p_1 \triangleright m_2) \tag{2.6}
\]
\[
\vdots
\]
\[
k_n(x_1 \triangleright m_2) = k_n(p_1 \triangleright m_2)
\]

Equation 2.5 is used to compare, metric by metric, the execution of $x_1$ on $m_1$ with the execution of $x_1$ on $m_2$. The equality relation of Equation 2.5 should hold when $m_1$ and $m_2$ are of the same nature (e.g., both are good machines) and fail when they are not (e.g., $m_1$ is good and $m_2$ is bad). If this condition is satisfied, then $x_1$ is capable of predicting the nature of the machine and the predictive quality of $x_1$ has been verified.

Based on the results of verifications, $x_1$ is formalized as an executable model to predict the nature of a PC before production.
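
The metric-wise comparisons in Equations 2.5 and 2.6 can be sketched as follows. The equations state exact equality; measured values rarely coincide exactly, so this sketch relaxes equality to a relative tolerance. That tolerance, the function name and the example values are assumptions made for illustration and do not come from the thesis.

```python
import math
from typing import Dict


def profiles_match(profile_a: Dict[str, float],
                   profile_b: Dict[str, float],
                   rel_tol: float = 0.05) -> bool:
    """Compare two profiles metric by metric, k_i(a) = k_i(b), within rel_tol."""
    if profile_a.keys() != profile_b.keys():
        raise ValueError("profiles must contain the same metrics")
    return all(
        math.isclose(profile_a[name], profile_b[name], rel_tol=rel_tol)
        for name in profile_a
    )


# Hypothetical profiles (illustrative values only).
x1_on_m2 = {"socket_usage": 41.0, "l3_cache_miss": 980.0}
p1_on_m2 = {"socket_usage": 40.0, "l3_cache_miss": 1000.0}

# Equation 2.6: does the model x_1 behave like the program p_1 on m_2?
print(profiles_match(x1_on_m2, p1_on_m2))
```

The same function covers Equation 2.5 when it is given the profiles of $x_1$ on $m_1$ and of $x_1$ on $m_2$.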

If Equation 2.2 fails, then an alternate program $p \in P$, $p \neq p_1$, is used to create a useful profile for $m_1$. If Equation 2.2 fails even after repeated executions of this step, then the MeasureLoadCLI tool is modified to measure additional or alternate metrics that are to be included in the model. For example, disk access latency and QPI bandwidth are alternate metrics that could be of use but are not yet incorporated in the MeasureLoadCLI tool.

### 2.3.2 Part II

Part 2 of the problem definition is: *To develop descriptive models to predict the execution of the application(s), depending on the hardware model of the PC.*

Figure 2.6: Activity Diagram II, One out of two: Representing the sequence of steps to be followed in creating and verifying the descriptive model

The first step in realizing this part of the assignment is to define the hardware model of a machine, which is represented as a vector of physical properties. For example, the hardware model of $m_1$ is specified as:

$$\hat{H}_{m_1} = \begin{pmatrix} m_1.Core & m_1.Socket & m_1.Thread & m_1.L_1 & m_1.L_2 & m_1.L_3 & m_1.Hub & \ldots \\ 8 & 2 & 1 & 32 KiB & 256 KiB & 8 MiB & 2 & \ldots \end{pmatrix}$$

$\hat{H}_{m_1}$ is a vectorial representation of the physical properties of $m_1$. It is also the hardware model of $m_1$. Hardware model is similarly defined for $m_2$.
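
As a rough illustration, the hardware model can be held in a small immutable record whose fields follow the attributes shown in the vector above. The class and field names are hypothetical, only the attributes listed explicitly in the vector are included, and sizes are stored in bytes.

```python
from dataclasses import dataclass, astuple

KIB = 1024
MIB = 1024 * KIB


@dataclass(frozen=True)
class HardwareModel:
    """Subset of the machine attributes that appear in the example vector."""
    core: int      # m.Core
    socket: int    # m.Socket
    thread: int    # m.Thread
    l1_bytes: int  # m.L_1
    l2_bytes: int  # m.L_2
    l3_bytes: int  # m.L_3
    hub: int       # m.Hub


# Values taken from the example vector for m_1.
h_m1 = HardwareModel(core=8, socket=2, thread=1,
                     l1_bytes=32 * KIB, l2_bytes=256 * KIB,
                     l3_bytes=8 * MIB, hub=2)

# astuple gives the plain vector form of the record.
print(astuple(h_m1))
```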

The hardware model is used as a basis to create a descriptive model.

- $d_1, d_2, \ldots \in D$ denote descriptive models belonging to $D$
- $F$ is a function: $F : P \times M \rightarrow D$ used to create descriptive models
- $d_1 = F(K_1, \hat{H}m_1)$ means that $d_1$ is the model derived as a function of $K_1$ generated on $m_1$ using $\hat{H}m_1$. Here, $K_1 \subseteq K$ is the profile created on $m_1$ such that $\forall k \in K_1, k : p_1 \times m_1 \rightarrow K_1$
- Similarly, $d_2 = F(K_2, \hat{H}m_2)$ means that $d_2$ is the model derived as a function of $K_2$ generated on $m_2$ using $\hat{H}m_2$. Here, $K_2 \subseteq K$ is the profile created on $m_2$ such that $\forall k \in K_2, k : p_1 \times m_2 \rightarrow K_2$.

The steps taken to create the descriptive models and verify their predictive quality are illustrated in Figure 2.6. In order to study the behaviour of the application on $m_1$, a program $p_1$ that generates the worst-case load possible on the machine is selected. The MeasureLoadCLI tool is run in parallel while $p_1$ runs on $m_1$ to capture the RUM of the chosen hardware parameters and to create the profile $K_1$ for $m_1$. The MeasureLoadCLI tool also captures a metric $k_i(p_1 \triangleright m_1) \notin K_1$.

Using the profile created for $m_1$ and $m_2$, develop descriptive models, $d_1$ and $d_2$.

\[ d_1 = F(K_1, \hat{H}m_1) \quad \text{(2.7)} \]

$d_1$ created using Equation 2.7 is used for prediction. $d_1$ is used to predict $k_i(p_1 \triangleright m_1) \notin K_1$ on $m_1$.

\[ k_{pi} = F(d_1, K_1, \hat{H}m_1) \quad \text{(2.8)} \]

$k_{pi}$ denotes the predicted value of $k_i(p_1 \triangleright m_1) \notin K_1$.

The predictive quality of $d_1$ is verified by comparing the measured $k_i(p_1 \triangleright m_1) \notin K_1$ with $k_{pi}$ predicted using Equation 2.8.

\[ |k_{pi} - k_i(p_1 \triangleright m_1)| \leq \varepsilon_1 \quad \text{(2.9)} \]

If Equation 2.9 holds, then the metric predicted is comparable with the actual value of the metric measured. This verifies the predictive quality of the model, $d_1$. It can be formalized. Taking the error factors into consideration, the metric, $k_{pi}$ is considered acceptable on $m_1$ if it is comparable to $k_i(p_1 \triangleright m_1)$ within an error offset of $\varepsilon_1$.

If Equation 2.9 fails to hold, then $d_1$ is re-developed using alternate metrics measured.

The steps taken to validate the descriptive model using another hardware platform are illustrated in Figure 2.7. The predictive quality of the model $d_1$ is validated on another machine, $m_2$. $p_1$ is run on machine $m_2$. The MeasureLoadCLI and the Perfmon tool are run simultaneously to collect the RUM of the hardware parameters, which is used to create a profile of metrics, $K_2$, for $m_2$. The MeasureLoadCLI tool is also used to capture a metric $k_j(p_1 \triangleright m_2) \notin K_2$.

$$k_{pj} = F(d_1, K_2, \hat{H}m_2) \tag{2.10}$$

$k_{pj}$ denotes the predicted value of $k_j(p_1 \triangleright m_2) \notin K_2$.

The predictive quality of $d_1$ is validated by comparing $k_j(p_1 \triangleright m_2) \notin K_2$ measured with $k_{pj}$ predicted as shown in Equation 2.11.

$$|k_{pj} - k_j(p_1 \triangleright m_2)| \leq \varepsilon_2 \tag{2.11}$$

Taking the error factors into consideration, the metric $k_{pj}$ is considered acceptable on $m_2$ if it is comparable to $k_j(p_1 \triangleright m_2)$ within an error offset of $\varepsilon_2$. If Equation 2.11 fails to hold, then $d_1$ is not validated on $m_2$. This means that the predictive quality of $d_1$ is limited to machine $m_1$, and $d_1$ is redeveloped taking alternate metrics into consideration. If Equation 2.11 holds, then the predictive quality of $d_1$ is validated on $m_2$ and $d_1$ can be used to predict performance metrics on different hardware platforms.
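
The verification on $m_1$ (Equation 2.9) and the validation on $m_2$ (Equation 2.11) share the same form: a predicted metric is accepted if it lies within an error offset of the measured one. A minimal sketch follows; the function name and all numbers are hypothetical and serve only to show how the two checks combine.

```python
def prediction_acceptable(k_predicted: float, k_measured: float, epsilon: float) -> bool:
    """Equations 2.9 and 2.11: |k_predicted - k_measured| <= epsilon."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return abs(k_predicted - k_measured) <= epsilon


# Hypothetical values for illustration only (not measurements from the thesis).
checks = {
    "m1": prediction_acceptable(k_predicted=10.2, k_measured=10.0, epsilon=0.5),  # Eq. 2.9
    "m2": prediction_acceptable(k_predicted=12.0, k_measured=10.9, epsilon=0.5),  # Eq. 2.11
}
verified = checks["m1"]                  # d_1 verified on m_1
validated = verified and checks["m2"]    # d_1 additionally validated on m_2
print(verified, validated)
```

In this made-up case the model would be verified on $m_1$ but not validated on $m_2$, which corresponds to the situation where $d_1$ is redeveloped with alternate metrics.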

If the predictive quality of $d_1$ is verified and validated, configurations possible are selected and formalized. For example, plausible configurations are:

$Configuration_1 = (\hat{H}m_1, p_1)$

$Configuration_2 = (\hat{H}m_2, p_1)$

This implies that it is possible to execute $p_1$ on $m_1$ and $p_1$ on $m_2$ and the properties of $p_1$ on $m_1$ and $m_2$ can be predicted in terms of hardware metrics.

Chapter 3

Descriptive models for resource usage characteristics of applications

### 3.1 Introduction

Performance prediction and measurement approaches for component-based software systems help software architects to evaluate their systems based on component performance specifications created by component developers. However, prediction methods for the performance and reliability of general software systems are still limited and seldom used in industry. The diverse information needed to predict the extra-functional properties of a system makes this a challenging task. If suitable models can be created to reflect the changing context, such as different components (components with different specifications), a different allocation of resources or a different usage context, they can be used to predict the performance of a system. This can, for example, result in significant cost reduction in a company like Philips Healthcare if the optimum configuration of a system with a known hardware platform and usage profile can be discovered using the model created. This chapter summarises the creation of such performance-predicting descriptive models using the resource usage profiles of a chosen application run on selected PCs.

The performance of the components in a system is influenced by its usage. The resource demand may vary depending on input parameters (for example, uploading larger files produces a higher demand on hard disk and network). Using this as a key point, resource usage metrics of an application (input) are used to create models that serve as a basis to predict the performance of the application on a different hardware platform.

Modeling a system using performance metrics is an abstract concept. It covers a large number of metrics for an array of hardware types and specifications. It would be interesting to model the system based on several QoS metrics and to explore which metric is the best predictor. Taking the practical and environmental limitations into consideration, we model the PC at Philips Healthcare using two cases.

1. Case 1: Compare the performance of the same PC with different memory configurations in terms of latency of an application

2. Case 2: Compare the performance of two PCs with different micro-architectures in terms of latency of an application

The two cases are detailed in separate sections. As seen in Figure 3.1, Section 3.2 and Section 3.3 present the details of Case 1, and Section 3.4 and Section 3.5 present the details of Case 2. Section 3.6 presents the lessons learned.

### 3.2 Execution Architecture Design-I

In this section, the application characteristics and requirements to obtain accurate performance estimations of the application on a selected hardware platform are presented for case 1. The execution architecture design encompasses the application properties, hardware model specifications and the mapping of the application on the hardware model.

### 3.2.1 PC Architecture and Hardware Model

The generic hardware model of the PC with Nehalem micro-architecture used for the study in this part of the thesis is shown in Figure 3.2. This hardware model relates to the physical layout of the hardware resources on the motherboard. An instantiated hardware model of a CPU is depicted in Figure 3.3. The available resources of the instantiated hardware model of the chosen PC, $m$, and their capacities are listed in Table 3.1.

The PC architecture consists of two CPUs running at 2 GHz. Each CPU has four cores. Each core is connected to a private $L_1$ data and instruction cache of 32 KiB each and a private $L_2$ cache of 256 KiB. Each CPU is connected to an $L_3$ cache of 4 MiB which is shared among 4 cores. A pair of memory buses connects each of the 4 cores to an external memory of up to 3 GiB. There are a total of 7 PCIe slots, out of which 4 are connected to 2 network cards and 2 Graphic cards. Each Graphic card has an external memory.

---

\(^1\)KiB $\neq$ KB; 1 KiB = 1024 B whereas 1 KB = 1000 B. Memory capacity is expressed in KiB and memory bus traffic in KB. Similarly, 1 MiB = 1024 × 1024 B and 1 GiB = 1024 × 1024 × 1024 B.

Table 3.1: Attributes of the hardware model of the PC

<table>
<thead>
<tr>
<th>Attribute of machine $m$</th>
<th>Type/Capacity</th>
</tr>
</thead>
<tbody>
<tr>
<td>$m.CPUmodel$</td>
<td>Intel Xeon CPU E5504</td>
</tr>
<tr>
<td>$m.Core$</td>
<td>8</td>
</tr>
<tr>
<td>$m.Socket$</td>
<td>2</td>
</tr>
<tr>
<td>$m.Thread$</td>
<td>1</td>
</tr>
<tr>
<td>$m.L_1i$(KiB)</td>
<td>32</td>
</tr>
<tr>
<td>$m.L_1d$(KiB)</td>
<td>32</td>
</tr>
<tr>
<td>$m.L_2$(KiB)</td>
<td>256</td>
</tr>
<tr>
<td>$m.L_3$(MiB)</td>
<td>8</td>
</tr>
<tr>
<td>$m.MemChannel$</td>
<td>3</td>
</tr>
<tr>
<td>$m.DIMM$</td>
<td>2</td>
</tr>
<tr>
<td>$m.MemFreq$(MHz)</td>
<td>800</td>
</tr>
<tr>
<td>$m.Disk_{SATA}$(GiB)</td>
<td>150</td>
</tr>
<tr>
<td>$m.Hub$</td>
<td>2</td>
</tr>
<tr>
<td>$m.PCIeversion$</td>
<td>2.0</td>
</tr>
<tr>
<td>$m.GraphicCard$</td>
<td>2</td>
</tr>
<tr>
<td>$m.GpuMem$(MiB)</td>
<td>512</td>
</tr>
<tr>
<td>$m.PCIeGraphicBw$ (Gbps)</td>
<td>80</td>
</tr>
<tr>
<td>$m.NetworkCard$</td>
<td>2</td>
</tr>
<tr>
<td>$m.PCIeNCBW$(Gbps)</td>
<td>2.5 and 10</td>
</tr>
<tr>
<td>$m.InfiniBandCard$</td>
<td>1</td>
</tr>
<tr>
<td>$m.PCIeICBW$(Gbps)</td>
<td>20</td>
</tr>
<tr>
<td>$m.lithography$(nm)</td>
<td>45</td>
</tr>
</tbody>
</table>

IOHub1 is connected to two SATA disks with 16 MiB cache and 150 GiB capacity each.

### 3.2.2 The Application

As presented, the aim of Part II of the thesis is, in principle, to create descriptive models that reflect the behaviour of the original Allura application. However, this is a huge step due to the complexity and plurality of the functions implemented in the Allura application. Therefore, we use synthetic benchmarks that produce notable resource usage on specific hardware components. Selecting a suitable program, $p_1$, is the first step, as seen in the activity diagram in Chapter 2.

Taking the performance metrics required for Case 1 into consideration, programs that stress the memory buses and generate significant data traffic to the memory are required. The applications used to generate load on the PC were:

- **Prime95 Test**: Prime95 is an application software that has a feature called Torture Test that allows maximum stress testing on the CPU and RAM. There are several options allowing the stress test to focus on the memory, processor, or a balance of both. This test was initially used and the memory bus traffic was observed. As the application fails to generate sufficient traffic on the memory bus, it is not used as a benchmark for creating the model.

- **STREAM Benchmark**: The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. This application generates sufficient traffic on the memory bus by working on datasets much larger than the available cache on any system. Therefore, we chose STREAM benchmark.

**STREAM**

- STREAM is intended to measure the bandwidth from main memory. It measures the performance of four long vector operations. These operations are listed in Table 3.2.

- The operations in Table 3.2 are representative of the building blocks of long vector operations. The array sizes are defined so that each array is larger than the cache of the machine to be tested, and the code is structured so that data re-use is not possible.

- Each of the four tests adds independent information to the results:
  - *Copy* measures transfer rates in the absence of arithmetic.
  - *Scale* adds a simple arithmetic operation.
  - *Sum* adds a third operand to allow multiple load/store ports on vector machines to be tested.
  - *Triad* allows chained/overlapped/fused multiply/add operations.

- The general rule [10] for STREAM is that each array must be at least 4 times the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger. For example, on the dual-socket PC with 8 MiB of $L_3$ cache that we used, each array needs $\max(4 \times 8\,\text{MiB} / 8\,\text{B}, 10^6)$ elements, which is $4 \times 2^{20}$ (about 4 million) elements.
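
The sizing rule can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and the 8-byte element size corresponds to the 8 B divisor used in the example above.

```python
MIB = 1024 * 1024


def min_stream_array_elements(total_llc_bytes: int, element_bytes: int = 8) -> int:
    """Smallest array length allowed by the rule: at least 4x the combined
    last-level cache size, or 1 million elements, whichever is larger."""
    # Ceiling division so the array is never smaller than 4x the cache.
    cache_rule = (4 * total_llc_bytes + element_bytes - 1) // element_bytes
    return max(cache_rule, 1_000_000)


# Dual-socket PC with 8 MiB of L3 cache in total.
print(min_stream_array_elements(8 * MIB))  # 4194304
```

For a machine with a small last-level cache, for instance 1 MiB, the cache term gives only 524288 elements and the 1 million element floor applies instead.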

<table>
<thead>
<tr>
<th>name</th>
<th>Vector Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td>COPY</td>
<td>$a(i)=b(i)$</td>
</tr>
<tr>
<td>SCALE</td>
<td>$a(i)=q \times b(i)$</td>
</tr>
<tr>
<td>SUM</td>
<td>$a(i)=b(i)+c(i)$</td>
</tr>
<tr>
<td>TRIAD</td>
<td>$a(i)=b(i)+q \times c(i)$</td>
</tr>
</tbody>
</table>
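
To make the arithmetic of the four kernels concrete, the following sketch applies them to tiny lists. It only illustrates the element-wise operations listed in the table; it does not measure bandwidth and is not the STREAM source code. The function name and the input values are hypothetical.

```python
from typing import Dict, List


def stream_kernels(b: List[float], c: List[float], q: float) -> Dict[str, List[float]]:
    """Return the result vector a for each kernel, computed element by element."""
    if len(b) != len(c):
        raise ValueError("b and c must have the same length")
    n = len(b)
    return {
        "COPY": [b[i] for i in range(n)],              # a(i) = b(i)
        "SCALE": [q * b[i] for i in range(n)],         # a(i) = q * b(i)
        "SUM": [b[i] + c[i] for i in range(n)],        # a(i) = b(i) + c(i)
        "TRIAD": [b[i] + q * c[i] for i in range(n)],  # a(i) = b(i) + q * c(i)
    }


results = stream_kernels(b=[1.0, 2.0], c=[3.0, 4.0], q=3.0)
for name, a in results.items():
    print(name, a)
```

Copy and Scale read one array and write one, whereas Sum and Triad read two arrays and write one, which is why the kernels load the memory bus differently.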

**STREAM Application steps**

The application traverses the specified usage scenario (array size and offset) from the start action to the stop action a specified number of times. The steps for one vector kernel operation (COPY), from the start action to the stop action, are summarized in Figure 3.4. The Scale, Sum and Triad vector kernel operations are performed in the same fashion. The peak memory bandwidth for every vector kernel operation is calculated. Note that STREAM uses 8 threads (the same as the number of cores on the PC chosen) in parallel for memory allocation and vector kernel operations. STREAM is a command-line application and can be tuned to accept user input for $N$, NTIMES and OFFSET, where

- $N$ indicates array size (number of elements in the array)

- NTIMES indicates the number of trials to be performed

- OFFSET is the parameter used to achieve peak bandwidth value by adjusting $N$ (not used in the context of this project)

As we are interested in performance modeling of the PC with different memory configurations, STREAM is used to measure memory bus traffic by choosing the array size carefully.

### 3.2.3 Assumptions and Decisions

There are several assumptions used and decisions made in the course of model creation and the use of accompanying tools. The most important ones are listed below.

- **Number of Sockets**: The analysis makes use of a dual-socket machine (IP PC). Metrics such as QPI traffic, which are required for creating the model (as will be seen later), can be captured only when both sockets are identified by the OS. This requires enabling Non-Uniform Memory Access (NUMA) in the BIOS settings of the PC.
- **Measurement Tool**:
  - The MeasureLoadCLI tool is used to capture resource usage metrics such as core usage, core $L_3$ cache misses, core IPC, and network bytes sent and received. The RUMM Tool can also be used to measure metrics such as socket usage, QPI traffic and memory bus traffic. However, it only gives coarse-level measurements, without low-level metrics such as core usage and cache hits and misses. Moreover, the timer used in the RUMM Tool supports a granularity of 1 s, whereas MeasureLoadCLI supports a granularity of 1 ms. Taking these factors into consideration, the MeasureLoadCLI tool is chosen for RUM.
  - Perfmon is used to collect Windows performance metrics, such as hard page faults, page reads and disk queue length, that are relevant to the model creation.

Refer to the activity diagram in Chapter 2 to see the stages in which the tools are used.

### 3.2.4 Model for latency

This section presents the details of the creation of descriptive models which can be used to predict/estimate the latency of an application. The latency of the selected application in this context is the total running time of the application. It can be defined as the total time spent on all the steps traversed between the start and the stop action. It is estimated in terms of the cache, memory and disk access time (if any). The part of the hardware model studied is highlighted in Figure 3.5 using a red dotted box. The parts highlighted in green are those metrics which are measurable. Note that there are many other metrics measured using Perfmon and MeasureLoadCLI tool which are not significant in the context of this part of the thesis.

The hardware and processor models are adopted from Figure 3.2 and Figure 3.3. The hardware specification of the PC is listed in Table 3.1.

The hardware model shown in Figure 3.5 can be studied from a high level using a redefined simplified representation. This includes the processors, L1, L2, L3, Memory and Disk. See Figure 3.6.

The metrics that are captured using MeasureLoadCLI are obtained using the Intel PCM 2.3. Let \( M(i, j) \) represent the number of level-\( j \) cache misses in core \( i \), \( H(i, j) \) the number of level-\( j \) cache hits in core \( i \), \( M(p, j) \) the number of level-\( j \) cache misses in processor (CPU) \( p \), and \( H(p, j) \) the number of level-\( j \) cache hits in processor \( p \). The intuitive interpretation of some terms of interest is then presented in Table 3.3.

\(^2\)Only the incoming QPI metrics are measurable on a Nehalem PC.

The interpretation of the terms rests on the intuitive assumption that the caches are exclusive\(^3\). However, the actual definitions of the counters used in the Intel PCM differ from those presented in Table 3.3. This difference follows from the cache organisation on a Nehalem PC.

$L_1$ Cache: At Level 1, separate instruction and data caches are part of the Nehalem core. The instruction and the data cache are each 32 KiB in size. The instruction and the data caches have 4-way and 8-way set associative organization, respectively.

$L_2$ Cache: Each core also contains a private, 256 KiB, 8-way set associative, unified level 2 cache (for both instructions and data). The write policy is write-back and the cache is non-inclusive.

$L_3$ Cache: The Level 3 cache is a unified, 16-way set associative, 8 MiB inclusive cache shared by all four cores on the chip.

\(^3\)In an exclusive cache, data is guaranteed to be in at most one of the cache levels.

Table 3.3: Interpreting cache terminology: $p = 1, 2$ and $i = 1, \ldots, 8$ on the PC selected

<table>
<thead>
<tr>
<th>Term</th>
<th>Formula</th>
<th>Interpretation</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_3$misses</td>
<td>$\sum_{p=1}^{2} M(p, 3)$</td>
<td>Number of $L_3$ cache requests that result in a miss in the $L_3$ cache only</td>
</tr>
<tr>
<td>$L_2$misses</td>
<td>$\sum_{i=1}^{8} M(i, 2)$</td>
<td>Number of $L_2$ cache requests that result in a miss in the $L_2$ cache only</td>
</tr>
<tr>
<td>$L_1$misses</td>
<td>$\sum_{i=1}^{8} M(i, 1)$</td>
<td>Number of $L_1$ cache requests that result in a miss in the $L_1$ cache only</td>
</tr>
<tr>
<td>$L_3$hits</td>
<td>$\sum_{p=1}^{2} H(p, 3)$</td>
<td>Number of requests to $L_3$ cache that result in a hit in the $L_3$ cache only</td>
</tr>
<tr>
<td>$L_2$hits</td>
<td>$\sum_{i=1}^{8} H(i, 2)$</td>
<td>Number of requests to $L_2$ cache that result in a hit in the $L_2$ cache only</td>
</tr>
<tr>
<td>$L_1$hits</td>
<td>$\sum_{i=1}^{8} H(i, 1)$</td>
<td>Number of requests to $L_1$ cache that result in a hit in the $L_1$ cache only</td>
</tr>
</tbody>
</table>

In order to understand the definitions used by Intel PCM, we need to distinguish between the nature of the cache used. Contrary to the assumption, the $L_3$ cache is inclusive (unlike $L_1$ and $L_2$), meaning that a cache line that exists in either $L_1$ data or instruction, or the $L_2$ unified caches, also exists in $L_3$ cache. The $L_3$ is designed to use the inclusive nature to minimize snooping traffic between processor cores and processor sockets. A cache block in use by a core in a socket is cached by its $L_3$ cache which can respond to snoop requests by other chips, without disturbing (snooping into) $L_2$ or $L_1$ caches on the same chip.

From Figure 3.7, it can be seen that if the cache is exclusive, then a miss in the $L_3$ cache needs a check in the other core’s $L_2$ cache as $L_3$ does not duplicate the contents of $L_2$ caches of individual cores [12, 13]. On the contrary, $L_3$ cache miss on an inclusive cache guarantees that the data is not on the die because the contents of $L_2$ are duplicated in $L_3$. Note that non-inclusive and exclusive do not imply the same meaning.

See Table 3.4 for the definitions from PCM.

The Intel PCM does not provide counters that measure $L_1$ cache metrics. As $L_1$ cache metrics are significant in the context of creating models, they are used in the model. $L_1$HitRatio is assumed and $L_1$Hit, $L_1$Miss are estimated. The definitions of $L_1$ cache metrics used in the model are listed in Table 3.5.

---

\(^4\)For any core, a sibling core is a core on the same die or socket.

\(^5\)$i = 1, \ldots, 4$ indicates the number of cores in a processor; $p = 1, 2$ indicates the number of physical processors.

\(^6\)The processors have the ability, called snooping, to eavesdrop on the address bus for other processors' accesses to system memory and to their internal caches. They use this snooping ability to keep their internal caches consistent both with system memory and with the caches in other interconnected processors.

\(^7\)In a non-inclusive cache, there is no enforcement of cache inclusion or exclusion. A cache line in an inner cache may or may not be in the outer cache.

Table 3.4: Hardware Counters from Intel PCM

<table>
<thead>
<tr>
<th>Counter</th>
<th>Formula</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_2$HitM</td>
<td>for $p = 1, 2$ { for $i = 1, \ldots, 4$ { $\sum_{k \neq i} H(k, 2)$ } }</td>
<td>Total number of $L_2$ cache requests to every core that result in a hit in a sibling\(^4\) $L_2$ cache in every processor\(^5\) ($L_3$, being inclusive, includes this hit)</td>
</tr>
<tr>
<td>$L_3$UnsharedHit</td>
<td>-</td>
<td>Number of $L_3$ cache requests that result in a hit in the $L_3$ cache with no snooping required</td>
</tr>
<tr>
<td>$L_2$Hit</td>
<td>$L_2$hits</td>
<td>Number of $L_2$ cache requests that result in a hit in the $L_2$ cache only</td>
</tr>
<tr>
<td>$L_3$Hit</td>
<td>$L_2$HitM + $L_3$UnsharedHit</td>
<td>Sum of the total number of $L_2$ cache misses that result in a hit in a sibling core’s $L_2$ cache ($L_3$, being inclusive, has a copy) and the number of hits in the $L_3$ cache that require no snooping</td>
</tr>
<tr>
<td>$L_3$Miss</td>
<td>$L_3$misses</td>
<td>Number of $L_3$ cache requests that do not result in a hit in the $L_3$ cache</td>
</tr>
<tr>
<td>$L_2$Miss</td>
<td>$L_2$HitM + $L_3$UnsharedHit + $L_3$Miss</td>
<td>Total number of $L_2$ cache requests to every core that do not result in a hit in the same core’s $L_2$ cache. $L_2$Miss $\neq$ $L_2$misses</td>
</tr>
<tr>
<td>$L_2$HitRatio</td>
<td>$\frac{L_2\text{Hit}}{L_2\text{Hit} + L_2\text{Miss}}$</td>
<td>Ratio of the total number of requests to the $L_2$ cache that result in a hit in the $L_2$ cache to the total number of requests to the $L_2$ cache</td>
</tr>
<tr>
<td>$L_3$HitRatio</td>
<td>$\frac{L_3\text{Hit}}{L_3\text{Hit} + L_3\text{Miss}}$</td>
<td>Ratio of the total number of requests that result in a hit in the $L_3$ cache to the total number of requests to the $L_3$ cache</td>
</tr>
</tbody>
</table>

The Venn diagram representation of the metrics relevant to the model (and used in PCM) is shown in Figure 3.8. Note that the diameters of the circles only indicate the relative magnitudes of the metrics obtained from STREAM. (They could differ for other applications.)
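
The relations between the counters in Table 3.4 and the $L_1$ metrics in Table 3.5 can be sketched as below. The counter values are made up, and the class and function names are hypothetical. The estimate of $L_1$Hit is obtained by inverting the $L_1$HitRatio formula for an assumed ratio; this is one possible reading of "assumed and estimated" and not necessarily the exact procedure used in the thesis.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PcmCounters:
    """Counters taken as given in Table 3.4 (values are supplied by the caller)."""
    l2_hit: float           # L2Hit
    l2_hitm: float          # L2HitM
    l3_unshared_hit: float  # L3UnsharedHit
    l3_miss: float          # L3Miss


def derive_cache_metrics(c: PcmCounters, assumed_l1_hit_ratio: float) -> Dict[str, float]:
    if not 0.0 <= assumed_l1_hit_ratio < 1.0:
        raise ValueError("assumed_l1_hit_ratio must be in [0, 1)")
    l3_hit = c.l2_hitm + c.l3_unshared_hit   # L3Hit = L2HitM + L3UnsharedHit
    l2_miss = l3_hit + c.l3_miss             # L2Miss = L2HitM + L3UnsharedHit + L3Miss
    l1_miss = c.l2_hit + l2_miss             # L1Miss = L2Hit + L2Miss
    # Invert L1HitRatio = L1Hit / (L1Hit + L1Miss) for the assumed ratio.
    l1_hit = assumed_l1_hit_ratio * l1_miss / (1.0 - assumed_l1_hit_ratio)
    return {
        "L3Hit": l3_hit,
        "L2Miss": l2_miss,
        "L2HitRatio": c.l2_hit / (c.l2_hit + l2_miss),
        "L3HitRatio": l3_hit / (l3_hit + c.l3_miss),
        "L1Miss": l1_miss,
        "L1Hit": l1_hit,
    }


# Hypothetical counter values, for illustration only.
counters = PcmCounters(l2_hit=600.0, l2_hitm=50.0, l3_unshared_hit=150.0, l3_miss=200.0)
for name, value in derive_cache_metrics(counters, assumed_l1_hit_ratio=0.9).items():
    print(name, value)
```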

Figure 3.7: Exclusive versus Inclusive cache (Semin, 2009)

Table 3.5: $L_1$ cache metrics

<table>
<thead>
<tr>
<th>Metric</th>
<th>Formula</th>
<th>Interpretation</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_1$Hit</td>
<td>$L_1$hits</td>
<td>Number of $L_1$ cache requests that result in a hit in the $L_1$ cache only</td>
</tr>
<tr>
<td>$L_1$Miss</td>
<td>$L_2$Hit + $L_2$Miss</td>
<td>Total number of $L_1$ cache requests that do not result in a hit in the $L_1$ cache</td>
</tr>
<tr>
<td>$L_1$HitRatio</td>
<td>$\frac{L_1\text{Hit}}{L_1\text{Hit} + L_1\text{Miss}}$</td>
<td>Ratio of the total number of $L_1$ cache hits to the total number of requests to the $L_1$ cache</td>
</tr>
</tbody>
</table>

Figure 3.8: Metrics used in the model represented using Venn Diagram

In order to estimate the latency of the application, it is important to determine the access times to the cache and memory organisation. The access times to caches are not constant values [14][12][15]. They depend on various factors such as core and un-core frequencies and the type of access (remote/local, with snooping/without snooping). The following observations and study [14] form the basis for creating this model. See Figure 3.9.

**Local Memory Access:** The steps to access the local memory block are as follows:

1. Proc0 requests a cache line which is not in its \( L_1, L_2, L_3 \) cache.
   - Proc0 requests data from its DRAM
   - Proc0 snoops Proc1 to check if data is present there.
2. Response
   - local DRAM returns data
   - Proc1 returns snoop response
   - Proc0 installs block in its \( L_3, L_2, L_1 \) cache and retrieves target memory word.

**Remote Memory Access:** The steps to access the remote memory block are the following:

1. Proc0 requests a cache line which is not in Proc0’s \( L_1, L_2, L_3 \) cache
2. Request sent over QPI to Proc1
3. Proc1 probes for the cache line
   - Proc1 makes requests to its own DRAM
   - Proc1 snoops internal caches
4. Response
   - Data block returns to Proc0 via the QPI
   - Proc0 installs cache block in \( L_3, L_2, L_1 \)

![Figure 3.9: Nehalem Memory Access Sequence (Thomakadis, 2011)](image)

From the steps mentioned, it can be inferred that the access latency is a function of the QPI latency and the number of local and remote accesses to memory and caches. The cache coherency protocol messages (snooping) among the multiple sockets are exchanged over the QPI. We have taken these factors into consideration and modelled the cache and memory access times as a function of QPI traffic and of local and remote memory accesses.

The access times to $L_2, L_3$ and memory are calculated and denoted as $t_{hit2}, t_{hit3}, t_{mem}$. Looking back at Figure 2.6 using STREAM as $p_1$, descriptive models are created as a function of the profile of metrics created on the machine and the hardware model of the machine. The set of models $d_1, \ldots, d_4 \in D$ are presented.

\[
d_1 = F(\{QPI_{sock}, M_r, M_w\}, \hat{H}m) : \\
RemoteAccessRatio(RAR) = \frac{QPI_{sock}}{M_r + M_w} \tag{3.1}
\]

\[
d_2 = F(\{RAR, t_{2remote}, t_{2local}\}, \hat{H}m) : \\
t_{hit2} = RAR \times t_{2remote} + (1 - RAR) \times t_{2local} \tag{3.2}
\]

\[
d_3 = F(\{RAR, t_{3remote}, t_{3local}\}, \hat{H}m) : \\
t_{hit3} = RAR \times t_{3remote} + (1 - RAR) \times t_{3local} \tag{3.3}
\]

\[
d_4 = F(\{RAR, t_{memremote}, t_{memlocal}\}, \hat{H}m) : \\
t_{mem} = RAR \times t_{memremote} + (1 - RAR) \times t_{memlocal} \tag{3.4}
\]

where,

- $t_{2remote}, t_{3remote}, t_{memremote}$ are the access times to remote $L_2, L_3$, DRAM respectively.
- $t_{2local}, t_{3local}, t_{memlocal}$ are the access times to local $L_2, L_3$, DRAM respectively.
- $M_r$ indicates the total amount of traffic read from the local and the remote DRAM by the system.
- $M_w$ indicates the total amount of traffic written into the local and the remote DRAM by the system.
- $QPI_{sock}$ indicates the total traffic between the sockets (CPUs) via the QPI link.

The RAR is estimated using the principle that all remote accesses happen via the QPI.
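The models $d_1, \ldots, d_4$ can be illustrated with a minimal Python sketch. The function names and the traffic counter values below are hypothetical and chosen only so that RAR evaluates to 0.5; the local and remote latencies (in ns) are the values assumed later in Table 3.8. The sketch does not clamp RAR to the range [0, 1], which a real implementation might want to do.

```python
def remote_access_ratio(qpi_sock: float, m_r: float, m_w: float) -> float:
    """Equation 3.1: QPI traffic divided by total DRAM read + write traffic."""
    total = m_r + m_w
    if total <= 0:
        raise ValueError("total memory traffic must be positive")
    return qpi_sock / total


def blended_access_time(rar: float, t_remote: float, t_local: float) -> float:
    """Equations 3.2-3.4: RAR-weighted mix of remote and local access times."""
    return rar * t_remote + (1.0 - rar) * t_local


# Hypothetical traffic counters (all three in the same unit).
rar = remote_access_ratio(qpi_sock=35.0, m_r=50.0, m_w=20.0)

# Local and remote latencies in ns, taken from Table 3.8.
t_hit2 = blended_access_time(rar, t_remote=46.0, t_local=6.0)
t_hit3 = blended_access_time(rar, t_remote=60.0, t_local=20.0)
t_mem = blended_access_time(rar, t_remote=120.0, t_local=65.0)

print(rar, t_hit2, t_hit3, t_mem)
```

With RAR = 0.5 each blended access time is simply the midpoint between the local and the remote latency, which shows how strongly QPI traffic influences the effective access times.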

A cache miss in $L_3$ in one socket finds the data in another socket’s $L_3$ cache (inclusive) or in DRAM (local or remote). We have also considered the worst-case scenario in which the $L_2$ cache lines are modified and therefore also require snooping, and have thus included a remote access latency for the $L_2$ cache. The access time to $L_1$ is assumed to be constant [15], as $L_3$, being inclusive, accounts for the snooping requests of the $L_1$ cache. The access time to disk (in ms) is extremely high when compared to the QPI latencies (in ns). So, we ignore the QPI latencies when determining the disk latency.
The access times to $L_1$ and disk are denoted by $t_{hit1}$ and $t_{disk}$ respectively. Taking the metrics presented in Table 3.5 and Table 3.4 and the models $d_1, \ldots, d_4$ into consideration, the model used to calculate the total latency of the application on a PC with a hardware model $\hat{H}m$ is $d_{10} \in D$ and represented using Equation 3.5.

\[
d_{10} = F(d_1, \ldots, d_9, \hat{H}m) : \\
T_{tot} = t_{hit1} \cdot L_1\text{Hit} + t_{hit2} \cdot L_2\text{Hit} + t_{hit3} \cdot L_3\text{Hit} + t_{mem} \cdot \text{MemHit} + t_{disk} \cdot \text{DiskHit} \tag{3.5}
\]

where, $d_5, \ldots, d_9 \in D$ are presented using the following Equations.

\[
d_5 = F(\text{PageReads}, \hat{H}m) : \\
\text{DiskHit} = \text{PageReads} \tag{3.6}
\]

\[
d_6 = F(L_3\text{Miss}, \hat{H}m) : \\
\text{MemHit} = L_3\text{Miss} \tag{3.7}
\]

\[
d_7 = F(\{L_3\text{HitRatio}, L_3\text{Miss}\}, \hat{H}m) : \\
L_3\text{Hit} = \frac{L_3\text{HitRatio} \cdot L_3\text{Miss}}{1 - L_3\text{HitRatio}} \tag{3.8}
\]

\[
d_9 = F(\{L_2\text{HitRatio}, L_2\text{Miss}\}, \hat{H}m) : \\
L_2\text{Hit} = \frac{L_2\text{HitRatio} \cdot L_2\text{Miss}}{1 - L_2\text{HitRatio}} \tag{3.9}
\]

\[
d_8 = F(\{L_1\text{HitRatio}, L_1\text{Miss}\}, \hat{H}m) : \\
L_1\text{Hit} = \frac{L_1\text{HitRatio} \cdot L_1\text{Miss}}{1 - L_1\text{HitRatio}} \tag{3.10}
\]

**DiskHit:** This represents the number of disk accesses. It can be estimated using the number of Page Reads/s, a memory counter measured using Perfmon. This counter indicates the number of read operations that the disk had to perform to retrieve faulted pages. (One of the scenarios in which this is most likely to happen is when the system runs out of memory and accesses the disk.)

**MemHit:** This represents the number of accesses that find data in memory. It is assumed that all $L_3$Miss result in memory hits.
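Equations 3.5 to 3.10 can be combined into a short Python sketch. The function and parameter names are hypothetical, and all input values in the example call are invented for illustration rather than measured. The sketch assumes every access time is expressed in nanoseconds, so the disk access time of 2 ms from Table 3.8 is passed as $2 \times 10^6$ ns, and it takes $L_1$Miss $= L_2$Hit $+ L_2$Miss from Table 3.5.

```python
def hits_from_ratio(hit_ratio: float, misses: float) -> float:
    """Equations 3.8-3.10: Hit = HitRatio * Miss / (1 - HitRatio)."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit ratio must lie in [0, 1)")
    return hit_ratio * misses / (1.0 - hit_ratio)


def total_latency_ns(l1_ratio, l2_ratio, l3_ratio, l2_miss, l3_miss,
                     page_reads, t_hit1, t_hit2, t_hit3, t_mem, t_disk):
    """Equation 3.5 with all access times given in nanoseconds."""
    l3_hit = hits_from_ratio(l3_ratio, l3_miss)    # Equation 3.8
    l2_hit = hits_from_ratio(l2_ratio, l2_miss)    # Equation 3.9
    l1_miss = l2_hit + l2_miss                     # Table 3.5
    l1_hit = hits_from_ratio(l1_ratio, l1_miss)    # Equation 3.10
    mem_hit = l3_miss                              # Equation 3.7
    disk_hit = page_reads                          # Equation 3.6
    return (t_hit1 * l1_hit + t_hit2 * l2_hit + t_hit3 * l3_hit
            + t_mem * mem_hit + t_disk * disk_hit)


# Invented counter values; access times as in the RAR = 0.5 case.
t_tot = total_latency_ns(
    l1_ratio=0.9, l2_ratio=0.75, l3_ratio=0.5,
    l2_miss=200.0, l3_miss=100.0, page_reads=1.0,
    t_hit1=2.0, t_hit2=26.0, t_hit3=40.0, t_mem=92.5, t_disk=2.0e6,
)
print(f"T_tot = {t_tot / 1e6:.3f} ms")
```

Even a single page read dominates the total in this example, which is consistent with the observation that disk latencies (ms) dwarf cache and QPI latencies (ns).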

Looking back at Figure 2.6, the next step is to collect measurements of hardware metrics on selected hardware platforms by conducting experiments. These metrics are used to predict relevant (e.g., runtime) parameters based on just the application properties.
3.3 Experiments and Results-I

Ralph Waldo Emerson wrote,

“All life is an experiment. The more experiments you make, the better”

The experiments are conducted to collect measurements of hardware metrics to create the models presented in Section 3.2.4 and verify their correctness. In this section, the experiments conducted for case 1 are described: a series of experiments for different memory configurations, namely 1 GiB, 2 GiB, 4 GiB and 6 GiB.

The generic sequence of steps followed from creating the models to verifying them is illustrated in Figure 3.10.

Figure 3.10: Creating models to verifying models: Sequence of steps. The notes (in green) name the action completed in each step.

An instantiation of the sequence of steps is done for model $d_{10}$.

Figure 3.11: Creating model to verifying model, d10: Steps

The rest of this section is organised to present individual actions performed in Figure 3.10.

3.3.1 Measurements

The models $d_1, \ldots, d_{10}$ are created using the measurements from experiments and analysis done in incremental steps. Therefore, the action, model creation shown in Figure 3.10 is not a single step action, but includes a number of steps within itself. The models created are presented in Section 3.2.4.

The next step in Figure 3.10 is collecting measurements of hardware metrics using the MeasureLoadCLI and Perfmon Tools. The application chosen is STREAM for reasons mentioned in Section 3.2.2.

The experiments are conducted on an IP PC with the hardware specifications presented in Table 3.1. The STREAM application is run for various array sizes ranging from 1 million elements to 80 million elements. The measurement tools, MeasureLoadCLI and Perfmon, are run concurrently when STREAM is launched. The parameters used in the experiments are listed in Table 3.6. The OS used on the PC is Windows XP.

Table 3.6: Application and Measurement Tool specifications: These are the user input values chosen during the experiments to obtain sufficient samples for creating the model.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>NTIMES</td>
<td>100</td>
</tr>
<tr>
<td>OFFSET</td>
<td>0</td>
</tr>
<tr>
<td>Time interval between samples during measurement (s)</td>
<td>1</td>
</tr>
</tbody>
</table>

The metrics measured for the series of experiments conducted are listed in Table 3.7. The measured values of the metrics are tabulated in Tables 3.21, 3.22, 3.23, 3.24.

Table 3.7: Measured Metrics from the tools

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_2$HitRatio</td>
<td></td>
</tr>
<tr>
<td>$L_3$HitRatio</td>
<td></td>
</tr>
<tr>
<td>$L_3$Miss</td>
<td></td>
</tr>
<tr>
<td>$L_2$Miss</td>
<td></td>
</tr>
<tr>
<td>DiskHit</td>
<td></td>
</tr>
<tr>
<td>MemHit</td>
<td></td>
</tr>
<tr>
<td>$M_r$</td>
<td></td>
</tr>
<tr>
<td>$M_w$</td>
<td></td>
</tr>
<tr>
<td>$QPI_{sock}$</td>
<td></td>
</tr>
</tbody>
</table>

3.3.2 Assumptions and Calculations

This section describes the metrics that are assumed and the metrics that are calculated from assumed and measured metrics. It also describes the rationale behind the choice of the important assumed metrics.

The access times to the local and remote caches and memory for a PC with the specifications mentioned in Table 3.1 are shown in Table 3.8. The values of the access latencies chosen are based on Intel and Nehalem literature sources [14, 12, 15]. Although these values are assumed, they are kept constant for all array sizes chosen.

Access time to remote memory by a processor is approximately 1.6 to 1.7 times the access time to local memory [14]. The additional latency to remote caches is due to the QPI latency, which is approximately equal to 40 ns [14, 12]. This is used to calculate the access times to the remote caches, $L_2$ and $L_3$, and memory. For example, $t_{3remote} = t_{3local} + \text{QPI latency} = 20 + 40 = 60$ ns, and similarly for $L_2$. The access times to local caches and memory chosen and the access times to remote caches and memory calculated are used to calculate the total access times to the $L_2$ and $L_3$ caches and memory, $t_{hit2}$, $t_{hit3}$, $t_{mem}$, for all array sizes chosen. The values of the access times are tabulated for all array sizes in Tables 3.21, 3.22, 3.23, 3.24.

Table 3.8: Access Times to Cache and Memory

<table>
<thead>
<tr>
<th>Cache Level / Memory</th>
<th>Local</th>
<th>Remote</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_1$ (ns)</td>
<td>$t_{hit1} = 2$</td>
<td>$t_{hit1} = 2$</td>
</tr>
<tr>
<td>$L_2$ (ns)</td>
<td>$t_{2local} = 6$</td>
<td>$t_{2remote} = 46$</td>
</tr>
<tr>
<td>$L_3$ (ns)</td>
<td>$t_{3local} = 20$</td>
<td>$t_{3remote} = 60$</td>
</tr>
<tr>
<td>DRAM (ns)</td>
<td>$t_{memlocal} = 65$</td>
<td>$t_{memremote} = 120$</td>
</tr>
<tr>
<td>Disk (ms)</td>
<td>$t_{disk} = 2$</td>
<td>$t_{disk} = 2$</td>
</tr>
</tbody>
</table>

The value of $L_1$HitRatio is not measured by the tools used, so it is assumed. The value chosen is not constant: it ranges from 0.95 to 0.99 depending on the array size. A value of at least 0.95 is chosen because, generally on any PC, the $L_1$HitRatio is close to 1. A very low $L_1$HitRatio indicates that the application does not use the cache effectively. The value chosen for each array size is the one that gives the best regression fit for every experiment. Looking back at Figure 3.11, the value of $L_1$HitRatio chosen thus becomes a part of the curve fitting for the models created.

The choice of $L_1$HitRatio for various array sizes is shown in Figure 3.12. From Figure 3.12, it is seen that the value chosen varies for larger array sizes. The larger the array size, the more memory is required. The memory required is calculated as:

$$\text{Maximum number of operands required for the vector kernel operation} \times \text{Array Size} \times \text{Number of Bytes per array element}$$

For example, for an ADD operation with array size of one million elements, the memory required is $3 \times 10^6 \times 8 = 22.9 \text{ MiB}$.
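The memory requirement formula can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the default of 8 bytes per element is taken from the worked ADD example above.

```python
MIB = 2 ** 20  # bytes per MiB


def memory_required_mib(operands: int, array_size: int,
                        bytes_per_element: int = 8) -> float:
    """Operands x array size x bytes per element, converted to MiB."""
    return operands * array_size * bytes_per_element / MIB


# ADD uses three arrays; one million elements per array.
add_mib = memory_required_mib(operands=3, array_size=1_000_000)
print(f"{add_mib:.1f} MiB")  # prints "22.9 MiB"
```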

The access time to disk ranges from 1 ms to 100 ms. We chose 2 ms because this value gives the best regression fit in the model for all memory configurations chosen.

RAR is calculated using the formula represented by model $d_1$.

The metrics assumed and calculated and used in models, $d_1, \ldots d_{10}$ are listed in Table 3.9.
3.3.3 Prediction and Verification

This section details the prediction of a metric on selected hardware models and the verification of the models.

The hardware models for the four selected memory configurations are denoted as $\hat{H}m_1$, $\hat{H}m_2$, $\hat{H}m_3$, $\hat{H}m_4$ for the 1 GiB, 2 GiB, 4 GiB and 6 GiB memory configurations respectively on the selected machine $m$.

The metrics measured (listed in Table 3.7) and the metrics assumed and calculated (listed in Table 3.9) are used to create profiles $K1$, $K2$, $K3$, $K4$ for the hardware models. The profiles are used to predict the metric $T_{tot}$ for all the hardware models.

The mathematical correctness of the predicted metric for every hardware model is verified by comparing it with the measured metric $t_{app}$. $t_{app}$ is the time duration between the start and the stop action of the application, measured using a timer in the MeasureLoadCLI tool. The predicted value of $T_{tot}$ estimated using Equation 3.5 is compared with $t_{app} \notin K1, K2, K3, K4$ for all hardware models using Equation 3.11.

\[ \left| T_{tot} - t_{app} \right| \leq \text{error} \tag{3.11} \]

In order to determine if the predicted ($T_{tot}$) and measured ($t_{app}$) values of latencies are comparable, the error percentage is calculated as follows:

\[ \text{error(\%)} = \frac{T_{tot} - t_{app}}{t_{app}} \times 100 \tag{3.12} \]
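A minimal Python sketch of this verification step is given below. The function names and the example latencies are hypothetical; the default tolerance of 1 % is the bound that the reflection later in this section reports for the valid experiments.

```python
def error_percent(t_tot: float, t_app: float) -> float:
    """Equation 3.12: signed relative error of the prediction in percent."""
    if t_app <= 0:
        raise ValueError("measured latency must be positive")
    return (t_tot - t_app) / t_app * 100.0


def is_comparable(t_tot: float, t_app: float,
                  tolerance_percent: float = 1.0) -> bool:
    """Equation 3.11, expressed as a bound on the absolute error percentage."""
    return abs(error_percent(t_tot, t_app)) < tolerance_percent


# Invented example values (same time unit for both arguments).
print(error_percent(10.05, 10.0), is_comparable(10.05, 10.0))
```

The sign of the error is kept so that over-prediction and under-prediction can be told apart; only the comparison uses the absolute value.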

Looking back at Figure 3.11, the prediction and verification steps are carried out as described in this section.

**Reflection**

The predicted values of the metric $T_{tot}$ and the error percentages calculated by comparing $T_{tot}$ with $t_{app}$ for all array sizes are tabulated in Tables 3.21, 3.22, 3.23, 3.24. These values lead to a number of useful observations when the different memory configurations are compared. They are discussed in detail in this section.

In Table 3.10 for $\hat{H}m_1$,

- $m.MemBw$ is calculated as Memory Frequency × Number of bytes/transfer × Number of channels, i.e. $800 \times 10^6$ transfers/s × 8 bytes/transfer × 1 channel = 6.4 GB/s. This is the theoretical bandwidth for one socket. $m.Sock1Mem$ and $m.Sock2Mem$ indicate the amount of memory available to node 1 (socket 1) and node 2 (socket 2) respectively. These are represented separately because of NUMA.

Similarly, metrics for other hardware models are also specified in Table 3.10.
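The bandwidth calculation can be reproduced for all four hardware models with the following Python sketch. It assumes that the memory frequency of 800 is read as 800 million transfers per second; the function name and the dictionary of channel counts are hypothetical helpers, with the channel counts taken from Table 3.10.

```python
def memory_bandwidth_gb_s(mega_transfers_per_s: float,
                          bytes_per_transfer: int,
                          channels: int) -> float:
    """Theoretical bandwidth: frequency x bytes/transfer x channels, in GB/s."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000.0


# Number of memory channels per hardware model (m.MemChannel in Table 3.10).
channels_per_model = {"Hm1": 1, "Hm2": 1, "Hm3": 2, "Hm4": 3}

bandwidths = {
    name: memory_bandwidth_gb_s(800, 8, channels)
    for name, channels in channels_per_model.items()
}
print(bandwidths)
```

Under this reading, the computed values agree with the m.MemBw entries of Table 3.10 (6.4, 6.4, 12.8 and 19.2 GB/s).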
Table 3.10: Hardware models selected: The metrics are calculated from the specification sheet of the PC and using the SiSoftware Sandra tool

<table>
<thead>
<tr>
<th>Hardware Model</th>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>$Hm_1$</td>
<td>m.Mem (GiB)</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td>m.MemChannel</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td>m.MemBw(GB/s)</td>
<td>6.4</td>
</tr>
<tr>
<td></td>
<td>m.Sock1Mem(MiB)</td>
<td>537</td>
</tr>
<tr>
<td></td>
<td>m.Sock2Mem(MiB)</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>m.QPIBw(between sockets)(GB/s)</td>
<td>9.6</td>
</tr>
<tr>
<td>$Hm_2$</td>
<td>m.Mem (GiB)</td>
<td>2</td>
</tr>
<tr>
<td></td>
<td>m.MemChannel</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td>m.MemBw(GB/s)</td>
<td>6.4</td>
</tr>
<tr>
<td></td>
<td>m.Sock1Mem(MiB)</td>
<td>582</td>
</tr>
<tr>
<td></td>
<td>m.Sock2Mem(MiB)</td>
<td>861</td>
</tr>
<tr>
<td></td>
<td>m.QPIBw(between sockets)(GB/s)</td>
<td>9.6</td>
</tr>
<tr>
<td>$Hm_3$</td>
<td>m.Mem (GiB)</td>
<td>4</td>
</tr>
<tr>
<td></td>
<td>m.MemChannel</td>
<td>2</td>
</tr>
<tr>
<td></td>
<td>m.MemBw(GB/s)</td>
<td>12.8</td>
</tr>
<tr>
<td></td>
<td>m.Sock1Mem(MiB)</td>
<td>1760</td>
</tr>
<tr>
<td></td>
<td>m.Sock2Mem(MiB)</td>
<td>410</td>
</tr>
<tr>
<td></td>
<td>m.QPIBw(between sockets)(GB/s)</td>
<td>9.6</td>
</tr>
<tr>
<td>$Hm_4$</td>
<td>m.Mem (GiB)</td>
<td>6</td>
</tr>
<tr>
<td></td>
<td>m.MemChannel</td>
<td>3</td>
</tr>
<tr>
<td></td>
<td>m.MemBw(GB/s)</td>
<td>19.2</td>
</tr>
<tr>
<td></td>
<td>m.Sock1Mem(MiB)</td>
<td>2130</td>
</tr>
<tr>
<td></td>
<td>m.Sock2Mem(MiB)</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>m.QPIBw(between sockets)(GB/s)</td>
<td>9.6</td>
</tr>
</tbody>
</table>

An important observation from Table 3.10 is that the memory available is always less than the total memory physically installed on the system. The $m.Sock2Mem$ for $Hm_1$ is zero because no RAM was installed on socket 2. However, $m.Sock2Mem$ for $Hm_4$ is surprisingly found to be zero even though 3 GiB of RAM is installed on socket 2. This can be explained by the nature of the OS. A 32-bit Windows OS can address at most $2^{32}$ bytes $= 4$ GiB of memory. Therefore, even though $Hm_4$ has 6 GiB of memory installed, the OS only uses the memory installed on socket 1 (3 GiB). This is supported by the measurements from the MeasureLoadCLI tool. The metrics $M_r$ and $M_w$ for the individual sockets are measured and plotted in Figure 3.13. It can be seen that socket 2 does not record any traffic to the memory, as it has no memory allocated.

Figure 3.13: Individual socket reads and writes in $Hm_1$ and $Hm_4$: Shows that Socket 2 does not record any memory bus traffic

Figure 3.14: Latency of STREAM application for $Hm_1$, $Hm_2$, $Hm_3$, $Hm_4$: The y-axis is the log of the values measured and predicted.

$t_{app}$ and $T_{tot}$ tabulated in Table 3.21, Table 3.22, Table 3.23, Table 3.24 are graphically represented in Figure 3.14. In Figure 3.14a and Figure 3.14b, the latencies are plotted against the memory required for the vector kernel operations listed in Table 3.2.

From Figure 3.14 it can be seen that the estimated latency compares very well with the measured latency. There is a drastic increase in the latency of the application when the memory required is greater than 80 % of the total memory deployed on every hardware model. This can be demonstrated using Figure 3.15. In Figure 3.15, the estimated latency of the application is plotted against the % of memory required. The % of memory required is the total memory required divided by the total memory as seen by the OS. For example, for $Hm_1$, the % of memory required for 1 million elements is $22.9 \text{ MiB}/1 \text{ GiB} = 2.24\%$. It is seen that the percentage of memory used for the same memory requirements is higher for $Hm_4$ than for $Hm_3$. This factor is important from the point of view of the performance comparison discussed in a later section. For $Hm_4$, the total amount of memory seen by the OS is less than 3 GiB. This accounts for the higher percentage of memory required for this hardware model than for $Hm_3$. Figure 3.15 shows that, for experiments with the same array sizes, the % of memory required is never greater than 45 % for $Hm_3$ (4 GiB) but exceeds 80 % for $Hm_4$ (6 GiB).
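The percentage of memory required and the 80 % limit can be expressed in a small Python sketch. The function names are hypothetical, and the second example value (900 MiB) is invented purely to show a case beyond the limit.

```python
def percent_memory_required(required_mib: float, visible_mib: float) -> float:
    """Memory required as a percentage of the memory seen by the OS."""
    if visible_mib <= 0:
        raise ValueError("memory seen by the OS must be positive")
    return required_mib / visible_mib * 100.0


def below_limit(percent: float, limit: float = 80.0) -> bool:
    """True when the memory required stays under the 80 % threshold."""
    return percent < limit


# Example from the text: 22.9 MiB required, 1 GiB (1024 MiB) visible.
pct = percent_memory_required(22.9, 1024.0)
print(f"{pct:.2f} %", below_limit(pct))

# Invented counter-example: 900 MiB required with 1024 MiB visible.
print(below_limit(percent_memory_required(900.0, 1024.0)))
```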

From Table 3.21, Table 3.22, Table 3.23, Table 3.24, the rows highlighted in pink show that the modeling is valid only for those array sizes where the % of memory required is less than 80 %. The absolute value of the error percentages for valid experiments is less than 1 %. This satisfies Equation 3.11 because $T_{tot}$ and $t_{app}$ are comparable to each other within an absolute error of 1 %. This verifies the predictive quality of the models $d_1, \ldots, d_{10}$ for all hardware models selected. Looking back at Figure 3.11, all the steps shown are completed and the predictive quality of $d_{10}$ is verified. $d_{10}$, and thus $d_1, \ldots, d_{9}$, can now be formalized. The set of models can be validated using an entirely different machine or architecture, as illustrated in Figure 2.7.

3.3.4 Performance Evaluation using models

As the predictive quality of the models created has been verified, we use the models to compare the performance of the PC with different memory configurations. When the memory installed on $m$ is increased in steps (from 1 GiB to 2 GiB, 4 GiB and 6 GiB), we intuitively expect to see a performance improvement. In terms of latency, we expect a decrease in the latency of all the experiments for every step increase in the installed memory. However, when the system was modelled using Equation 3.5, the results obtained contradicted this expectation for hardware model $Hm_4$. This can be demonstrated using the state diagram shown in Figure 3.16.

The performance improvement/degradation % is calculated as the average of the ratios of the difference between the estimated latencies of the application between two hardware models to the estimated latency of the application of one hardware model for all the array sizes chosen.

Figure 3.16: Performance Improvement in terms of Latency: The circles represent the hardware models of $m$; The text in green indicates performance improvement and the text in red (and minus sign) indicates performance degradation.

Here, if the latency estimated decreases for every step increase in the memory installed, then there is performance improvement. If the latency of the application increases, it implies performance degradation.

A performance improvement/degradation of $x$ % from $Hm_i$ to $Hm_j$ means that the value of a metric $kv$ for $Hm_j = (1-(x/100)) \times kv$ for $Hm_i$.
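The definition above can be sketched in Python as follows. The sketch assumes that the latency of $Hm_i$ is the denominator of each ratio, which is the reading consistent with $kv_j = (1 - x/100) \times kv_i$; the function name and the latency values are hypothetical.

```python
from statistics import mean


def improvement_percent(latencies_i, latencies_j):
    """Average over array sizes of (T_i - T_j) / T_i * 100.

    Positive values mean Hm_j is faster than Hm_i (improvement),
    negative values mean it is slower (degradation).
    """
    if not latencies_i or len(latencies_i) != len(latencies_j):
        raise ValueError("need two equally long, non-empty latency lists")
    return mean((t_i - t_j) / t_i * 100.0
                for t_i, t_j in zip(latencies_i, latencies_j))


# Invented latencies for two array sizes on Hm_i and Hm_j.
x = improvement_percent([10.0, 20.0], [5.2, 10.4])
print(f"{x:.0f} %")
```

In this invented example each latency on $Hm_j$ is 0.52 times the latency on $Hm_i$, which corresponds to an improvement of 48 %.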

The performance improvement/degradation shown in Figure 3.16 can be explained using the metrics used in $d_{10} \in D$.

From Figure 3.17a, it can be seen that the number of $L_1Miss$ decreases by 77 % from $Hm_1$ to $Hm_2$. The number of $L_2Miss$ and $L_3Miss$ also decrease by 63 % and 61 % respectively as seen in Figure 3.17b and Figure 3.17c. However, the number of disk accesses is extremely high as seen from Figure 3.17d. We still see an overall performance improvement of 48 % from $Hm_1$ to $Hm_2$ in Figure 3.16. An improvement of 48 % means that $T_{tot}$ for $Hm_2 = 0.52 \times T_{tot}$ for $Hm_1$.

---

8 All here refers to those array sizes for which the % of memory required is less than 80 %
Figure 3.17: Performance Evaluation of the estimated latencies using the model, d10 using the metrics: The circles represent the hardware models. Performance improvement is indicated using green text and performance degradation is indicated using red text.

Table 3.11: Average of the access times to cache and memory for hardware models: The values are rounded to the nearest integers. They are the average of all the values estimated for all array sizes. The access times are not constant values, but instead are functions of RAR.

<table>
<thead>
<tr>
<th>Hardware Model</th>
<th>$t_{hit2}$ (ns)</th>
<th>$t_{hit3}$ (ns)</th>
<th>$t_{mem}$ (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>$Hm_1$</td>
<td>21</td>
<td>41</td>
<td>94</td>
</tr>
<tr>
<td>$Hm_2$</td>
<td>12</td>
<td>26</td>
<td>74</td>
</tr>
<tr>
<td>$Hm_3$</td>
<td>17</td>
<td>36</td>
<td>66</td>
</tr>
<tr>
<td>$Hm_4$</td>
<td>29</td>
<td>43</td>
<td>96</td>
</tr>
</tbody>
</table>

Table 3.12: QPI, Remote access and Memory traffic ratios: The values are approximate and not exact values. They are the average of all the values estimated for all array sizes.

<table>
<thead>
<tr>
<th>Hardware Model</th>
<th>$\frac{QPI_{sock2}}{QPI_{sock1}}$</th>
<th>RAR</th>
<th>$\frac{M_r}{M_w}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$Hm_1$</td>
<td>2.5</td>
<td>0.5</td>
<td>2.5</td>
</tr>
<tr>
<td>$Hm_2$</td>
<td>1.4</td>
<td>0.2</td>
<td>2.5</td>
</tr>
<tr>
<td>$Hm_3$</td>
<td>1.5</td>
<td>0.2</td>
<td>2.5</td>
</tr>
<tr>
<td>$Hm_4$</td>
<td>2.5</td>
<td>0.5</td>
<td>2.5</td>
</tr>
</tbody>
</table>

An important inference here is that the performance improvement gained by an increase of 1 GiB in the memory configuration is not only due to the increase in the memory, but also due to the way in which the memory is installed into the DIMMs in every socket. In Table 3.10 and Figure 3.13 we have seen that no memory is allotted to socket 2 for $Hm_1$. However, from Table 3.12, we can see that the traffic to socket 2 is 2.5 times the traffic to socket 1 and the RAR is approximately 0.5 for $Hm_1$. It is also seen that the snooping traffic is equal to the ratio of $M_r$ to $M_w$ (which is 2.5). So, it can be inferred that the traffic via the QPI bus is only due to snooping. This increases the average $t_{hit2}$, $t_{hit3}$ and $t_{mem}$ access times to 21 ns, 41 ns and 94 ns as shown in Table 3.11. These are higher than the local cache and memory access times given in Table 3.8. Refer to Table 3.21 and Table 3.22 for the access times to the cache and memory and the RAR for all the array sizes.

From Table 3.12 it can also be seen that for $Hm_2$, the traffic to socket 2 via the QPI bus is approximately 1.4 times the traffic to socket 1. The RAR is also close to 0.20. The QPI traffic ratio and the RAR indicate that the number of remote accesses and the snooping traffic are reduced. This decreases the average $t_{hit2}$, $t_{hit3}$ and $t_{mem}$ access times to 12 ns, 26 ns and 74 ns from $Hm_1$ to $Hm_2$. The overall performance improvement in terms of latency can be seen in Figure 3.14.

From Figure 3.17a, it can be seen that the number of $L_1$Miss decreases by 9% from $Hm_2$ to $Hm_3$. The number of $L_2$Miss and $L_3$Miss also decrease by 23% as seen in Figure 3.17b and Figure 3.17c. The number of disk accesses is reduced by 99% as seen from Figure 3.17d. We see an overall performance improvement of 27% from $Hm_2$ to $Hm_3$.

The performance improvement gained from $Hm_2$ to $Hm_3$ (a step of 2 GiB) is only 27 %, unlike the performance improvement of 48 % from $Hm_1$ to $Hm_2$ seen in Figure 3.16. The traffic to socket 2 is 1.5 times the traffic to socket 1 and the RAR is close to 0.20, as seen from Table 3.12. The average access times $t_{hit2}$, $t_{hit3}$ and $t_{mem}$ estimated are 17 ns, 36 ns and 66 ns. $t_{hit2}$ and $t_{hit3}$ are higher than those estimated for $Hm_2$. Nevertheless, the numbers of $L_1$Miss, $L_2$Miss, $L_3$Miss and DiskHit are lower in $Hm_3$. The overall performance improvement in terms of latency can be seen in Figure 3.14.

From Figure 3.17a, it can be seen that the number of $L_1$Miss increases by 116% from $Hm_3$ to $Hm_4$. Similarly, the number of $L_2$Miss and $L_3$Miss also increase by 41% and 44% as seen in Figure 3.17b and Figure 3.17c. The number of disk accesses increases by 18% as seen from Figure 3.17d. We see an overall performance degradation of 19% from $Hm_3$ to $Hm_4$.

From Figure 3.17a, it can be seen that the number of $L_1$Miss increases by 134% from $Hm_2$ to $Hm_4$. Similarly, the number of $L_2$Miss and $L_3$Miss also increase by 22% and 26% as seen in Figure 3.17b and Figure 3.17c. However, the number of disk accesses reduces by 99% as seen from Figure 3.17d. We see an overall performance improvement of 8% from $Hm_2$ to $Hm_4$.

The performance degradation seen from $\hat{H}m_3$ to $\hat{H}m_4$ can be explained based on the amount of memory seen by the OS. As discussed and demonstrated earlier through Table 3.10 and Figure 3.13, the memory installed on socket 2 is not available. $\hat{H}m_4$ is therefore similar to $\hat{H}m_1$ in exhibiting heavy QPI traffic for snooping, which leads to increased access times to cache and memory. From Table 3.11 and Table 3.12, it can be seen that the access latencies are 29 ns, 43 ns and 96 ns and the RAR is approximately equal to 0.50, all of which are higher than the access latencies and RAR of $\hat{H}m_3$. The overall performance degradation in terms of latency can be seen in Figure 3.14.

The most interesting observation is the meagre performance improvement of 8 % from $\hat{H}m_2$ to $\hat{H}m_4$, which is solely due to the decrease in the number of disk accesses. It can be inferred that a step increase of 4 GiB (from 2 GiB to 6 GiB) does not deliver the intuitively expected performance improvement. This is seen in Figure 3.14.

Note from Figure 3.16:

- The performance improvement gained from $\hat{H}m_2$ to $\hat{H}m_4$ is 8 %, which equals 27 % - 19 %.

- The performance improvement gained from $\hat{H}m_2$ to $\hat{H}m_3$ is 27 %, which is ≈ (48 % - 19 %) = 29 %.

From this, it can be inferred that the performance degradation that occurs when memory is installed, or available, on only one processor (socket) of a multiprocessor PC is ≈ 20 %.

From Figure 3.17a, it is seen that the number of $L_1$Miss decreases by 82 % from $\hat{H}m_1$ to $\hat{H}m_3$. The number of $L_2$Miss and $L_3$Miss also decrease by 72 % and 70 % as seen in Figure 3.17b and Figure 3.17c. The number of disk accesses is reduced by 97 % as seen from Figure 3.17d. We see an overall performance improvement of 63 % from $\hat{H}m_1$ to $\hat{H}m_3$.

From Figure 3.17a, it is seen that the number of $L_1$Miss decreases by 47 % from $\hat{H}m_1$ to $\hat{H}m_4$. The number of $L_2$Miss and $L_3$Miss also decrease by 53 % and 49 % as seen in Figure 3.17b and Figure 3.17c. The number of disk accesses is reduced by 96 % as seen from Figure 3.17d. We see an overall performance improvement of 52 % from $\hat{H}m_1$ to $\hat{H}m_4$.

The performance improvements gained from $\hat{H}m_1$ to $\hat{H}m_3$ and from $\hat{H}m_1$ to $\hat{H}m_4$ are 63 % and 52 % respectively. The performance gain from $\hat{H}m_1$ to $\hat{H}m_4$ is smaller than the performance gain from $\hat{H}m_1$ to $\hat{H}m_3$ due to the way the memory is seen by the OS in $\hat{H}m_4$. This is explained in the earlier observations.

If \( A > B \) indicates that the performance of A is better than the performance of B, then, from Figure 3.16 and Figure 3.14, the performance evaluation can be summarized as:

\[
\hat{H}m_3 > \hat{H}m_4 > \hat{H}m_2 > \hat{H}m_1 \tag{3.13}
\]
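This ordering can be cross-checked against the improvement percentages reported relative to $\hat{H}m_1$ (48 %, 63 % and 52 % for $\hat{H}m_2$, $\hat{H}m_3$ and $\hat{H}m_4$). The Python sketch below converts them into latencies relative to $\hat{H}m_1$ and sorts the models. The variable names are hypothetical, and normalising the latency of $\hat{H}m_1$ to 1.0 is an assumption made for illustration.

```python
# Improvement of each hardware model relative to Hm1, in percent.
improvement_vs_hm1 = {"Hm1": 0.0, "Hm2": 48.0, "Hm3": 63.0, "Hm4": 52.0}

# Relative latency: T_tot(Hm_j) = (1 - x / 100) * T_tot(Hm_1).
relative_latency = {
    name: 1.0 - pct / 100.0 for name, pct in improvement_vs_hm1.items()
}

# Lower latency means better performance, so sort in ascending order.
ranking = sorted(relative_latency, key=relative_latency.get)
print(" > ".join(ranking))  # prints "Hm3 > Hm4 > Hm2 > Hm1"
```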

3.3.5 Performance Evaluation using STREAM

As discussed with Figure 3.4, STREAM calculates the time required for the vector kernel operations and the memory bandwidth. This is used to verify the results of the performance evaluation. As shown in Figure 3.4, STREAM calculates the minimum, maximum and average times and the optimum memory bandwidth for all the vector kernel operations. Note that the time recorded by STREAM is different from $T_{tot}$ (predicted) and $t_{app}$ (measured using the MeasureLoadCLI tool); it is displayed on the command-line interface of STREAM after the computations.

![Figure 3.18: Latency recorded by STREAM application: The latency is expressed in log scale: Lower the latency, better the performance](image)

- From Figure 3.18 it can be inferred from the latencies observed that,

\[
\hat{H}m_3 > \hat{H}m_4 > \hat{H}m_2 > \hat{H}m_1 \tag{3.14}
\]

- From Figure 3.19 it can be inferred from the memory bandwidth that,

\[
\hat{H}m_3 > \hat{H}m_4 > \hat{H}m_2 > \hat{H}m_1 \tag{3.15}
\]

Equations 3.14 and 3.15 concur with Equation 3.13. This supports the performance evaluation done on the different hardware models.
3.4 Execution Architecture Design-II

Intel’s tick-tock approach [17] moves to a new manufacturing technology every two years (called a tick), the cadence Moore’s law dictates. In the years in between, Intel updates the chip architecture but leaves the manufacturing process unchanged (called a tock). From Figure 3.20 it can be seen that the Intel roadmap has witnessed the Nehalem, Sandy Bridge and Haswell micro-architectures. Each new micro-architecture includes enhancements in terms of power optimization, performance and many other properties over its predecessors. The Allura system evolves with the Intel roadmap of processors: PCs with newer micro-architectures are integrated into the system. This motivates the comparison of PCs with different micro-architectures. Therefore, we compare the performance of two different micro-architectures, Nehalem and its successor, Sandy Bridge.
3.4.1 PC Architecture and Hardware Model

The two PCs used for the study are of the types Nehalem-EP\textsuperscript{9} and Sandy Bridge-EP\textsuperscript{10}. The hardware model and the physical attributes of the PC with the Nehalem micro-architecture are described in Section 3.2. In this section, the hardware model and physical attributes of the PC with the Sandy Bridge micro-architecture are presented.

The PC architecture consists of two CPUs running at 2 GHz. Each processor has 8 cores. Each core is connected to a private $L_1$ data and instruction cache of 32 KiB each and a private $L_2$ cache of 256 KiB. Each CPU is connected to an $L_3$ cache of 20 MiB which is shared among its 8 cores. Every CPU supports up to 4 memory channels (bidirectional) and connects each of the 8 cores to an external memory of 32 GiB. Each graphic card has an external memory of 2 GiB. Each socket has 40 lanes of integrated PCIe 3.0. The system is connected to 3 graphic cards, 2 network cards and 15 grabber cards via PCIe. CPU 1 is connected to a SAS disk with a maximum capacity of 160 GiB (and 20000 rpm) via a SAS LSI controller. CPU 1 is also connected to a SATA disk with a maximum capacity of 600 GiB (and 15000 rpm) via the PCH. The maximum theoretical bandwidths to the network card, graphic card, grabber card, SATA disk and SAS disk are tabulated in Table 3.13. This table also repeats the hardware attributes of the Nehalem PC used for comparison. The hardware model\textsuperscript{18} of the PC with Sandy Bridge micro-architecture is shown in Figure 3.21. An instantiated CPU representation is shown in Figure 3.22.

The application used for the study is STREAM. A detailed description of STREAM is presented in Section 3.2.2. The latency of STREAM on the PC is calculated using the model developed in Section 3.2.4. The assumptions and decisions made in Section 3.2 are adopted in this section. The definitions of the hardware metrics used for Nehalem also hold for Sandy Bridge. Therefore, we use the set of models ($d_1, \ldots d_{10} \in D$) to evaluate the performance of this PC.

3.4.2 Comparison and Reflection

From the hardware models and hardware attributes described for Nehalem and Sandy Bridge, it can be seen that there are many architectural differences. Nearly every aspect

\textsuperscript{9}Intel makes two categories of processor sockets for servers: Expandable capacity (EX) and Efficient Performance (EP) processor technology. The EP processors can run on servers with a maximum capacity of 1 or 2 sockets per server. The EX processors can run on servers with 1, 2, 4, 8, or more sockets per server.

\textsuperscript{10}Intel makes three categories of processors for Sandy Bridge: E with a maximum capacity of 1 socket (used in Desktop chips), EN with a capacity of 1 or 2 sockets per server and EP with a capacity of 1, 2, or 4 sockets per server.
of the Sandy Bridge micro-architecture has been redesigned to improve per-core performance and power efficiency [20]. It has improved CPU arithmetic, CPU multimedia, cryptography, power efficiency, media transcoding, memory controller speed and $L_3$ cache performances. However, this section presents major architectural advancements [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 18, 34, 35, 36] and highlights the most important architectural differences in the context of determining the performance of the PC in terms of the latency of the application chosen.

1. The Die: The major architectural changes on the die are described here. See Figure 3.22. Intel has devised innovative techniques not found in its predecessor, Nehalem. These include:
Table 3.13: Hardware Attributes of the PCs: The hardware attributes in the table are taken from the device specification sheets [19, 18], BIOS settings and SiSoftware Sandra.

<table>
<thead>
<tr>
<th>Attribute of machine $m$</th>
<th>Sandy Bridge-EP</th>
<th>Nehalem-EP</th>
</tr>
</thead>
<tbody>
<tr>
<td>$m.CPUmodel$</td>
<td>Intel Xeon CPU E5-2650</td>
<td>Intel Xeon CPU E5504</td>
</tr>
<tr>
<td>$m.Socket$</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>$m.CPUfreq(GHz)$</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>$m.Core$</td>
<td>16</td>
<td>8</td>
</tr>
<tr>
<td>$m.Thread$</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>$m.L_0(KiB)$</td>
<td>5.25</td>
<td>-</td>
</tr>
<tr>
<td>$m.L_1d(KiB)$</td>
<td>32</td>
<td>32</td>
</tr>
<tr>
<td>$m.L_2(KiB)$</td>
<td>256</td>
<td>256</td>
</tr>
<tr>
<td>$m.L_3(MiB)$</td>
<td>40</td>
<td>8</td>
</tr>
<tr>
<td>$m.MemChannel$</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>$m.DIMM$</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>$m.Mem(GiB)$</td>
<td>64</td>
<td>6</td>
</tr>
<tr>
<td>$m.MemFreq(MHz)$</td>
<td>1600</td>
<td>800</td>
</tr>
<tr>
<td>$m.TotalMemBw(GB/s)$</td>
<td>51.2</td>
<td>19.2</td>
</tr>
<tr>
<td>$m.QPIlink$</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>$m.TotalQPIBw(GB/s)$</td>
<td>32</td>
<td>19.6</td>
</tr>
<tr>
<td>$m.PCIeversion$</td>
<td>3.0</td>
<td>2.0</td>
</tr>
<tr>
<td>$m.Disk_{SAS}(GiB)$</td>
<td>160</td>
<td>-</td>
</tr>
<tr>
<td>$m.Disk_{SATA}(GiB)$</td>
<td>600</td>
<td>150</td>
</tr>
<tr>
<td>$m.GraphicCard$</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>$m.GpuMem(MiB)$</td>
<td>2048</td>
<td>512</td>
</tr>
<tr>
<td>$m.PCIeGraphicBw$ (Gbps)</td>
<td>40</td>
<td>80</td>
</tr>
<tr>
<td>$m.NetworkCard$</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>$m.PCIeNCBw$ (Gbps)</td>
<td>2.5</td>
<td>2.5 or 10</td>
</tr>
<tr>
<td>$m.GrabberCard$</td>
<td>15</td>
<td>-</td>
</tr>
<tr>
<td>$m.PCIeGrabberBw$ (Gbps)</td>
<td>10</td>
<td>-</td>
</tr>
<tr>
<td>$m.InfiniBandCard$</td>
<td>-</td>
<td>1</td>
</tr>
<tr>
<td>$m.PCIeICBw$ (Gbps)</td>
<td>-</td>
<td>20</td>
</tr>
<tr>
<td>$m.TurboBoostversion$</td>
<td>2.0</td>
<td>-</td>
</tr>
<tr>
<td>$m.maxTDP(W)$</td>
<td>95</td>
<td>80</td>
</tr>
<tr>
<td>$m.lithography(nm)$</td>
<td>32</td>
<td>45</td>
</tr>
</tbody>
</table>

(a) Number of cores: There are 8 cores in each CPU in the Sandy Bridge PC and 4 cores in each CPU in the Nehalem PC. Intel increased the number of cores to deliver improved performance for CPU intensive applications. As the application we chose is not CPU intensive, but hungry for memory, cache and their bandwidths, increasing the number of cores has an adverse effect on the performance of the Sandy Bridge PC. This is explained in detail later in this section.

(b) \( L_0 \) cache: One of the most novel features of the Sandy Bridge micro-architecture is that it includes a level 0 cache, also known as the micro-op cache. It is a subset of the \( L_1 \) instruction cache that contains fixed-length decoded micro-ops rather than raw bytes of variable-length instructions. The key aims of this cache are to improve the performance of CPU intensive applications by improving the bandwidth available at the front-end of the CPU and to remove decoding from the critical path. It also decreases the power consumption.

The size of \( L_0 \) cache is 5.25 KiB. The \( L_0 \) has a hit rate of about 80%.

(c) \( L_1 \) cache: The \( L_1 \) data cache in Sandy Bridge micro-architecture is redesigned to increase the bandwidth. It can sustain two 128 bit loads and one 128 bit store every cycle, unlike Nehalem which can sustain only one 128 bit load and one 128 bit store. This doubles the bandwidth available to the cache directly connected to the load and store buffers, \( L_1 \) cache.

(d) \( L_3 \) cache:

i. Location: Unlike the \( L_3 \) cache located in the un-core\(^{11}\) domain in the Nehalem PC, \( L_3 \) cache is directly connected to the cores in the processor clock domain on the Sandy Bridge PC.

ii. Capacity: The capacity of the \( L_3 \) cache is 20 MiB, which is 2.5 times the capacity of the \( L_3 \) cache in the Nehalem PC. Another important difference between the two \( L_3 \) caches is that in the Sandy Bridge micro-architecture, the \( L_3 \) cache is divided into 8 equally sized physical blocks (logically, there is one \( L_3 \) cache). This allows each block to service requests simultaneously, making its bandwidth approximately 8 times larger.

(e) System Agent: Sandy Bridge contains a new component, the System Agent, that controls the memory controller, PCI Express, display interfaces, and the DMI connection to the external south-bridge chip (PCH). The un-core domain in the Nehalem micro-architecture is replaced by the System Agent. The effect of the System Agent is not significant in this context.

(f) Ring Interconnect: The Sandy Bridge micro-architecture employs a ring interconnect between the cores, graphics, \( L_3 \) cache and system agent (including the display/media engine) which replaces the internal buses used in the Nehalem

\(^{11}\)The uncore is a term used by Intel to describe the functions of a microprocessor that are not in the Core, but which are essential for Core performance, such as, \( L_3 \) cache, power control unit, on die memory controller, QPI controller. Un-core frequency is lower than the core frequency.
micro-architecture. The coherent bidirectional ring is composed of four different rings: request, snoop, acknowledge and a 32 B wide data ring. Together these four rings are responsible for a distributed communication protocol that enforces coherency and ordering. The rings are fully pipelined and run at the core clock and voltage. The $L_3$ cache and the data rings in the system fabric run at the core frequency, providing up to 844 GB/s of fabric bandwidth. However, the $L_3$ cache in Sandy Bridge is shared by the cores, the integrated GPU and the system agent. Each of these agents accesses the $L_3$ cache via the ring. The bandwidth scales with the number of agents, but the scaling is not necessarily perfect because of the topology. As messages travel across the ring, they can block access to other agents, reducing the available bandwidth as the average hop count (e.g., number of cores) increases. Therefore, the $L_3$ cache is no longer a single unified entity as in Nehalem, but is instead distributed and partitioned for higher bandwidth and associativity.

From the factors mentioned, it can be summarised that, first, each slice of the $L_3$ cache in the Sandy Bridge micro-architecture (20 MiB / 8 = 2.5 MiB) is much smaller (≈ 3 times) than Nehalem’s 8 MiB $L_3$ cache, so the latency to access the tags and data arrays has decreased. Second, the ring and $L_3$ now reside in the same clock and voltage domain as the cores (and the core clock is certainly faster than the un-core clock in Nehalem). In the Nehalem micro-architecture, there is a latency penalty for signals crossing to a new voltage and clock domain. This penalty is determined by the ratio between the two frequencies and can be several cycles. Placing the caches and ring in the same domain reduces this latency.

Note that the latency of the ring will increase as more agents are attached; each hop on the ring takes 1 cycle, so the latency actually depends on the relative position of the requesting core and the receiving cache slice. The bandwidth available also depends on the blocking caused due to other agents. Nevertheless, the access time to $L_3$ cache is reduced by 25% (40 clock cycles in Nehalem to 30 clock cycles in Sandy Bridge).

(g) GPU on the die: GPU is directly connected to the die in the Sandy Bridge PC. This improves the speed of the components used in GPU. The GPU functionalities are enhanced and redesigned for better performance. As this is beyond the scope of this thesis, it is not discussed in detail.

(h) Other changes: As discussed earlier, every aspect of Sandy Bridge PC is redesigned to improve the performance. Some of the changes are: improved branch prediction, increased number of load and store buffers, efficient techniques for tracking and renaming micro-ops in flight, introduction of the AVX instruction set for improved floating point performance. As these finer details cannot be measured using the tools we use, they are not discussed in detail.
2. Memory: The memory outlay, frequency and configuration in Sandy Bridge are the most important factors affecting the performance in this context.

   (a) Frequency: The memory frequency doubles (from 800 MHz in Nehalem to 1600 MHz in Sandy Bridge) improving the access time to memory by 100%.

   (b) Memory Channels: The number of memory channels increases (from a maximum of 3 channels in Nehalem to 4 channels in Sandy Bridge), thereby increasing the memory bandwidth by 33%. However, for our study we consider the best possible configuration supported on the Nehalem PC, with 4 GiB of memory configured and 2 memory channels used. Therefore, the memory bandwidth available on the Sandy Bridge PC is twice that available on the Nehalem PC.

   (c) Capacity: The maximum memory configuration used in the Nehalem PC is 6 GiB. On the other hand, Sandy Bridge has a total of 64 GiB of memory installed in the system. This makes the number of disk accesses very low.

3. QPI links:

   (a) Number of QPI links: The number of QPI links between the sockets is doubled (from 1 bidirectional link in Nehalem to 2 bidirectional links in Sandy Bridge). The number of QPI links in the Sandy Bridge micro-architecture is increased to ensure scalability in servers that support many sockets. However, the PC we used for the study supports two sockets only. Also, from the experiments conducted on the Nehalem PC, it was seen that the percentage of QPI bandwidth used ranges from 10% to 30%. Therefore, we ignore the effect of the increased number of QPI links in our study.

   (b) QPI bandwidth: The bandwidth of each QPI link in the Sandy Bridge PC is 16 GB/s which is approximately 1.6 times the QPI bandwidth supported in the Nehalem PC (9.8 GB/s). This plays a significant role in improving the performance of the system.

   (c) QPI protocol: QPI 1.1 used in Sandy Bridge differs from the QPI 1.0 link used in Nehalem, both in the physical organisation and in the protocol implementation. The primary change is that QPI 1.1 uses a home-based snooping technique, unlike Nehalem which uses a source-based snooping technique. In source snooping, the requesting processor that missed in the $L_3$ cache broadcasts a snoop request to the entire system. Other caching agents (i.e., anything with a cache, such as another processor) may fulfil the snoop request if they hold a cached copy of the data. The home agent (i.e., the memory controller that owns the data) will respond to the snoop with a clean copy of the cache line if necessary. The home node still receives all of the acknowledgements from the caching agents and, if a conflict occurs, will resolve transactions in the correct order. In home snooping, the requesting processor first sends a request to the home agent. Second, the home agent sends snoop requests only to the caching agents that have a copy of the data (filtering) and possibly begins reading the cache line from memory. Lastly, the home node and (or) any caching agents send the data to the original requester.

The coherency management resides with the home agent in home snooping, which makes it simpler. Source snooping results in lower latency, especially when the requested cache line is held in remote memory and remote cache. However, home snooping is a more natural fit for inter-socket snoop filtering in servers with many sockets. We focus on two socket systems only. Therefore, we ignore the benefits of home snooping protocol.

4. IO Hubs: One of the most notable, and interesting changes in the hardware models of the two types of PCs presented is the removal of IO Hubs in the Sandy Bridge micro-architecture. The PCIe slots are directly connected to the die providing faster access and improved scalability. The system agent provides a direct connection from the CPU1 to the PCH that connects to the SATA disks, USB devices, network cards and grabber cards. A direct interface from CPU1 to SAS ports is supported by SAS LSI.

These features are remarkably different in the Nehalem micro-architecture, where the IO Hubs connect the PCIe slots, network cards and graphic cards to the CPUs. An ICH is connected to IO Hub 1 and provides the interface to the SATA disks used. The QPI links are used to connect the CPUs to the IO Hubs.

Though the changes are notable, they do not contribute to performance improvement in the context of this project because the application we chose does not exercise the network card, graphic card and their links. As the memory installed is 64 GiB and the OS used is Windows 7 (64 bit), all of the memory is available to the OS, so no disk accesses occur.

5. PCIe: The Sandy Bridge PC integrates PCI-Express 3.0 unlike Nehalem which uses PCI-Express 2.0. PCIe 3.0 provides higher bandwidth and better data encoding techniques. The application we chose does not exercise the PCIe. Therefore, we do not delve into further details.

6. Turbo Boost\textsuperscript{12}: The Sandy Bridge PC we used for the study uses Turbo Boost 2.0, whereas the Nehalem PC does not use this technology. In the Sandy Bridge PC, the processor and its thermal system cool down when the CPU is idle. It uses these thermal reserves and dynamically ramps the frequency and the supply

---

\textsuperscript{12}Intel Turbo Boost is a technology implemented by Intel in certain processors that enables the processor to run above its base operating frequency via dynamic control of the CPU’s clock rate. It is activated when the operating system requests the highest performance state of the processor.
voltage to operate above its rated TDP\textsuperscript{13} (for about 25 seconds according to Intel) until the thermal system heats up again in a situation dependent manner.

Note that the Turbo Boost feature can be disabled / enabled using the BIOS settings in a PC. In our study, the Turbo Boost technology is enabled on the Sandy Bridge PC.

The hardware attributes of both PCs are listed in Table 3.13, which shows the similarities and differences between the two PCs using numerical values.
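As a quick plausibility check, the $m.TotalMemBw$ entries of Table 3.13 can be reproduced from the number of memory channels and the memory frequency. The sketch below is not taken from the thesis. It assumes that each DDR3 channel is 64 bits (8 bytes) wide and that $m.MemFreq$ is the transfer rate in millions of transfers per second; the function name is hypothetical.

```python
# Theoretical peak DRAM bandwidth from the attributes in Table 3.13.
# ASSUMPTION: each DDR3 channel is 64 bits (8 bytes) wide and m.MemFreq is
# the transfer rate in millions of transfers per second.
BYTES_PER_TRANSFER = 8


def peak_mem_bandwidth_gbs(mem_channels: int, mem_freq_mhz: float) -> float:
    """Return the theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return mem_channels * mem_freq_mhz * BYTES_PER_TRANSFER / 1000.0


# Channel counts and frequencies as listed in Table 3.13.
sandy_bridge_bw = peak_mem_bandwidth_gbs(mem_channels=4, mem_freq_mhz=1600)
nehalem_bw = peak_mem_bandwidth_gbs(mem_channels=3, mem_freq_mhz=800)

print(f"Sandy Bridge-EP: {sandy_bridge_bw:.1f} GB/s")
print(f"Nehalem-EP:      {nehalem_bw:.1f} GB/s")
```

Under these assumptions the results agree with the 51.2 GB/s and 19.2 GB/s listed in the table.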

To summarize, the differences observed between the two PCs and the performance improvement or degradation expected from them (when using STREAM) are presented in Table 3.14, Table 3.15, Table 3.16 and Table 3.17.

<table>
<thead>
<tr>
<th>Changing Factor</th>
<th>Nehalem</th>
<th>Sandy Bridge</th>
<th>Effect on the Performance of Sandy Bridge</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bandwidth</td>
<td>1 load and 1 store</td>
<td>2 loads and 1 store</td>
<td>Bandwidth improvement</td>
</tr>
<tr>
<td>Total Effect</td>
<td></td>
<td></td>
<td>$t_{\text{hit1}} = \frac{1}{2} \cdot t_{\text{hit1}}$ for Nehalem $= \frac{2}{2} = 1$ ns.</td>
</tr>
</tbody>
</table>

Table 3.14: Effect on the Performance due to $L_1$ cache

<table>
<thead>
<tr>
<th>Changing Factor</th>
<th>Nehalem</th>
<th>Sandy Bridge</th>
<th>Effect on the Performance of Sandy Bridge</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_3$ cache: Capacity per core(MiB)</td>
<td>8</td>
<td>2.5</td>
<td></td>
</tr>
<tr>
<td>Location</td>
<td>In the un-core domain</td>
<td>In the processor clock domain</td>
<td></td>
</tr>
<tr>
<td>Connection to other agents</td>
<td>Internal Bus</td>
<td>Ring Interconnect in the processor clock domain</td>
<td></td>
</tr>
<tr>
<td>Total Effect</td>
<td></td>
<td></td>
<td>$t_{3local}$ reduces by 25%: $t_{3local} = t_{3local} \text{ in Nehalem} - \frac{t_{3local} \text{ in Nehalem}}{4} = 20 - \frac{20}{4} = 15$ ns</td>
</tr>
</tbody>
</table>

Table 3.15: Effect on the Performance due to $L_3$ cache

<table>
<thead>
<tr>
<th>Changing Factor</th>
<th>Nehalem</th>
<th>Sandy Bridge</th>
<th>Effect on the Performance of Sandy Bridge</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory Frequency(MHz)</td>
<td>800</td>
<td>1600</td>
<td>$t_{\text{mem}}$ reduces</td>
</tr>
<tr>
<td>Memory Channels</td>
<td>2</td>
<td>4</td>
<td>$t_{\text{mem}}$ reduces</td>
</tr>
<tr>
<td>Total Effect</td>
<td></td>
<td></td>
<td>$t_{\text{mem,local}} = \frac{\left(\frac{m.\text{MemChannels}}{m.\text{Cores}} \times m.\text{MemFreq}\right) \text{ for Nehalem}}{\left(\frac{m.\text{MemChannels}}{m.\text{Cores}} \times m.\text{MemFreq}\right) \text{ for Sandy Bridge}} \times t_{\text{mem,local}} \text{ in Nehalem} = \frac{(2/4) \times 800}{(4/8) \times 1600} \times t_{\text{mem,local}} = \frac{1}{2} \times t_{\text{mem,local}}$ in Nehalem.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>$t_{\text{mem,local}} = \frac{65}{2} = 32.5$ ns</td>
</tr>
</tbody>
</table>

Table 3.16: Effect on the Performance due to Memory

\textsuperscript{13}The thermal design power (TDP), sometimes called thermal design point, refers to the maximum amount of power the cooling system in a computer is required to dissipate. It is typically the maximum power that it would draw when running real applications. The lower it is, the better.
Table 3.17: Effect on the Performance due to QPI

<table>
<thead>
<tr>
<th>Changing Factor</th>
<th>Nehalem</th>
<th>Sandy Bridge</th>
<th>Effect on the Performance of Sandy Bridge</th>
</tr>
</thead>
<tbody>
<tr>
<td>QPI Bandwidth per link (GB/s)</td>
<td>9.8</td>
<td>16</td>
<td>Bandwidth increases</td>
</tr>
<tr>
<td>Total Effect</td>
<td></td>
<td></td>
<td>$\frac{m.QPIBw \text{ for Sandy Bridge}}{m.QPIBw \text{ for Nehalem}} = \frac{16}{9.8} = 1.63$. $\therefore$ QPI latency reduces to 25 ns.</td>
</tr>
</tbody>
</table>

3.5 Experiments and Results-II

This section gives a detailed description of the experiments conducted on the Sandy Bridge PC. It describes the series of experiments done to compare the Nehalem and Sandy Bridge micro-architectures. The experiments done on the Nehalem PC are presented in Section 3.3. The models \(d_1, \ldots, d_{10}\), the notation and the experimental results presented in Table 3.21, Table 3.22, Table 3.23 and Table 3.24 are used in this section for comparison.

The series of steps illustrated in Figure 3.10 is adopted in this section. The hardware specifications of the PC on which the experiments are conducted are tabulated in Table 3.13 and the parameters chosen for STREAM are presented in Table 3.6. The procedure for conducting the experiments is adopted from Section 3.3.1. The OS used on the Sandy Bridge PC is Windows 7. The list of measured metrics on the Sandy Bridge PC is the same as listed in Table 3.7, except for QPI sock: the QPI traffic between the sockets cannot be measured on the Sandy Bridge PC using the Intel PCM.

3.5.1 Assumptions and Calculations

The effects of the change in micro-architecture on the metrics relevant to the model, presented in Table 3.14, Table 3.15, Table 3.16 and Table 3.17, are used to calculate the local and remote access times to the \(L_1, L_2, L_3\) caches, memory and disk. These are tabulated in Table 3.18. They are in turn used to calculate the total access times to the caches and memory for different array sizes. All these are tabulated in Table 3.20.

The hardware model of the PC with Sandy Bridge micro-architecture is denoted as \(\hat{H}m_5\). The metric, RAR represented by \(d_1\) in Section 3.2.4 is assumed to be 0.20. This assumption is required because the Intel PCM does not capture QPI metrics on the PCs with Sandy Bridge micro-architecture. However, the total memory is symmetrically distributed between the two sockets and the OS used is Windows 7 (64 bit). In Section 3.3.3 and Table 3.12, we have seen that for hardware models with symmetric memory configuration on both sockets, the average RAR is approximately 0.20. This is demonstrated using Figure 3.23. It can be seen that for \(\hat{H}m_2\) and \(\hat{H}m_3\) the RAR is approximately 0.20 and for \(\hat{H}m_1\) and \(\hat{H}m_4\) (asymmetric memory configuration), the value
Table 3.18: Access Times to Cache and Memory

<table>
<thead>
<tr>
<th>Cache Level /Memory</th>
<th>Local</th>
<th>Remote</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L_1 (ns)$</td>
<td>$t_{hit1}$</td>
<td>$t_{hit1}$</td>
</tr>
<tr>
<td></td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>$L_2 (ns)$</td>
<td>$t_{2local}$</td>
<td>$t_{2remote}$</td>
</tr>
<tr>
<td></td>
<td>6</td>
<td>6+25=31</td>
</tr>
<tr>
<td>$L_3 (ns)$</td>
<td>$t_{3local}$</td>
<td>$t_{3remote}$</td>
</tr>
<tr>
<td></td>
<td>15</td>
<td>15+25=40</td>
</tr>
<tr>
<td>DRAM (ns)</td>
<td>$t_{memlocal}$</td>
<td>$t_{memremote}$</td>
</tr>
<tr>
<td></td>
<td>32.5</td>
<td>32.5 + 25=57.5</td>
</tr>
<tr>
<td>Disk (ms)</td>
<td>$t_{disk}$</td>
<td>$t_{disk}$</td>
</tr>
<tr>
<td></td>
<td>2</td>
<td>2</td>
</tr>
</tbody>
</table>

of RAR is approximately 0.50. Intel also uses this principle to define various metrics in PCM [37]. Therefore, we chose RAR=0.20 for all experiments conducted on the PC with Sandy Bridge micro-architecture. This value remains constant for all array sizes.
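The relation between the local and remote columns of Table 3.18, and the role of the assumed RAR, can be illustrated with a short sketch. The remote times follow the table (local time plus the 25 ns QPI latency). The RAR-weighted blend is only an illustrative assumption about how a remote access ratio enters an average access time; the actual models $d_1, \ldots, d_{10}$ are defined in Section 3.2.4 and are not reproduced here. All function names are hypothetical.

```python
# Local access times for the Sandy Bridge PC in ns, as listed in Table 3.18.
LOCAL_NS = {"L2": 6.0, "L3": 15.0, "DRAM": 32.5}
QPI_LATENCY_NS = 25.0  # QPI latency from Table 3.17
RAR = 0.20             # remote access ratio assumed in this section


def remote_ns(local_ns: float) -> float:
    """Remote access time: local access time plus one QPI crossing."""
    return local_ns + QPI_LATENCY_NS


def blended_ns(local_ns: float, rar: float = RAR) -> float:
    """ASSUMPTION: RAR-weighted mean of the local and remote access times."""
    return (1.0 - rar) * local_ns + rar * remote_ns(local_ns)


for level, t_local in LOCAL_NS.items():
    print(f"{level}: remote = {remote_ns(t_local):.1f} ns, "
          f"blended = {blended_ns(t_local):.1f} ns")
```

With RAR = 0.20, four out of five accesses are served locally, so the blended time stays much closer to the local than to the remote access time.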

The $L_1 HitRatio$ cannot be measured using the tools selected (as with Nehalem). Therefore, we chose $L_1 HitRatio=0.94$ for the experiments. The value chosen for $L_1 HitRatio$ gives the best regression fit for every experiment. All these are tabulated in Table 3.20. Looking back at Figure 3.10, $L_1 HitRatio$ is a part of the curve fitting.

![Figure 3.23: Remote Access Ratio for Nehalem and Sandy Bridge micro-architecture: The RAR of all the hardware models of the PC with Nehalem micro-architecture and the PC with Sandy Bridge micro-architecture](image)

The metrics assumed and calculated are summarised in Table 3.19.
Table 3.19: Measured Metrics from the tools

<table>
<thead>
<tr>
<th>Assumed Metrics</th>
<th>Calculated Metrics</th>
</tr>
</thead>
<tbody>
<tr>
<td>$t_{2local}$</td>
<td>$t_{hit1}$</td>
</tr>
<tr>
<td>RAR</td>
<td>$L_1Hit$</td>
</tr>
<tr>
<td>$t_{disk}$</td>
<td>$t_{3local}$</td>
</tr>
<tr>
<td></td>
<td>$L_2Hit$</td>
</tr>
<tr>
<td></td>
<td>$L_3Hit$</td>
</tr>
<tr>
<td></td>
<td>$t_{memlocal}$</td>
</tr>
<tr>
<td></td>
<td>$t_{memremote}$</td>
</tr>
<tr>
<td></td>
<td>$t_{hit2}$</td>
</tr>
<tr>
<td></td>
<td>$t_{hit3}$</td>
</tr>
<tr>
<td></td>
<td>$t_{mem}$</td>
</tr>
</tbody>
</table>

3.5.2 Prediction and Verification

From the metrics captured using the tools, metrics assumed and metrics calculated, a profile, $K5$ is created for $Hm_5$. This is used to predict a metric, $T_{tot}$ for all the experiments. $T_{tot}$ is predicted as follows:

$$T_{tot} = F(K5, Hm_5)$$

The estimated (predicted) latency, $T_{tot}$ is verified by using $t_{app} \notin K5$. This is done as follows.

$$|T_{tot} - t_{app}| \leq \text{error} \quad (3.16)$$

In order to determine if the predicted ($T_{tot}$) and measured ($t_{app}$) values of latencies are comparable, the error percentage is calculated as follows:

$$\text{error} (\%) = \frac{T_{tot} - t_{app}}{t_{app}} \times 100 \quad (3.17)$$

Table 3.20 compares $T_{tot}$ with $t_{app}$. The average absolute value of the error percentage over all array sizes is 15%. This verifies the predictive quality of $d1, \ldots d10$ for $Hm_5$ and, subsequently, the choice of the assumed access times and RAR.
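The error computation of Equation 3.17 and the average of its absolute values can be sketched as follows. The latencies in the example are hypothetical placeholders, not values from Table 3.20, and the function names are illustrative.

```python
def error_percent(t_tot: float, t_app: float) -> float:
    """Signed error of Equation 3.17, in percent of the measured latency."""
    return (t_tot - t_app) / t_app * 100.0


def mean_abs_error_percent(predicted, measured) -> float:
    """Average of the absolute error percentages over all array sizes."""
    errors = [abs(error_percent(p, m)) for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


# Hypothetical latencies in seconds, one entry per array size.
predicted_s = [3.0, 3.0, 5.5]
measured_s = [2.5, 3.2, 5.0]

for p, m in zip(predicted_s, measured_s):
    print(f"T_tot = {p} s, t_app = {m} s, error = {error_percent(p, m):+.2f} %")
print(f"mean |error| = {mean_abs_error_percent(predicted_s, measured_s):.2f} %")
```

Taking the absolute value before averaging prevents over- and under-predictions from cancelling each other out.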

From Figure 3.24, it is seen that the measured and estimated latencies agree for all array sizes chosen during the experiments.
From Figure 3.10, models $d_1, \ldots, d_{10}$ can be formalized. Also, looking back at Figure 2.7, the models are validated on a machine with an entirely different micro-architecture.

Although the predicted and measured latencies are comparable, they do not follow a smooth path but show noticeable variations. The latency of the application is expected to increase with the array size. However, Figure 3.24 shows that not all array sizes exhibit this behaviour. This can be explained using the Active Relative Frequency (ARF) metric of the system, captured using the MeasureLoadCLI tool. ARF is defined as the ratio of the frequency at which the system runs to the base clock frequency of the system. An ARF value greater than 1 indicates that the Turbo Boost technology is active on the system. In Figure 3.25, the estimated latency of the application for all experiments conducted is plotted against the ARF of all the processors captured (expressed as a percentage). The ARF varies from 110% to 119%. The green blocks indicate a performance boost achieved by ramping up the processors’ frequency. A performance boost in this context means that the latency of the application does not increase when the array size is increased; it remains constant or decreases with respect to the preceding experiments (where smaller array sizes were chosen). The first green box shows that the estimated latency does not increase when the array size is increased from 1 million elements to 2 million elements, but remains at 3 s. This is because the ARF changes from 110% to 114%. The ramping up of the processor frequency accounts for the performance boost observed: at a higher frequency the processors are faster, which explains the lower latencies for larger array sizes. Similarly, for every occurrence of a performance boost, there is a corresponding increase in the ARF.
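The ARF definition and the notion of a performance boost used above can be sketched in a few lines. The running frequencies and latencies below are hypothetical; only the 2 GHz base frequency is taken from Table 3.13, and the function names are illustrative.

```python
BASE_FREQ_GHZ = 2.0  # m.CPUfreq of the Sandy Bridge PC (Table 3.13)


def active_relative_frequency(running_ghz: float,
                              base_ghz: float = BASE_FREQ_GHZ) -> float:
    """ARF: frequency at which the system runs divided by its base frequency."""
    return running_ghz / base_ghz


def boost_indices(latencies_s, arfs):
    """Indices where the latency does not increase although the array size did.

    Both sequences are ordered by increasing array size. An index is reported
    only when the ARF rose as well, matching the explanation given in the text.
    """
    hits = []
    for i in range(1, len(latencies_s)):
        no_increase = latencies_s[i] <= latencies_s[i - 1]
        arf_rose = arfs[i] > arfs[i - 1]
        if no_increase and arf_rose:
            hits.append(i)
    return hits


# Hypothetical running frequencies (GHz) and latencies (s) per array size.
arfs = [active_relative_frequency(f) for f in (2.20, 2.28, 2.28)]
latencies_s = [3.0, 3.0, 4.1]
print([f"{a:.0%}" for a in arfs])         # ARF expressed as a percentage
print(boost_indices(latencies_s, arfs))   # experiments showing a boost
```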

In STREAM, the highest array size that can be chosen is about 80 million elements. It is a memory bound application.
3.5.3 Performance Evaluation using models

The performance of the hardware model with Sandy Bridge micro-architecture, $H\hat{m}_5$, is compared with all the hardware models with Nehalem micro-architecture. From Figure 3.26, it is seen that the latency of STREAM estimated on the PC with Sandy Bridge micro-architecture is lower than the latencies estimated for all hardware models of the PC with Nehalem micro-architecture. Sandy Bridge definitely exhibits higher performance than Nehalem.

The performance gain of $H\hat{m}_5$ is estimated with respect to $H\hat{m}_1$, $H\hat{m}_2$, $H\hat{m}_3$, $H\hat{m}_4$ and shown using a state diagram representation.

From Figure 3.27, it is seen that the PC with Sandy Bridge micro-architecture shows a performance improvement of 26% over the hardware model that exhibits the best performance on the PC with Nehalem micro-architecture, $H\hat{m}_3$ (4 GiB memory configuration).
Figure 3.27: Performance Evaluation: The average value of ratios of the difference between the latencies of two micro-architectures to the latency of one micro-architecture for all array sizes is shown. The text highlighted in green indicates performance improvement.

It shows a performance improvement of 36% over the hardware model \(H\hat{m}_4\) (6 GiB), 41% over \(H\hat{m}_2\) and 63% over \(H\hat{m}_1\). From Equation 3.13, we already expect the performance improvement of \(H\hat{m}_5\) to be highest with respect to \(H\hat{m}_1\), followed by \(H\hat{m}_2\), \(H\hat{m}_4\) and \(H\hat{m}_3\), in that order. This can also be seen in Figure 3.26.
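The percentages above are averages of relative latency differences over all array sizes, as described in the caption of Figure 3.27. A minimal sketch of that calculation is given below. It assumes that the latency of the Nehalem hardware model is used as the denominator, which the caption does not state explicitly, and the latencies are hypothetical.

```python
def mean_relative_gain_percent(baseline_s, candidate_s) -> float:
    """Average of (baseline - candidate) / baseline over all array sizes, in %.

    ASSUMPTION: the baseline latency is the denominator of each ratio.
    """
    ratios = [(b - c) / b for b, c in zip(baseline_s, candidate_s)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical estimated latencies in seconds, one entry per array size.
nehalem_latency_s = [4.0, 8.0]
sandy_bridge_latency_s = [3.0, 6.0]

gain = mean_relative_gain_percent(nehalem_latency_s, sandy_bridge_latency_s)
print(f"performance improvement: {gain:.0f} %")
```

A positive value means that the candidate machine has the lower latency; a negative value would indicate a degradation.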

From Figure 3.26 and Figure 3.27, it can be inferred that:

\[
H\hat{m}_5 > H\hat{m}_3 > H\hat{m}_4 > H\hat{m}_2 > H\hat{m}_1
\] (3.18)

### 3.5.4 Performance Evaluation using STREAM

The latency of the vector kernel operations and the memory bandwidth recorded by STREAM shown in Figure 3.28 and Figure 3.29 is used to support the performance evaluation done on the Sandy Bridge and Nehalem PCs. This is done as discussed in Section 3.3.5.

From Figure 3.28, the performance of all the hardware models can be summarized as:

\[
H\hat{m}_5 > H\hat{m}_3 > H\hat{m}_4 > H\hat{m}_2 > H\hat{m}_1
\] (3.19)

From Figure 3.29, the performance of the PCs for all hardware models in terms of memory bandwidth can be summarized as:

\[
H\hat{m}_5 > H\hat{m}_3 > H\hat{m}_4 > H\hat{m}_2 > H\hat{m}_1
\] (3.20)

The memory bandwidth recorded for \(H\hat{m}_5\) is twice the memory bandwidth recorded for \(H\hat{m}_3\). This can be explained using the analysis done in Table 3.16.
Equation 3.18, derived from modeling, concurs with Equation 3.19 and Equation 3.20, derived from STREAM.

From the analysis done so far, for the PCs chosen, and models created, Sandy Bridge micro-architecture out-performs Nehalem micro-architecture.
3.6 Lessons Learned

This section lists some of the most important observations made when creating the models. They are:

1. Effect of Interleaved memory access: When the number of memory channels is increased in the architecture of a PC, the performance increases because of the interleaved memory access principle used in accessing the data from the memory. In interleaved memory, memory addresses are allotted to alternate memory banks. For example, in a processor with 4 memory channels, the first 64 bytes are allotted to the memory bank connected to channel 1, the next 64 bytes are allotted to the memory bank connected to channel 2, and so on. Memory access, which according to the locality principle mainly happens in adjacent memory areas, is thus distributed across all channels. When the data is accessed in parallel through all the memory channels, there is a definite performance gain. Therefore, the memory bandwidth increases and the latency decreases with the increase in the number of memory channels.

2. Verification of models: The models $d_1, \ldots, d_{10}$, created using the Nehalem micro-architecture, are verified using a measured metric not used in the model. The error percentage between the predicted and measured metric is less than 1% for the cases in which the model is valid.

3. Cost Reduction: From the performance evaluation done on the PC with Nehalem micro-architecture, if the OS used is Windows XP, then the performance of the PC with 4 GiB memory is better than the performance of the same PC with 6 GiB of memory. Unused memory can be removed from the PC to achieve cost reduction.

4. Validation of models: The models $d_1, \ldots, d_{10}$ are validated by testing their predictive quality on an entirely different machine with a different micro-architecture. The average of the absolute error percentages over all array sizes in the experiments is 15%. Note that the machines have to be physically available to validate the models created. Predicting the performance of machines that are not available, but whose specifications are known, is the next step and is left as future work.

5. Usefulness of STREAM: The STREAM benchmark program generates a huge amount of traffic on the memory buses. Although the ultimate aim of modeling is to apply it to real-time applications, STREAM is representative of memory-intensive operations. As a next step, the models created can be tested on a machine running real-time, memory-intensive operations.
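
The interleaved mapping described in item 1 can be illustrated with a minimal sketch. It assumes plain round-robin interleaving at the 64-byte granularity used in the example; real memory controllers may apply more elaborate address hashing, and the function names `channel_for_address` and `channel_histogram` are hypothetical.

```python
from collections import Counter

LINE_BYTES = 64  # interleaving granularity taken from the example in the text


def channel_for_address(address: int, num_channels: int) -> int:
    """Return the 1-based memory channel serving a byte address.

    Assumes plain round-robin interleaving; actual controllers may differ.
    """
    return (address // LINE_BYTES) % num_channels + 1


def channel_histogram(start: int, length: int, num_channels: int) -> Counter:
    """Count how many 64-byte blocks of a contiguous region land on each channel."""
    first_block = start // LINE_BYTES
    last_block = (start + length - 1) // LINE_BYTES
    return Counter(
        block % num_channels + 1 for block in range(first_block, last_block + 1)
    )


if __name__ == "__main__":
    # A contiguous 1 KiB access touches 16 blocks, spread evenly over 4 channels.
    print(channel_histogram(0, 1024, 4))
```

Under this assumption, a contiguous access is spread evenly over the channels, which is why adjacent accesses can be served in parallel.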
Table 3.20: Experimental results of \( \hat{H}_m \): Sandy Bridge micro-architecture

<table>
<thead>
<tr>
<th>Metrics</th>
<th>N(millions)</th>
<th>1</th>
<th>2</th>
<th>4</th>
<th>8</th>
<th>16</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>256</th>
<th>512</th>
<th>1024</th>
</tr>
</thead>
<tbody>
<tr>
<td>( L_2 ) HitRatio</td>
<td>0.9976</td>
<td>0.9903</td>
<td>0.9740</td>
<td>0.9400</td>
<td>0.9400</td>
<td>0.9400</td>
<td>0.9400</td>
<td>0.9400</td>
<td>0.9400</td>
<td>0.9400</td>
<td>0.9400</td>
<td></td>
</tr>
<tr>
<td>( L_3 ) HitRatio</td>
<td>1.0</td>
<td>0.99</td>
<td>0.97</td>
<td>0.95</td>
<td>0.93</td>
<td>0.92</td>
<td>0.91</td>
<td>0.90</td>
<td>0.89</td>
<td>0.88</td>
<td>0.87</td>
<td>0.86</td>
</tr>
<tr>
<td>RemoteAccessRatio</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
<tr>
<td>( L_1 ) HitRatio(10^3)</td>
<td>2.8</td>
<td>10.5</td>
<td>10.3</td>
<td>10.1</td>
<td>9.9</td>
<td>9.7</td>
<td>9.5</td>
<td>9.3</td>
<td>9.1</td>
<td>8.9</td>
<td>8.7</td>
<td>8.5</td>
</tr>
<tr>
<td>( L_2 ) HitRatio(10^6)</td>
<td>2900</td>
<td>2400</td>
<td>2100</td>
<td>1800</td>
<td>1500</td>
<td>1200</td>
<td>900</td>
<td>600</td>
<td>300</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>DiskHit</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>MemHit</td>
<td>1.9</td>
<td>11.4</td>
<td>42.4</td>
<td>128.4</td>
<td>408.5</td>
<td>415.5</td>
<td>412.35</td>
<td>409.4</td>
<td>471.0</td>
<td>488.3</td>
<td>463.6</td>
<td>482.5</td>
</tr>
<tr>
<td>( L_2 ) Miss(10^6)</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
<td>37.5</td>
</tr>
<tr>
<td>( L_3 ) Miss(10^6)</td>
<td>28.01</td>
<td>4.03</td>
<td>4.31</td>
<td>7.98</td>
<td>14.44</td>
<td>14.91</td>
<td>22.99</td>
<td>25.43</td>
<td>26.07</td>
<td>26.07</td>
<td>26.07</td>
<td>26.07</td>
</tr>
<tr>
<td>RemoteAccessHit</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
<tr>
<td>( L_1 ) Miss(10^3)</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
<td>11.0</td>
</tr>
<tr>
<td>( L_2 ) Miss(10^6)</td>
<td>0.22</td>
<td>-0.75</td>
<td>13.76</td>
<td>11.34</td>
<td>-11.05</td>
<td>-14.66</td>
<td>-14.93</td>
<td>-15.37</td>
<td>-8.81</td>
<td>-16.71</td>
<td>-22.40</td>
<td>-10.37</td>
</tr>
<tr>
<td>( L_3 ) Miss(10^6)</td>
<td>0.22</td>
<td>-0.75</td>
<td>13.76</td>
<td>11.34</td>
<td>-11.05</td>
<td>-14.66</td>
<td>-14.93</td>
<td>-15.37</td>
<td>-8.81</td>
<td>-16.71</td>
<td>-22.40</td>
<td>-10.37</td>
</tr>
</tbody>
</table>

Table 3.21: Experimental results with 1 GiB Memory Configuration (\(H\hat{m}_1\)): Nehalem micro-architecture

<table>
<thead>
<tr>
<th>Metrics</th>
<th>1</th>
<th>2</th>
<th>4</th>
<th>8</th>
<th>16</th>
<th>20</th>
<th>24</th>
<th>28</th>
<th>30</th>
<th>32</th>
<th>34</th>
<th>36</th>
<th>37</th>
</tr>
</thead>
<tbody>
<tr>
<td>L\text{HitRatio}</td>
<td>0.4238</td>
<td>0.6951</td>
<td>0.6900</td>
<td>0.6979</td>
<td>0.7100</td>
<td>0.7168</td>
<td>0.7199</td>
<td>0.7206</td>
<td>0.7214</td>
<td>0.7221</td>
<td>0.7228</td>
<td>0.7235</td>
<td>0.7239</td>
</tr>
<tr>
<td>L\text{Miss}</td>
<td>0.5708</td>
<td>0.3049</td>
<td>0.3090</td>
<td>0.3021</td>
<td>0.2932</td>
<td>0.2888</td>
<td>0.2869</td>
<td>0.2871</td>
<td>0.2878</td>
<td>0.2886</td>
<td>0.2891</td>
<td>0.2895</td>
<td>0.2897</td>
</tr>
<tr>
<td>\text{QPI}</td>
<td>1.0825</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
</tr>
<tr>
<td>\text{DiskHit}</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
</tr>
<tr>
<td>\text{tmem}</td>
<td>20500</td>
<td>78.70</td>
<td>19100</td>
<td>1100</td>
<td>90.80</td>
<td>39.70</td>
<td>19.70</td>
<td>80.47</td>
<td>88.90</td>
<td>53.64</td>
<td>43.00</td>
<td>-233.31</td>
<td>36.10</td>
</tr>
<tr>
<td>\text{error}</td>
<td>0.4437</td>
<td>197.90</td>
<td>44.30</td>
<td>0.4273</td>
<td>36.10</td>
<td>0.4548</td>
<td>37.80</td>
<td>4.47</td>
<td>20500</td>
<td>78.70</td>
<td>19100</td>
<td>1100</td>
<td>90.80</td>
</tr>
<tr>
<td>\text{HitRatio}</td>
<td>0.4238</td>
<td>0.6951</td>
<td>0.6900</td>
<td>0.6979</td>
<td>0.7100</td>
<td>0.7168</td>
<td>0.7199</td>
<td>0.7206</td>
<td>0.7214</td>
<td>0.7221</td>
<td>0.7228</td>
<td>0.7235</td>
<td>0.7239</td>
</tr>
<tr>
<td>L\text{Miss}</td>
<td>0.5708</td>
<td>0.3049</td>
<td>0.3090</td>
<td>0.3021</td>
<td>0.2932</td>
<td>0.2888</td>
<td>0.2869</td>
<td>0.2871</td>
<td>0.2878</td>
<td>0.2886</td>
<td>0.2891</td>
<td>0.2895</td>
<td>0.2897</td>
</tr>
<tr>
<td>\text{QPI}</td>
<td>1.0825</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
<td>1.0835</td>
</tr>
<tr>
<td>\text{DiskHit}</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
<td>1.3103</td>
</tr>
<tr>
<td>\text{tmem}</td>
<td>20500</td>
<td>78.70</td>
<td>19100</td>
<td>1100</td>
<td>90.80</td>
<td>39.70</td>
<td>19.70</td>
<td>80.47</td>
<td>88.90</td>
<td>53.64</td>
<td>43.00</td>
<td>-233.31</td>
<td>36.10</td>
</tr>
<tr>
<td>\text{error}</td>
<td>0.4437</td>
<td>197.90</td>
<td>44.30</td>
<td>0.4273</td>
<td>36.10</td>
<td>0.4548</td>
<td>37.80</td>
<td>4.47</td>
<td>20500</td>
<td>78.70</td>
<td>19100</td>
<td>1100</td>
<td>90.80</td>
</tr>
</tbody>
</table>

Notes: The table shows the experimental results for the Nehalem micro-architecture with 1 GiB Memory Configuration (\(H\hat{m}_1\)). The metrics include hit ratio, miss ratio, QPI, disk hit rate, and memory access time (tmem). The error values are also provided.
Table 3.22: Experimental results with 2 GiB Memory Configuration (\(H\hat{m}_2\)): Nehalem micro-architecture

<table>
<thead>
<tr>
<th>Metrics</th>
<th>1</th>
<th>2</th>
<th>4</th>
<th>8</th>
<th>12</th>
<th>16</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>256</th>
<th>512</th>
<th>1024</th>
</tr>
</thead>
<tbody>
<tr>
<td>HitRatio</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
<tr>
<td>Miss</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
<td>0.2</td>
</tr>
<tr>
<td>Metrics</td>
<td>N(millions)</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>56</td>
<td>60</td>
<td>64</td>
<td>68</td>
<td>70</td>
</tr>
<tr>
<td>L3 HitRatio</td>
<td></td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
<td>0.284</td>
</tr>
<tr>
<td>L3 HitRatio</td>
<td></td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
<td>0.283</td>
</tr>
<tr>
<td>LpMiss(10^6)</td>
<td></td>
<td>2.7</td>
<td>3.1</td>
<td>7.8</td>
<td>12.4</td>
<td>21.2</td>
<td>40.5</td>
<td>65.3</td>
<td>73.6</td>
<td>80.1</td>
<td>87.1</td>
<td>94.7</td>
</tr>
<tr>
<td>LpMiss(10^6)</td>
<td></td>
<td>3.0</td>
<td>2.9</td>
<td>7.0</td>
<td>11.4</td>
<td>21.2</td>
<td>40.5</td>
<td>65.3</td>
<td>73.6</td>
<td>80.1</td>
<td>87.1</td>
<td>94.7</td>
</tr>
<tr>
<td>MemHit(10^6)</td>
<td></td>
<td>1.14</td>
<td>1.14</td>
<td>42.8</td>
<td>76.1</td>
<td>146.5</td>
<td>284.4</td>
<td>395.9</td>
<td>420.8</td>
<td>450.4</td>
<td>492.5</td>
<td>548.1</td>
</tr>
<tr>
<td>MemHit(10^6)</td>
<td></td>
<td>1.386</td>
<td>1.185</td>
<td>0.7444</td>
<td>4.2</td>
<td>9.4</td>
<td>22.6</td>
<td>88.1</td>
<td>100.9</td>
<td>112.7</td>
<td>114.8</td>
<td>126.1</td>
</tr>
<tr>
<td>DiskHit</td>
<td></td>
<td>4</td>
<td>4</td>
<td>3.9</td>
<td>3.9</td>
<td>3.8</td>
<td>3.5</td>
<td>3.5</td>
<td>3.4</td>
<td>3.3</td>
<td>3.2</td>
<td>3.1</td>
</tr>
<tr>
<td>MemHit(10^6)</td>
<td></td>
<td>2.3</td>
<td>2.7</td>
<td>7.0</td>
<td>11.4</td>
<td>21.2</td>
<td>40.5</td>
<td>65.3</td>
<td>73.6</td>
<td>80.1</td>
<td>87.1</td>
<td>94.7</td>
</tr>
<tr>
<td>L3 HitRatio</td>
<td></td>
<td>0.9909</td>
<td>0.9904</td>
<td>0.9943</td>
<td>0.9977</td>
<td>0.9997</td>
<td>0.9997</td>
<td>0.9997</td>
<td>0.9997</td>
<td>0.9997</td>
<td>0.9997</td>
<td>0.9997</td>
</tr>
<tr>
<td>RemoteAccessRatio</td>
<td></td>
<td>0.0122</td>
<td>0.0083</td>
<td>0.0174</td>
<td>0.0349</td>
<td>0.0618</td>
<td>0.0960</td>
<td>0.1225</td>
<td>0.1397</td>
<td>0.2502</td>
<td>0.2688</td>
<td>0.2475</td>
</tr>
<tr>
<td>NumDelays(Ns)</td>
<td></td>
<td>0.5</td>
<td>0.3</td>
<td>0.7</td>
<td>8.2</td>
<td>8.8</td>
<td>9.2</td>
<td>14.9</td>
<td>15.6</td>
<td>16</td>
<td>16.8</td>
<td>15.9</td>
</tr>
<tr>
<td>NumDelays(Ns)</td>
<td></td>
<td>0.5</td>
<td>0.3</td>
<td>0.7</td>
<td>8.2</td>
<td>8.8</td>
<td>9.2</td>
<td>14.9</td>
<td>15.6</td>
<td>16</td>
<td>16.8</td>
<td>15.9</td>
</tr>
<tr>
<td>NumDelays(Ns)</td>
<td></td>
<td>0.5</td>
<td>0.3</td>
<td>0.7</td>
<td>8.2</td>
<td>8.8</td>
<td>9.2</td>
<td>14.9</td>
<td>15.6</td>
<td>16</td>
<td>16.8</td>
<td>15.9</td>
</tr>
<tr>
<td>% of Memory Required</td>
<td></td>
<td>0.36</td>
<td>1.12</td>
<td>2.24</td>
<td>4.47</td>
<td>8.94</td>
<td>16.76</td>
<td>27.94</td>
<td>31.29</td>
<td>33.53</td>
<td>35.76</td>
<td>38.00</td>
</tr>
<tr>
<td>error()</td>
<td></td>
<td>0.31</td>
<td>0.71</td>
<td>0.07</td>
<td>0.33</td>
<td>0.76</td>
<td>0.92</td>
<td>0.68</td>
<td>0.30</td>
<td>0.62</td>
<td>0.81</td>
<td>0.91</td>
</tr>
</tbody>
</table>

Table 3.23: Experimental results with 4 GiB Memory Configuration: Nehalem micro-architecture
<table>
<thead>
<tr>
<th>Metrics</th>
<th>1</th>
<th>2</th>
<th>4</th>
<th>8</th>
<th>16</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>256</th>
<th>512</th>
<th>1024</th>
<th>2048</th>
<th>4096</th>
<th>8192</th>
<th>16384</th>
</tr>
</thead>
<tbody>
<tr>
<td>L0 Hit Ratio</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
<td>0.9032</td>
</tr>
<tr>
<td>D3 Hit (%)</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
<td>4.8</td>
</tr>
<tr>
<td>MemHit(%)</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
<td>3.3</td>
</tr>
<tr>
<td>L1 Hit Ratio</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
<td>0.5614</td>
</tr>
<tr>
<td>L2 Hit Ratio</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
<td>2.1</td>
</tr>
<tr>
<td>L3 Hit Ratio</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
<td>831.7</td>
</tr>
<tr>
<td>Remote Access Ratio</td>
<td>0.5072</td>
<td>0.4767</td>
<td>0.4657</td>
<td>0.4567</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
<td>0.4504</td>
</tr>
<tr>
<td>Hittime(msec)</td>
<td>30.3</td>
<td>22.0</td>
<td>23.9</td>
<td>28.6</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
<td>28.7</td>
</tr>
<tr>
<td>Total Time</td>
<td>92.9</td>
<td>91.2</td>
<td>91.7</td>
<td>92.4</td>
<td>92.2</td>
<td>91.8</td>
<td>91.0</td>
<td>91.1</td>
<td>90.9</td>
<td>90.7</td>
<td>90.5</td>
<td>90.2</td>
<td>89.8</td>
<td>89.5</td>
<td>89.2</td>
</tr>
<tr>
<td>% of Memory Required error (%)</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
<td>0.75</td>
</tr>
<tr>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
<td>0.05</td>
</tr>
</tbody>
</table>
Chapter 3. Descriptive models for resource usage characteristics of applications
Chapter 4

Executable models for interventional X-Ray

4.1 Introduction

Designed for interventional cardiac, brain and vascular procedures, the Allura X-Ray system provides angiography that is extremely reliable and displays images of very high clarity. Diagnoses and treatments based on the angiography provided by this equipment play a major role in saving lives. This makes the Allura X-Ray system an extremely sophisticated and important piece of equipment for critical medical decisions. To ensure that the system meets the expectations of customers, a series of comprehensive tests is conducted at Philips Healthcare before the system is delivered to the customer. Although these tests unearth several failures before product delivery, a failure rate of 6% is still observed. The failures encompass system software, application software and hardware failures. Unlike a PC used by a layman, a deviation in the performance of the PC used in the Allura system can result in the loss of critical information required to make important medical decisions. Therefore, identifying these failures and removing under-performing PCs from the system assembly has been a research area of prime importance. This chapter presents the investigation done in this area and a brief description of the executable models that can be used to predict failure of the PCs based on the hardware resource usage metrics of the chosen application on the system.
4.2 The Failures

The failures of PCs observed in the factory during the execution of the Allura application are attributed to two causes:

- **Software Failure:** The application software performs according to the requirements on most of the PCs. The probability of failures due to all software problems encountered in the system and application software is about 3%.

- **Hardware Failure:** This kind of failure could lead to corrupt images, data loss in the network and complete breakdown of the system. The extensive set of tests conducted in the factory (night batch tests) indicates that parts of the test fail when PCs are tested for performance. The parts that fail, for example the disk and ethernet connection check tests, are indicators of problems in the hardware. The hardware failure rate is reported to be 3%.

We do not concentrate on software failures because they are not relevant to the production process. The production process ensures that the hardware components delivered to the customer are reliable, while the R&D department investigates software bugs and failures. This narrows down the object of interest of this thesis to hardware failures.

The architectural instances of the PCs that fail and those of the PCs that perform well are exactly the same. This motivates the comparison of PCs through quantitative data obtained from performance measurements.

4.3 Test Suite Description

The Allura X-Ray system components are assembled and tested using a comprehensive suite of tests designed specifically for testing the hardware components. The tests emulate the behaviour of the Allura application. They are run repeatedly, all night, for several days to stress the PCs used in the system. Although each of these tests exercises specific functionalities of the PCs, we are interested in an application that stresses the PC and makes it fail if it has a hardware problem. The Verify Image Processor Test (VIPT) fits this criterion. However, this test cannot be shipped to the manufacturer because of the sophisticated environmental set-up required and the need to run the tests repeatedly for several days. Nevertheless, its resource usage behaviour can be recorded and modelled. Therefore, we choose VIPT as the application (application and program are used interchangeably hereafter) we want to model.

4.3.1 VIPT:

The purpose of VIPT is to test the Image Processing (IP) functionality of the IP subsystem (the IP software stack in combination with the IP hardware) without the hardware and software of other subsystems (e.g.: Image Detection).

The test consists of a sequence of sub-tests. The result of each sub-test influences which sub-test is run next; for example, if the test discovers disk problems, disk tests are run automatically.

The VIPT basically consists of two phases:

1. Phase I: In this phase, the system is stressed to use all its functionality. It also constantly listens to a fault-finding component in the system to detect abnormalities, if any. See Figure 4.1 for the tests conducted in Phase I. The tests use Field Service Component IP (FSCIP) which basically forms a user interface to the Field Service Engineer to start the VIPT. It also consists of a module, Performance Monitoring and Routing (PMR) that implements different functionalities required by the interfaces of the VIPT.

2. Phase II: In this phase, the VIPT runs tests needed to analyse the errors found (if any) during the first phase.

The series of tests, called Image Channel (IC) tests conducted during Phase I are as follows:
1. **Init**: This is used to start all required graphs and a special acquisition procedure that gets images from the disk. It shows monitor ID, PC Name, port details on all the monitors connected to the system and serves as a visual check of the monitor connections. These can be visualized in Figure 4.2. It then starts a test that listens to the system to find errors. If no errors are found, it continues with the sub-tests.

![Figure 4.2: Init](image)

2. **Check System Test (IC1)**: This test checks a few easily diagnosable hardware resources in the system:

   - SMART data check for all disks
   - Quick ethernet check
   - Firewall check

   If problems are detected, diagnostic tests are scheduled to run during the second phase of the test.

3. **Disk Performance Test (IC2)**: This step measures the write speed of the IP PC disks. The speed should be at least 80 MB/s. The threshold of 80 MB/s is a trade-off between the expected write performance (which is specified as over 100 MB/s) and the minimally needed performance of at least 60 MB/s. The disk performance test tries to detect disk problems before they impact the functioning of the system.

4. **Image Channel IP Host (IC3)**: In this step, 100 images are displayed on all the monitors (Reference, Exam, View) of the system. The test starts by setting up the correct networks and enabling full blown image processing on the used modules. The frame speed is set to 30 Frames per Second (FPS) and then 100 images are shown on the monitors. These steps are shown in Figure 4.3. When all images have been shown, the Cyclic Redundancy Check (CRC) of all the images is checked against the expected value. This test is repeated for every monitor.

5. **Image Channel Image Recording (IC4)**: In this step, a repository, the Recorded Image List (RIL), is created and image processing is enabled. The frame speed is set to 30 FPS and then 100 images are recorded to the repository. If no errors occur, the test is successful. The steps can be visualised in Figure 4.4.

6. **Image Channel Image Playback (IC5)**: The test starts by setting up the correct network (e.g., host to IP PC connection, Infini-Band cable connection, ethernet) using the PMR module and enables image processing. The repository recorded in the previous step is played on all the IP PC monitors (100 images). See Figure 4.5. The CRC of all images is checked against the expected values and, if all the images are displayed correctly, the test is considered successful.

---

A cyclic redundancy check is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data.

collected in the previous test steps are analysed and specific diagnostic tests are scheduled for suspicious hardware parts.

8. **Image Channel Performance (IC7)**: If all previous steps have passed, the performance test is run. This test is not run when an error has been detected by any previous test, since we cannot be sure what we are testing in that case.

Tests are conducted with a frame rate of 30 to 33 FPS. Once a frame rate of 33 FPS is reached, the test increases the frame rate (up to a maximum of 50 FPS) until the test fails. This provides an informal performance indicator of the system.
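
The frame-rate ramp of IC7 can be sketched as follows. This is only an illustration under stated assumptions: the step size of 1 FPS and the callback `run_at_fps` (which stands in for one run of the performance test at a given frame rate) are hypothetical and are not taken from the VIPT implementation.

```python
from typing import Callable, Optional

START_FPS = 30  # initial frame rate used by the test
MAX_FPS = 50    # upper bound of the ramp


def highest_passing_fps(run_at_fps: Callable[[int], bool],
                        step: int = 1) -> Optional[int]:
    """Raise the frame rate until a run fails or MAX_FPS is reached.

    Returns the highest frame rate that passed, or None if even START_FPS failed.
    """
    best = None
    fps = START_FPS
    while fps <= MAX_FPS:
        if not run_at_fps(fps):
            break  # first failure ends the ramp
        best = fps
        fps += step
    return best
```

For a hypothetical machine that sustains at most 41 FPS, `highest_passing_fps(lambda fps: fps <= 41)` returns 41, which is the kind of informal performance indicator the test produces.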

The series of tests run during Phase II are:

1. Disk Analysis Test: This is a self test run on the disks with problems. The tests are run on these disks in parallel.

2. Firewall Analysis Test: The firewall settings are checked on the host and the IP PC. It should be turned ON and a list of applications should be added to the exception list.

3. Infini-Band Analysis Test: When an Infini-Band (image link) problem has been detected during Phase I, this analysis test is run. It tries to determine the problem in the hardware by means of a loop-back connector which the service engineer has to place manually.

4. Ethernet Analysis Test: When an Ethernet problem has been detected during Phase I, this analysis test is scheduled to run. It tries to determine a problem with the Ethernet hardware by means of a field-service cable which the service engineer has to place manually.

Note that we have collected measurements only for Phase I of the VIPT.
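
The CRC comparison used in IC3 and IC5 can be sketched as below. The sketch assumes CRC-32 as provided by Python's `zlib` module; the text does not state which CRC variant VIPT uses, and the function name `crc_mismatches` is hypothetical.

```python
import zlib
from typing import Iterable, List


def crc_mismatches(images: Iterable[bytes],
                   expected_crcs: Iterable[int]) -> List[int]:
    """Return the indices of images whose CRC-32 differs from the expected value."""
    return [
        index
        for index, (image, expected) in enumerate(zip(images, expected_crcs))
        if zlib.crc32(image) != expected
    ]
```

An empty result corresponds to a successful test step; any returned index identifies an image that was not reproduced bit for bit.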

4.4 Creation and Execution of Executable Models

As discussed in Section 4.1, the purpose of executable models is to predict the nature of the PC before production. The steps taken in this direction are listed below.

- Choosing the program $p_1$: A program that can stress the PC and pave the way to finding the indicators of hardware problems had to be chosen. For the reasons presented in Section 4.3 and the type of tests described in Section 4.3.1, VIPT is chosen as the program (application to be modelled), $p_1$.
- Include required metrics in the tool: The MeasureLoadCLI tool is used to measure the RUM for the VIPT program using Intel PCM. It measures a rich set (more than a hundred) of hardware metrics, ranging from fine-grained metrics such as $L_2$ and $L_3$ cache hits for individual cores to high-level metrics such as system and socket metrics. However, we found that the QPI traffic was not included as one of the metrics. The tool is enhanced by including the incoming QPI traffic to the sockets. Note that Intel does not provide support for measuring the outgoing QPI traffic on the Nehalem micro-architecture, or any QPI traffic on the Sandy Bridge micro-architecture.

- Include the MeasureLoadCLI tool in the factory tool-set: The tool was included in the factory tool-set using a script that concurrently runs the tool along with the VIPT. The metrics measured are logged into a text file for analysis.

Measurements are taken once every second. Some of the most important metrics measured are listed in Table 4.1.

<table>
<thead>
<tr>
<th>m.Core$_i$Usage</th>
<th>m.Socket$_j$Usage</th>
<th>m.Usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>m.Core$_i$L$_3$Miss</td>
<td>m.Socket$_j$L$_3$Miss</td>
<td>m.L$_3$Miss</td>
</tr>
<tr>
<td>m.Core$_i$L$_2$Miss</td>
<td>m.Socket$_j$L$_2$Miss</td>
<td>m.L$_2$Miss</td>
</tr>
<tr>
<td>m.Core$_i$L$_3$HitRatio</td>
<td>m.Socket$_j$L$_3$HitRatio</td>
<td>m.L$_3$HitRatio</td>
</tr>
<tr>
<td>m.Core$_i$L$_2$HitRatio</td>
<td>m.Socket$_j$L$_2$HitRatio</td>
<td>m.L$_2$HitRatio</td>
</tr>
<tr>
<td>-</td>
<td>m.Socket$_j$M$_r$</td>
<td>m.M$_r$</td>
</tr>
<tr>
<td>-</td>
<td>m.Socket$_j$M$_w$</td>
<td>m.M$_w$</td>
</tr>
<tr>
<td>-</td>
<td>m.Socket$_j$QPI</td>
<td>m.QPI</td>
</tr>
</tbody>
</table>

As seen from Table 4.1, the metrics are measured for core, socket and the system. The subscript $i = 1 \ldots 8$ for the core and $j = 1, 2$ for the socket. The definitions of the metrics are adopted from Chapter 3. The core, socket and system usage can be defined as:

$$
\text{Usage} = \frac{\text{Instructions}_{\text{retired}} \times \text{Threads}_{\text{per core}}}{\text{Active cycles} \times \text{maxIPC}} \times 100 \quad (4.1)
$$

where,

- Instructions retired: The total number of instructions executed in the measurement interval

- Threads per core: This is an indication of whether Simultaneous Multi-Threading (SMT) is ON or OFF in the PC. If SMT is ON, the value of threads per core is 2. If SMT is OFF or not supported at all, the value of threads per core is 1. It is equal to 1 for the PCs used for our study.

- Maximum Instructions Per Cycle (maxIPC): The maximum number of instructions that can be executed per cycle. It is equal to 4 in Nehalem and Westmere micro-architectures.

- Active cycles: The total number of clock ticks in the measurement interval.
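
As a small worked example of Equation 4.1, the sketch below evaluates the usage for one measurement interval. The function name `usage_percent` and the counter values in the usage note are hypothetical; the defaults (one thread per core, maxIPC of 4) follow the values stated above for the PCs in this study.

```python
def usage_percent(instructions_retired: int,
                  active_cycles: int,
                  threads_per_core: int = 1,
                  max_ipc: int = 4) -> float:
    """Evaluate Equation 4.1 for one measurement interval."""
    if active_cycles <= 0:
        raise ValueError("active_cycles must be positive")
    return instructions_retired * threads_per_core / (active_cycles * max_ipc) * 100.0
```

For instance, an interval with 2 billion retired instructions over 1 billion active cycles corresponds to 2 instructions per cycle, so `usage_percent(2_000_000_000, 1_000_000_000)` gives a usage of 50%.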

4.4.1 Hardware model of the PC

The Allura X-Ray system evolves along with the Intel roadmap of processor micro-architectures as discussed in Chapter 3, and incorporates processors with newer micro-architectures. For brevity, not all the hardware models of these processors are discussed here. For measurements and metrics, we use the hardware model representation used in Chapter 3. Refer to Figure 3.2.

4.4.2 Decisions and Limitations

To design and execute executable models it was necessary to make valid assumptions and decisions. They are:

The Tools used: The MeasureLoadCLI tool is used to measure the hardware metrics and the RUMM tool is used to create and execute executable models. Note that the timers used in both the tools are different and this has to be taken into consideration if the measurements from both the tools are to be compared.

Filter Background Noise: It is important to filter out of the measured metrics the background noise caused by the tools themselves. For this purpose, measurements were done on an idle IP PC using the MeasureLoadCLI and RUMM tools. This has to be considered before deciding the threshold values of the metrics used in creating executable models and in the analysis that follows.

Hardware setting: The RUMM tool supports the creation and execution of executable models only on a dual-socket PC. Non-Uniform Memory Access (NUMA) also had to be enabled in the BIOS settings of the PC. Section 4.6 gives a detailed explanation of the rationale behind this decision.

Data Availability: The factory environment setting for collecting the measurements from the PCs requires several rounds of testing to ensure it does not disrupt the system set-up. So, the data required for creating executable models is available only after the testing phase is completed. The executable models that can be used to predict hardware failures can be created and validated only when such PCs are made available.
4.4.3 RUMM Tool

As discussed in Chapter 2, micro-benchmark codes are used to create executable models. The term micro-benchmark in this context refers to a piece of code which is designed to create a certain amount of load on a hardware resource. The amount of load is variable from zero (no load) to 100% (full load). The process of increasing or decreasing load on a hardware resource through a micro-benchmark is referred to as tuning.
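
The notion of tuning can be illustrated with a minimal sketch of a CPU micro-benchmark. It is not the RUMM implementation: it assumes a simple duty-cycle technique in which one core busy-waits for a fraction of each period and sleeps for the rest, and the names `busy_idle_split` and `generate_cpu_load` are hypothetical.

```python
import time
from typing import Tuple


def busy_idle_split(load_percent: float, period_s: float = 0.1) -> Tuple[float, float]:
    """Split one period into (busy, idle) seconds for the requested load level."""
    if not 0.0 <= load_percent <= 100.0:
        raise ValueError("load_percent must be between 0 and 100")
    busy = period_s * load_percent / 100.0
    return busy, period_s - busy


def generate_cpu_load(load_percent: float, duration_s: float,
                      period_s: float = 0.1) -> None:
    """Keep one core busy for roughly load_percent of the time for duration_s seconds."""
    busy, idle = busy_idle_split(load_percent, period_s)
    end = time.perf_counter() + duration_s
    while time.perf_counter() < end:
        spin_until = time.perf_counter() + busy
        while time.perf_counter() < spin_until:
            pass  # busy-wait: this is the generated load
        if idle > 0:
            time.sleep(idle)
```

Tuning then amounts to changing `load_percent` between 0 (no load) and 100 (full load) until the measured metric matches the target value.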

The RUMM tool created by Gebreweld uses micro-benchmark codes to create executable models. It is a dashboard tool that can be used for the following purposes:

1. Make measurements on a single- or dual-socket PC
2. Analyse measurements using a Chart Tool
3. Create executable models for a two socket PC
4. Generate load on a PC by running the executable model created

We are interested in the functionalities of the RUMM tool listed in the steps 3 and 4.

4.5 Experiments and Results

In Chapter 2, we presented the activity diagram for creating and validating the executable models in Figures 2.3, 2.4, 2.5. The first step is collecting the RUM of the program, $p_1$ on any machine, $m$.

Quantitative analysis of data is a prerequisite to creating an executable model, $x$. This requires measurements from several PCs, which serve as a basis to classify a machine $m$ as good or bad.

We collected measurements from 6 PCs denoted as $m_1, \ldots, m_6$. Based on the results of several tests conducted on machines $m_1, m_2, m_3, m_4$, they are classified as good PCs: they show no signs of failure during night batch tests and exhibit high performance. This serves as the ground truth to verify the models created. $m_5$ and $m_6$ are classified as bad PCs. $m_5$ is a machine whose disk was found to be missing from the point of view of the application software, and $m_6$ is a machine that showed failures during the night batch tests and has a hard-to-find hardware problem.

The profiling of $m_1, \ldots, m_4$ is based on the measurements of $p_1$ and is used as a reference to decide the threshold values of the metrics on any good PC. Although $m_5$ is a bad PC, the reason for its failure is known (missing disk). The ultimate aim of this part of the thesis is to predict hard-to-find failures. $m_5$ is still used in the study to compare the behaviour of two bad PCs, one with a known reason ($m_5$) and the other with an unknown reason ($m_6$). The breakdown of the chosen program $p_1$ (VIPT) into the steps described in Section 4.3.1 is shown in Figure 4.6, which plots the socket usage (%) for about 200 samples of measurement on $m_1$.

The profiles of the machines $m_1, \ldots, m_4$ are compared with those of $m_5$ and $m_6$. $p_1$ is run 25 times; each run yields 200 samples of measurement when $m_1, \ldots, m_5$ are tested and 300 samples when $m_6$ is tested. Figures 4.7 and 4.8 plot, for every sample, the average of that sample over all the trials.
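
The per-sample averaging over trials can be sketched as follows. This is an illustration only: the function name `average_trials` and the representation of a trial as a plain list of metric values are assumptions, not part of the measurement tooling.

```python
def average_trials(trials):
    """Average sample i over all trials.

    `trials` is a list of equal-length lists; trials[t][i] is the value of
    one hardware metric at sample i of trial t.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    length = len(trials[0])
    if any(len(trial) != length for trial in trials):
        raise ValueError("all trials must have the same number of samples")
    # zip(*trials) groups the values that share a sample index.
    return [sum(column) / len(trials) for column in zip(*trials)]
```

For the experiments described here, `trials` would hold 25 lists of 200 samples each (300 for $m_6$), and the result is the single curve plotted per machine.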

Figure 4.6: VIPT steps

Figure 4.7: Hardware metrics $\in K$: usage (%) of both CPUs plotted against the samples of measurement (time series). The steps of $p_1$ (VIPT) are shown separately using dash-dotted lines.

The hardware metrics, the socket (CPU) usages shown in Figure 4.7, are denoted $k_1, k_2 \in K$ for every machine selected. Figure 4.7 shows that $m_5$ (in blue) and $m_6$ (in cyan) behave differently from the good PCs: their values are shifted in time along the x-axis, and their behaviour changes significantly during IC7. The performance of $m_5$ is affected by the missing disk; the system becomes slower and exhibits a time-shifted pattern compared to machines $m_1, \ldots, m_4$. $m_5$ shows little or no performance metric numbers, which is to be expected given the missing disk. $m_6$, on the other hand, shows performance numbers during IC7, but they are lower than those of the set of good PCs. $m_1, \ldots, m_4$ exhibit a similar pattern along the x-axis for all metrics observed. However, $m_2$ (in pink) exhibits higher performance numbers on socket 1 and lower performance numbers on socket 2 than machines $m_1, m_3, m_4$ along the y-axis. This suggests that machines classified as good PCs exhibit a similar pattern but admit small variations in the peak values along the y-axis. This has to be taken into account when deciding the threshold values of metrics for creating executable models. Similar inferences can be drawn from Figure 4.8.
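
The time shift visible in the plots was read off by eye in this work. One way it could be quantified is to search for the lag at which a machine's curve best matches a reference curve from a good PC. The sketch below is an assumption about how such an analysis might look, not a method used in the thesis; the function name `estimate_shift`, the covariance score and the minimum-overlap rule are illustrative choices.

```python
def estimate_shift(reference, candidate, max_lag):
    """Return the lag (in samples) at which `candidate` best matches `reference`.

    A positive lag means that the pattern appears later in `candidate`.
    The score is the covariance of the overlapping parts of the two series.
    """
    best_lag, best_score = 0, float("-inf")
    min_overlap = max(2, len(reference) // 2)  # ignore very short overlaps
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(reference[i], candidate[i + lag])
                 for i in range(len(reference))
                 if 0 <= i + lag < len(candidate)]
        if len(pairs) < min_overlap:
            continue
        mean_r = sum(r for r, _ in pairs) / len(pairs)
        mean_c = sum(c for _, c in pairs) / len(pairs)
        score = sum((r - mean_r) * (c - mean_c) for r, c in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Applied to the averaged socket usage of a good PC and of $m_5$ or $m_6$, a clearly non-zero lag would express the slowdown as a number of samples.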

![Figure 4.8: Hardware metrics](image)

(a) Socket 1 Mread  
(b) Socket 1 Mwrite  
(c) Socket 2 Mread  
(d) Socket 2 Mwrite

Figure 4.8: Hardware metrics $\in K$: the memory traffic read from ($M_r$) and written to ($M_w$) the memory (in GB/s) for both CPUs, plotted against the time series. The metrics are denoted $k_3, k_4, k_5, k_6 \in K$ for every machine selected.

For all the hardware metrics plotted, we can see that the IC7 test distinguishes the behaviour of $m_1, \ldots, m_4$ from that of $m_5$ and $m_6$. Therefore, $p_1$ (VIPT) is a good indicator of hardware failures in a PC. Referring back to Figure 2.3 in Chapter 2, this can be verified using Equation 2.2.

The part of the thesis in which the executable models are created and validated remains unfinished because quantitative data from a sufficient number of good and bad PCs was not available in time during the project. However, a suitable program that reveals differences in the behaviour pattern of PCs has been identified, and some metrics that exhibit a time shift in their performance numbers have been demonstrated. We also found that a threshold for metrics that are indicators of failures needs to be a range of values rather than a single value, to account for the variation in performance numbers among good PCs. Quantitative data from many more PCs can be used to create reliable models that take the behavioural dynamics of distinct PCs into account. With these factors taken into consideration, this work can be carried forward to achieve the intended goals; this is left as future work.
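
A threshold expressed as a range could, for example, be built as a per-sample band that spans the values seen on the good PCs. The sketch below is one possible construction and was not implemented in this work; the function names `good_band` and `fraction_outside`, the relative `margin` and any cut-off applied to the resulting fraction are assumptions.

```python
def good_band(good_series, margin=0.10):
    """Per-sample (low, high) band spanning all good PCs.

    `good_series` holds one averaged series per good PC. The band is widened
    on both sides by `margin` times the spread at that sample.
    """
    band = []
    for column in zip(*good_series):
        low, high = min(column), max(column)
        slack = margin * (high - low)
        band.append((low - slack, high + slack))
    return band


def fraction_outside(series, band):
    """Fraction of samples of `series` that fall outside the band."""
    outside = sum(1 for value, (low, high) in zip(series, band)
                  if not low <= value <= high)
    return outside / len(band)
```

A machine under test would then be flagged when `fraction_outside` exceeds a tolerance that still has to be chosen from data on many more PCs.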

4.6 Lessons Learned

The experiments were conducted on a dual-socket PC for the reasons mentioned in Section 4.4.2. Since the PCs used had two physical sockets, we expected to see measurements for both sockets. However, the PCM counters displayed only one physical socket, and the measurements were taken for one socket only. On inspecting the BIOS settings of the PCs, we found that the NUMA feature was disabled for performance reasons known to Philips Healthcare.

Platforms with more than one chip are NUMA platforms when a separate DRAM is associated with the IMC of every chip. NUMA organisations have distinct performance advantages and disadvantages. As they are not a part of this project, they are not discussed in detail. It is important to understand, however, that one can choose to populate all the memory banks by turning NUMA on or to populate only one memory bank by turning NUMA off. For applications that function only in the presence of two sockets, NUMA has to be turned on in the BIOS settings. Metrics such as QPI traffic can be measured only when NUMA is enabled. Therefore, we enabled NUMA.

Chapter 5

Conclusions and Future Work

In cardiovascular minimally invasive interventions, physicians require low-latency X-ray imaging applications. This requires sufficient performance of the image-processing system while executing a plurality of functions. Multiple applications are executed in parallel, and the PCs used for these systems have to offer sufficient performance under various use-cases. Adequate prediction and verification of the components used in the system, based on modeling beforehand, saves time and investment. The models can also be given to hardware vendors for use in the selection process. In addition, models can be used to decide the optimum combination of applications on a specific hardware platform. This thesis proposes the creation of descriptive and executable models to predict the performance of the system. More precisely, it consists of two parts:

1. To develop executable models using resource usage characteristics of an application run on a PC to be able to predict the nature of the PC (good/bad) before the production phase

2. To develop descriptive models to predict the execution of application(s), depending on the hardware model of the PC

In the context of part 1 of the thesis, the aim is to create executable models that emulate the behaviour of the Allura application (or of any application that shows that the PC is faulty). These models are intended to unearth the causes of unexpected and hard-to-find failures in the PCs used in the Allura system. Executable models can be created using micro-benchmark codes. When these models are run on a PC, the nature of the PC should be deducible from the measurements of the hardware metrics on that PC. This kind of prediction offers significant cost reduction by avoiding the shipping charges that would otherwise be incurred when PCs fail, and saves time because PCs that meet the required specifications can be manufactured if the models are shipped to the manufacturer of the

Creating executable models relies on the availability of measurements of the suspect hardware metrics that are most likely to indicate failures. A thorough analysis of quantitative measurements from several PCs is required to discover the agents of failure and to create models that stress the hardware components pointing to them. However, for a sophisticated system such as the Allura X-ray system, collecting such measurements involves several steps to ensure that they do not impair the functioning of the working system. The tools used to collect measurements need to be integrated into the existing tool-set framework of the Allura system, and this procedure requires rigorous and repeated execution of tests to ensure reliability. This was one of the major challenges encountered during the project. The failure rate of the PCs used in the Allura system is less than 6 %. To create executable models, it must be possible to test and take measurements from PCs that are known to have hard-to-find hardware problems. Because such PCs are rare, this was another major challenge: we were able to test only one such PC. The Allura system executes a plurality of functions through the integration of several distinct PCs, and it functions successfully with the Allura application only after careful integration of all the system components. Therefore, testing an IP PC with the Allura application inevitably requires the successful set-up of the entire cabinet of PCs used in the system. This task is complex and was a third major challenge in this project.

Due to the challenges discussed, it was not possible to complete the creation of executable models that predict the nature of the PC. However, several observations were made that are useful for the future course of this part of the thesis. They are:

1. Effect of NUMA: The IP PCs used for the study have two sockets (and therefore two CPUs). The executable models use micro-benchmark codes that stress specific hardware components. The OS must recognise both sockets and their respective memory banks in order to create and execute models. In addition, metrics such as QPI traffic are captured by Intel PCM only on a two-socket machine. The OS distinguishes the two physical sockets only when NUMA is enabled. For Allura X-ray systems, the NUMA feature is turned off in the BIOS settings of the PC because the system is known to exhibit better performance that way. Therefore, to be able to create models and capture QPI metrics, it is necessary to enable NUMA.

2. Choice of the application: The creation of useful executable models that can actually predict failures depends on the correct choice of an application that not only emulates the Allura application but also stresses the hardware components of interest specifically enough to expose unusual behaviour. After certain experiments and expert suggestions, it was found that VIPT, a part of the testing framework used for Allura systems, can be used as the application (program) for which the measurements

3. Data Analysis: Data was successfully collected, with the required settings, from four PCs with no hardware problems. Several metrics, such as the socket usages, socket cache misses, and socket reads from and writes to the memory, were analysed to be able to draw conclusions about the behaviour pattern of PCs with no problems. This step led to the inference that PCs with no hardware problems show high levels of performance during all the phases of VIPT (the application chosen). They also show comparable values for the hardware metrics, which could be used to decide the threshold values of relevant metrics for all good PCs. The measurements collected from one PC with hard-to-find problems and one PC with a missing disk led to the clear observation that PCs known to have hardware flaws exhibit a drastic change in performance numbers during IC7 of VIPT. They are also slower and show a time shift in the pattern when compared to the pattern of the set of good PCs. The metrics observed also show that IC7 is the test that distinguishes the PCs based on their nature. Therefore, the data from this test alone can be used for future analysis and observations.

These observations can be used to create executable models. Philips Healthcare has a tool-set, RUMM, that can be used to create executable models. However, this tool-set can only create models for the worst-case resource usage scenario. It can be enhanced to create executable models that consider all the measurement samples. Metrics such as QPI traffic, cache hit ratios and cache misses are not part of the modeling suite in RUMM. These fine-level metrics, if included in the tool-set, could help predict the nature of the PC along with the coarse-level metrics that are already measurable and tunable in the tool-set. With these ideas and suggestions, this part of the thesis can be continued to achieve the first goal of the thesis.
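
The difference between a worst-case model and a model that considers all the measurement samples can be illustrated with two load profiles derived from the same measured series. This is a conceptual sketch only and does not reflect how RUMM represents its models; the function names `worst_case_profile` and `windowed_profile` and the window-averaging step are assumptions.

```python
def worst_case_profile(samples):
    """Constant profile at the peak value (worst-case resource usage)."""
    peak = max(samples)
    return [peak] * len(samples)


def windowed_profile(samples, window):
    """Time-varying profile: the mean of each consecutive window of samples."""
    profile = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        profile.append(sum(chunk) / len(chunk))
    return profile
```

A model built from `windowed_profile` keeps the shape of the measured behaviour, including phases such as IC7, whereas the worst-case profile only preserves the peak.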

In the context of part 2 of the thesis, due to the challenges mentioned, it is not easy to run the Allura application on an IP PC. However, as an initial step, descriptive models can be created using other applications such as STREAM. Descriptive models are created using the measurements collected on IP PCs with the MeasureLoadCLI tool and the Perfmon utility, which capture several fine-level metrics that are extremely useful for this part of the assignment. The contributions of this part of the thesis are:

1. Performance models that are used to predict the performance of the same PC with different memory configurations: The performance models created are used to predict the latency of an application on a PC with a certain hardware model. These models use the resource usage characteristics of the application on hardware platforms. Using this as the key principle, the resource usage characteristics of the same application on a slightly modified hardware model are predicted. We have chosen the same IP PC with different memory configurations as hardware models to predict the performance of STREAM. These models can not only predict the performance, but also give performance numbers that can be used to compare different hardware configurations for an application. In the course of this part of the thesis, other observations were also made, the most important among them being:

(a) Effect of Asymmetric Memory Distribution: The models created and the measurements collected from the tools and STREAM show that the performance of a two-socket PC with an asymmetric memory configuration is lower than the performance of the same PC with less memory capacity that is distributed equally among the memory banks of the CPUs. Quantitatively, the performance degradation due to asymmetric memory distribution is approximately 20 %.

(b) Effect of the OS: The utilisation of the total memory installed, and thus the performance of the application on the PC, depends on the OS used. A 32-bit OS can address at most $2^{32}$ bytes $= 4$ GiB of memory. Therefore, if a 32-bit OS is installed on a PC with a 6 GiB memory configuration, cost reduction can be achieved by removing the unused memory (3 GiB). Additionally, the performance of STREAM on the PC with 4 GiB of memory installed is found to be better than the performance of the same PC with 6 GiB of memory installed, because the OS does not recognise the 3 GiB of memory installed on socket 2.

2. Performance models that are used to predict the performance of PCs with different micro-architectures: The performance models created are used to predict the latency of an application on a PC with an entirely different hardware model. This can be used to evaluate the performance of the Allura system as it evolves with the Intel processor roadmap. If the performance of a PC with a new micro-architecture can be predicted, the prediction can serve as a performance forecast before moving to that micro-architecture. This makes it possible to study the pros and cons of every micro-architecture before incorporating it into the Allura system. During this part of the thesis, it was found that the PC with the Sandy Bridge micro-architecture exhibits better performance than the PC with the Nehalem micro-architecture. The models can also be used to compare the performance of PCs with different micro-architectures. The PC with the Sandy Bridge micro-architecture shows a performance improvement of 26 % over the best possible configuration of the PC with the Nehalem micro-architecture (4 GiB memory configuration). This is also validated using the measurements from STREAM. The most interesting inferences made during the course of this part of the thesis are:

(a) Differences in the micro-architecture: It was seen that every aspect of the Sandy Bridge processor was redesigned by Intel to achieve better performance. The most important architectural enhancements were the inclusion of the $L_0$ cache, the $L_3$ cache on die, an increased number of memory channels, an increase in the memory frequency, an increase in the number of QPI links and their frequency, a change in the snooping protocol and the introduction of Turbo Boost technology. The performance improvements due to these architectural changes are predicted and incorporated in the models. The models created are validated on the PC with the Sandy Bridge micro-architecture.

(b) Effect of Turbo Boost: It was seen that, in the PC with the Sandy Bridge micro-architecture, the processor frequency ramps up to provide an increased level of performance and then ramps down to create thermal reserves. This affects the performance of the application.

The descriptive models created for both objectives can be used to predict the performance of the PCs for real-time applications such as Allura. The set of descriptive models can be extended and enriched to study other performance properties as well, such as throughput and bandwidth for the network and graphics cards. This broadens the scope of comparison and results in better performance evaluation of systems with higher workloads and real-time constraints. The enriched set of descriptive models can be used to achieve significant cost reduction by combining two or more PCs into one PC in a multi-functional system with several PCs, such as the Allura system.

References


