Intel Microarchitectures
The Cisco UCS uses Intel® processors belonging to the Nehalem and Westmere microarchitectures (more generally, the 45 nm and 32 nm Hi-k Intel® Core™ microarchitectures).
The Nehalem microarchitecture was introduced in servers in early 2009 and was one of the first architectures to use the new 45 nm (nanometer) silicon technology developed by Intel [32], [33], [34]. Nehalem processors span the range from high-end desktop applications up through very large-scale server platforms. The codename is derived from the Nehalem River on the Pacific coast of northwest Oregon in the United States.
In Intel® parlance, processor developments are divided into "tick" and "tock" intervals, as in Figure 2-24. A tick shrinks an existing architecture onto a new process technology, while a tock is a new architecture built on the existing process technology. Nehalem is the 45 nm tock. Westmere is the 32 nm tick following Nehalem.
Figure 2-24 Intel "Tick/Tock" processor development model
Nehalem and Westmere are a balance between different requirements:
- Performance of existing applications compared to emerging applications (e.g., multimedia)
- Equally good support for applications that are lightly or heavily threaded
- Implementations that range from laptops to servers
They try to optimize for performance and at the same time reduce power consumption. This discussion is based on an Intel® Developer Forum tutorial [32]. In the rest of this chapter, the innovations are described in relation to the Nehalem microarchitecture, but they are also inherited by the Westmere architecture. There are a few places where the architectures of Nehalem and Westmere differ, and these will be highlighted.
Platform Architecture
It is the biggest platform architecture shift in about 10 years for Intel. The inclusion of multiple high-speed point-to-point connections, i.e., Intel® QuickPath Interconnect (see "Dedicated High-Speed Interconnect" in Chapter 2, page 30), along with the use of Integrated Memory Controllers (IMCs), is a fundamental departure from the FSB-based approach.
An example of a dual-socket Intel® Xeon® 5500 (Nehalem-EP) system is shown in Figure 2-25. Note the QPI links between the CPU sockets and from the CPU sockets to the I/O controller, and the memory DIMMs directly attached to the CPU sockets.
Figure 2-25 Two-socket Intel Xeon 5500 (Nehalem-EP)
Integrated Memory Controller (IMC)
In Nehalem-EP and Westmere-EP, each socket has an Integrated Memory Controller (IMC) that supports three DDR3 memory channels (see "DDR2 and DDR3" in Chapter 2, page 41). DDR3 memories run at a higher frequency than DDR2 and therefore deliver higher memory bandwidth. In addition, a dual-socket architecture has two sets of memory controllers instead of one. All these improvements lead to a 3.4x bandwidth increase compared to the previous Intel® platform (see Figure 2-26).
Figure 2-26 RAM bandwidth
This will continue to increase over time, as faster DDR3 becomes available. An integrated memory controller also makes a positive impact by reducing latency.
Power consumption is also reduced, since DDR3 is a 1.5 V technology compared to the 1.8 V of DDR2. Power consumption tends to scale with the square of the voltage, so this roughly 17% reduction in voltage yields approximately a 30% reduction in power.
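The square-law argument can be checked numerically. The following is a minimal sketch that assumes power depends only on the square of the supply voltage, with frequency and capacitance held constant; the function name `relative_power` is illustrative, and real DIMM power also depends on speed grade and device density.

```python
def relative_power(v_new: float, v_old: float) -> float:
    """Ratio of new to old power, assuming power ~ voltage squared."""
    return (v_new / v_old) ** 2


DDR2_VOLTS = 1.8
DDR3_VOLTS = 1.5

voltage_drop = 1 - DDR3_VOLTS / DDR2_VOLTS
power_drop = 1 - relative_power(DDR3_VOLTS, DDR2_VOLTS)

print(f"voltage reduction: {voltage_drop:.1%}")  # 16.7%
print(f"power reduction:   {power_drop:.1%}")    # 30.6%
```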
Finally, the IMC supports both RDIMM and UDIMM with single, dual, or quad ranks (quad ranks is only supported on RDIMM; see "Memory Ranks" on page 39 and "UDIMMs and RDIMMs" on page 40, both from Chapter 2).
Nehalem-EX has a similar, but not identical, architecture. In Nehalem-EX, there are two IMCs per socket. Each IMC supports two Intel® Scalable Memory Interconnects (SMIs) connected to two Scalable Memory Buffers (SMBs) for a total of four SMBs per socket (see Figure 2-27). Each SMB has two DDR3 buses, each connecting two RDIMMs. The total number of RDIMMs per socket is therefore sixteen.
Figure 2-27 SMIs/SMBs
The overall memory capacity of a Nehalem-EX system as a function of the number of sockets and of the RDIMM capacity is summarized in Table 2-4.
Table 2-4. Nehalem-EX Memory Capacity
| Sockets | 4GB RDIMM | 8GB RDIMM | 16GB RDIMM |
|---|---|---|---|
| 2 sockets | 128 GB | 256 GB | 512 GB |
| 4 sockets | 256 GB | 512 GB | 1 TB |
| 8 sockets | 512 GB | 1 TB | 2 TB |
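The entries of Table 2-4 follow directly from the per-socket topology described above (two IMCs, two SMIs per IMC, two DDR3 buses per SMB, two RDIMMs per bus). A short sketch of that arithmetic, with constant and function names chosen here purely for illustration:

```python
# Per-socket Nehalem-EX memory topology, as described in the text.
IMCS_PER_SOCKET = 2
SMIS_PER_IMC = 2           # each SMI connects to one SMB
DDR3_BUSES_PER_SMB = 2
RDIMMS_PER_BUS = 2


def rdimms_per_socket() -> int:
    """Number of RDIMM slots reachable from one socket."""
    return IMCS_PER_SOCKET * SMIS_PER_IMC * DDR3_BUSES_PER_SMB * RDIMMS_PER_BUS


def capacity_gb(sockets: int, rdimm_gb: int) -> int:
    """Total memory in GB with every slot populated by same-size RDIMMs."""
    return sockets * rdimms_per_socket() * rdimm_gb


for sockets in (2, 4, 8):
    row = [capacity_gb(sockets, size) for size in (4, 8, 16)]
    print(f"{sockets} sockets: {row} GB")
```

The last row evaluates to 512, 1024, and 2048 GB, matching the 512 GB, 1 TB, and 2 TB entries in the table.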
Intel® QuickPath Interconnect (QPI)
All the communication architectures have evolved over time from buses to point-to-point links that are much faster and more scalable. In Nehalem, Intel® QuickPath Interconnect has replaced the front-side bus (see Figure 2-28).
Figure 2-28 Intel QPI
Intel® QuickPath Interconnect is a coherent point-to-point protocol introduced by Intel®, not limited to any specific processor, to provide communication between processors, I/O devices, and potentially other devices such as accelerators.
The number of QPIs available depends on the type of processor. In Nehalem-EP and in Westmere-EP, each socket has two QPIs allowing the topology shown in Figure 2-25. Nehalem-EX supports four QPIs allowing many other glueless topologies, as shown in Figure 2-29.
Figure 2-29 Nehalem-EX topologies
The Intel® Xeon® processor 7500 is also compatible with third-party node controllers that are capable of interconnecting more than eight sockets for even greater system scalability.
CPU Architecture
Nehalem increases the instructions per second of each CPU through a number of innovations, depicted in Figure 2-30.
Figure 2-30 Nehalem microarchitecture innovations
Some of these innovations are self-explanatory; we will focus here on the most important one that deals with performance vs. power.
In comparing performance and power, it is normally assumed that a 1% performance increase with a 3% power increase is break-even. The reason is that it is always possible to reduce the voltage by 1% and reduce the power by 3% (see "Chip Design" in Chapter 2, page 61).
The innovations that are important are those that improve the performance by 1% with only a 1% increase in power (better than break-even).
Intel® Hyper-Threading Technology (Intel® HT Technology)
Intel® Hyper-Threading Technology (Intel® HT Technology) is the capability of running multiple threads simultaneously on the same core; the Nehalem/Westmere implementation runs two threads per core. This enhances both performance and energy efficiency (see "Intel® Hyper-Threading Technology" in Chapter 2, page 27).
The basic idea is that with the growing complexity of each execution unit, it is difficult for a single thread to keep the execution unit busy. Instead, by overlaying two threads on the same core, it is more likely that all the resources can be kept busy, and therefore the overall efficiency increases (see Figure 2-31). Hyper-Threading consumes a very limited amount of area (less than 5%), and it is extremely effective in increasing efficiency in a heavily threaded environment. Hyper-Threading is not a replacement for cores; it complements cores by allowing each of them to execute two threads simultaneously.
Figure 2-31 Intel HT technology
Cache Hierarchy
The requirement of an ideal memory system is that it should have infinite capacity, infinite bandwidth, and zero latency. Of course, nobody knows how to build such a system. The best approximation is a hierarchy of memory subsystems that go from larger and slower to smaller and faster. In Nehalem, Intel® added one level of hierarchy by increasing the cache layers from two to three (see Figure 2-32).
Figure 2-32 Cache hierarchy
Level one caches (L1) (Instruction and Data) are unchanged compared to previous Intel® designs. In the previous Intel® design, the level two caches (L2) were shared across the cores. This was possible since the number of cores was limited to two. Nehalem increases the number of cores to four or eight, and the L2 caches can no longer be shared, due to the increase in bandwidth and arbitration requests (potentially 8X). For this reason, in Nehalem Intel® added L2 caches (Instruction and Data) dedicated to each core to reduce the bandwidth demand on the shared cache, which is now a level three (L3) cache.
Segmentation
Nehalem is designed for modularity. Cores, caches, IMC, and Intel® QPI are examples of modules that compose a Nehalem processor (see Figure 2-30).
These modules are designed independently and they can run at different frequencies and different voltages. The technology that glues all of them together is a novel synchronous communication protocol that provides very low latency. Previous attempts used asynchronous protocols that are less efficient.
Integrated Power Gate
This is a power management technique that is an evolution of the "Clock Gate" that exists in all modern Intel® processors. The clock gate shuts off the clock signals to idle logic, thus eliminating switching power, but leakage current remains. Leakage currents create leakage power consumption that has no useful purpose. With the reduction in channel length, starting approximately at 130 nm, leakage has become a significant part of the power, and at 45 nm, it is very important.
The power gate instead shuts off both switching and leakage power and enables an idle core to go to almost zero power (see Figure 2-33). This is completely transparent to software and applications.
Figure 2-33 Nehalem power gate
The power gate is difficult to implement from a technology point of view. The classical elements of 45 nm technologies have significant leakage. It required a new transistor technology with a massive copper layer (7 µm) that had not been done before (see Figure 2-34).
Figure 2-34 Power gate transistor
The power gate becomes more important as the channel length continues to shrink, since the leakage currents continue to increase. At 22 nm, the power gate is essential.
Nehalem-EP and Westmere-EP have "dynamic" power-gating ability to turn core power off completely when the core is not needed for the given workload. Later, when the workload needs the core's compute capability, the core's power is reactivated.
Nehalem-EX has "static" power gating. The core power is turned off completely when the individual core is disabled in the factory, such as when an 8-core part is fused to make a 6-core part. These deactivated cores cannot be turned back on. On prior generations, such factory deactivated cores continued to consume some power. On Nehalem-EX, the power is completely shut off.
Power Management
Power sensors are key in building a power management system. Previous Intel® CPUs had thermal sensors, but they did not have power sensors. Nehalem has both thermal and power sensors that are monitored by an integrated microcontroller (PCU) in charge of power management (see Figure 2-35).
Figure 2-35 Power Control Unit
Intel® Turbo Boost Technology
Power gates and power management are the basic components of Intel® Turbo Boost Technology. Intel® Turbo Boost mode is used when the operating system requests more performance and environmental conditions permit (sufficient cooling and power), for example, because one or more cores are turned off. Intel® Turbo Boost increases the frequency of the active cores (and also the power consumption), thus increasing the performance of a given core (see Figure 2-36). This is not a huge improvement (from 3% to 11%), but it may be particularly valuable in lightly threaded or non-threaded environments in which not all the cores can be used in parallel. The frequency is increased in 133 MHz steps.
Figure 2-36 Intel Turbo Boost Technology
Figure 2-36 shows three different possibilities: In the normal case, all the cores operate at the nominal frequency (2.66 GHz); in the "4C Turbo" mode, all the cores are frequency upgraded by one step (for example, to 2.79 GHz); and in the "<4C Turbo" mode, two cores are frequency upgraded by two steps (for example, to 2.93 GHz).
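The example frequencies follow from adding whole 133 MHz steps to the nominal clock. A minimal sketch of that arithmetic; the function names are illustrative, and the number of steps actually granted depends on the specific processor and on power and thermal headroom, which are not modeled here:

```python
STEP_MHZ = 133  # Turbo Boost step size quoted in the text


def turbo_frequency_ghz(nominal_ghz: float, steps: int) -> float:
    """Core frequency after raising the nominal clock by `steps` steps."""
    return round(nominal_ghz + steps * STEP_MHZ / 1000, 2)


def turbo_gain_percent(nominal_ghz: float, steps: int) -> float:
    """Frequency uplift relative to the nominal clock, in percent."""
    return round(100 * steps * STEP_MHZ / (nominal_ghz * 1000), 1)


NOMINAL_GHZ = 2.66
for steps in (0, 1, 2):
    print(steps, turbo_frequency_ghz(NOMINAL_GHZ, steps),
          turbo_gain_percent(NOMINAL_GHZ, steps))
```

For a 2.66 GHz part, one step gives 2.79 GHz (about 5%) and two steps give 2.93 GHz (about 10%).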
Virtualization support
Intel® Virtualization Technology (VT) extends the core platform architecture to better support virtualization software—e.g., VMs (Virtual Machines) and hypervisors aka VMMs (Virtual Machine Monitors); see Figure 2-37.
Figure 2-37 Virtualization support
VT has four major components:
- Intel® VT-x refers to all the hardware assists for virtualization in Intel® 64 and IA-32 processors.
- Intel® VT for Directed I/O (Intel® VT-d) refers to all the hardware assists for virtualization in Intel chipsets.
- Intel® VT for Connectivity (Intel® VT-c) refers to all the hardware assists for virtualization in Intel networking and I/O devices.
- VT Flex Migration to simplify Virtual Machine movement.
Intel® VT-x enhancements include:
- A new, higher privilege ring for the hypervisor: This allows guest operating systems and applications to run in the rings they were designed for, while ensuring the hypervisor has privileged control over platform resources.
- Hardware-based transitions: Handoffs between the hypervisor and guest operating systems are supported in hardware. This reduces the need for complex, compute-intensive software transitions.
- Hardware-based memory protection: Processor state information is retained for the hypervisor and for each guest OS in dedicated address spaces. This helps to accelerate transitions and ensure the integrity of the process.
In addition, Nehalem adds:
- EPT (Extended Page Table)
- VPID (Virtual Processor ID)
- Guest Preemption Timer
- Descriptor Table Exiting
- Intel® Virtualization Technology FlexPriority
- Pause Loop Exiting
VT Flex Migration
FlexMigration allows migration of a VM between processors that have different instruction sets. It does that by synchronizing on the minimum level of the instruction set supported by all the processors in a pool.
When a VM is first instantiated, it queries its processor to obtain the instruction set level (SSE2, SSE3, SSE4). The processor returns the agreed minimum instruction set level in the pool, not its own level. This allows VMotion between processors with different instruction sets.
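The pool-baseline idea can be sketched as follows. This is a simplified model that treats the instruction set levels named above as a strictly ordered list; it does not model the actual hardware mechanism, and all names (`INSTRUCTION_LEVELS`, `pool_baseline`, `level_reported_to_vm`) are hypothetical.

```python
from typing import List

# Ordered from oldest to newest; a level is assumed to include all earlier ones.
INSTRUCTION_LEVELS = ["SSE2", "SSE3", "SSE4"]


def pool_baseline(host_levels: List[str]) -> str:
    """Lowest instruction set level supported by every host in the pool."""
    return min(host_levels, key=INSTRUCTION_LEVELS.index)


def level_reported_to_vm(host_level: str, pool_levels: List[str]) -> str:
    """Level a VM sees when it queries a host that belongs to the pool."""
    baseline = pool_baseline(pool_levels)
    if INSTRUCTION_LEVELS.index(host_level) < INSTRUCTION_LEVELS.index(baseline):
        raise ValueError("host does not meet the pool baseline")
    # The host reports the pool baseline, not its own (possibly higher) level.
    return baseline


pool = ["SSE4", "SSE3", "SSE4"]
print(level_reported_to_vm("SSE4", pool))  # SSE3
```

Because every host in the pool reports the same level, a VM never starts using instructions that a later migration target lacks.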
Extended Page Tables (EPT)
EPT is a new page-table structure, under the control of the hypervisor (see Figure 2-38). It defines the mapping between guest-physical and host-physical addresses.
Figure 2-38 Extended page tables
Before virtualization, each OS was in charge of programming page tables to translate between virtual application addresses and the "physical addresses". With the advent of virtualization, these addresses are no longer physical, but instead local to the VM. The hypervisor needs to translate between the guest OS addresses and the real physical addresses. Before EPT, hypervisors maintained these page tables in software, updating them at significant boundaries (e.g., on VM entry and exit).
With EPT, there is an EPT base pointer and an EPT page table that allow the processor to go directly from the guest virtual address to the host physical address without hypervisor intervention, in a way similar to how an OS does it in a native environment.
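The two-stage translation can be illustrated with a toy model. The sketch below flattens each multi-level page table into a single dictionary from page number to page number and assumes 4 KB pages; the table contents and function names are made up for illustration and do not reflect the real table formats.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(address: int, table: Dict[int, int]) -> int:
    """Map an address through a flat page table, preserving the page offset."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"no mapping for address {address:#x}")
    return table[page] * PAGE_SIZE + offset


# Stage 1: maintained by the guest OS (guest virtual -> guest physical).
guest_page_table = {0x10: 0x02}
# Stage 2: maintained by the hypervisor (guest physical -> host physical).
extended_page_table = {0x02: 0x7F}


def guest_virtual_to_host_physical(guest_virtual: int) -> int:
    """Chain both stages, as the hardware walk does without a VM exit."""
    guest_physical = translate(guest_virtual, guest_page_table)
    return translate(guest_physical, extended_page_table)


print(hex(guest_virtual_to_host_physical(0x10123)))  # 0x7f123
```

The guest OS only ever edits the first table and the hypervisor only the second, which is why guest page-table updates no longer require hypervisor involvement.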
Virtual Processor ID (VPID)
This is the ability to assign a VM ID to tag CPU hardware structures (e.g., TLBs: Translation Lookaside Buffers) to avoid flushes on VM transitions.
Before VPID, in a virtualized environment, the CPU flushes the TLB unconditionally for each VM transition (e.g., VM Entry/Exit). This is not efficient and adversely affects the CPU performance. With VPID, TLBs are tagged with an ID decided by the hypervisor that allows a more efficient flushing of cached information (only flush what is needed).
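A toy model of a tagged TLB shows why tagging avoids unnecessary flushes. The class and method names below are hypothetical, and a real TLB is a fixed-size hardware structure rather than a dictionary; the sketch only illustrates that entries for different VMs can coexist and be invalidated selectively.

```python
from typing import Dict, Optional, Tuple


class TaggedTLB:
    """Toy TLB whose entries are tagged with a virtual processor ID."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, int], int] = {}

    def insert(self, vpid: int, virtual_page: int, physical_page: int) -> None:
        self._entries[(vpid, virtual_page)] = physical_page

    def lookup(self, vpid: int, virtual_page: int) -> Optional[int]:
        """Return the cached physical page, or None on a TLB miss."""
        return self._entries.get((vpid, virtual_page))

    def flush_all(self) -> None:
        """Behavior without VPID: every VM transition discards everything."""
        self._entries.clear()

    def flush_vpid(self, vpid: int) -> None:
        """Behavior with VPID: discard only the entries of one VM."""
        self._entries = {
            key: page for key, page in self._entries.items() if key[0] != vpid
        }


tlb = TaggedTLB()
tlb.insert(vpid=1, virtual_page=0x10, physical_page=0xA0)
tlb.insert(vpid=2, virtual_page=0x10, physical_page=0xB0)
tlb.flush_vpid(1)
print(tlb.lookup(1, 0x10))  # None
print(tlb.lookup(2, 0x10))  # 176
```

The same virtual page number maps to different physical pages for the two VMs, and flushing one VM leaves the other VM's translation cached.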
Guest Preemption Timer
With this feature, a hypervisor can preempt guest execution after a specified amount of time. The hypervisor sets a timer value before entering a guest and when the timer reaches zero, a VM exit occurs. The timer causes a VM exit directly with no interrupt, so this feature can be used with no impact on how the VMM virtualizes interrupts.
Descriptor Table Exiting
This feature allows a VMM to protect a guest OS from internal attacks by preventing relocation of key system data structures.
OS operation is controlled by a set of key data structures used by the CPU: IDT, GDT, LDT, and TSS. Without this feature, there is no way for the hypervisor to prevent malicious software running inside a guest OS from modifying the guest's copies of these data structures. A hypervisor using this feature can intercept attempts to relocate these data structures and forbid malicious ones.
FlexPriority
This is a technique to improve performance on older 32-bit guest OSes. It was designed to accelerate virtualized interrupt handling, thereby improving virtualization performance. FlexPriority accelerates interrupt handling by preventing unnecessary VM exits on accesses to the Advanced Programmable Interrupt Controller.
Pause Loop Exiting
This technique detects spin locks in multiprocessor guests to reduce "lock-holder preemption". Without this technique, a given virtual processor (vCPU) may be preempted while holding a lock. Other vCPUs that try to acquire the lock will then spin for their entire execution quantum.
This technique is present in Nehalem-EX, but not in Nehalem-EP.
Advanced Reliability
A lot of the innovation in Nehalem-EX compared to Nehalem-EP is in the advanced reliability area or more properly RAS (Reliability, Availability, and Serviceability); see Figure 2-39.
Figure 2-39 Nehalem-EX RAS
In particular, all the major processor functions are covered by RAS, including: QPI RAS, I/O Hub (IOH) RAS, Memory RAS, and Socket RAS.
Corrected Errors are now signaled using the Corrected Machine Check Interrupts (CMCI).
An additional RAS technique is Machine Check Architecture-recovery (MCAr); i.e., a mechanism in which the CPU reports hardware errors to the operating system. With MCAr, it is possible to recover from otherwise fatal system errors.
Some features require additional operating system support and/or hardware vendor implementation and validation.
This technology is implemented only in Nehalem-EX.
Advanced Encryption Standard
Westmere-EP adds six new instructions for accelerating encryption and decryption of popular AES (Advanced Encryption Standard) algorithms. With these instructions, all AES computations are done by hardware and they are of course not only faster, but also more secure than a software implementation.
This enables applications to use stronger keys with less overhead. Applications can encrypt more data to meet regulatory requirements, in addition to general security, with less impact to performance.
This technology is implemented only in Westmere-EP.
Trusted Execution Technology
Intel® Trusted Execution Technology (TXT) helps detect and/or prevent software-based attacks, in particular:
- Attempts to insert non-trusted VMM (rootkit hypervisor)
- Attacks designed to compromise platform secrets in memory
- BIOS and firmware update attacks
Intel® TXT uses a mix of processor, chipset, and TPM (Trusted Platform Module) technologies to measure the boot environment to detect software attacks (see Figure 2-40).
Figure 2-40 Intel trusted execution technology
This technology is implemented only in Westmere-EP.
Chip Design
When trying to achieve high performance and limit the power consumption, several different factors need to be balanced.
With the progressive reduction of the length of the transistor channel, the range of voltages usable becomes limited (see Figure 2-41).
Figure 2-41 Voltage range
The maximum voltage is limited by the total power consumption and the reliability decrease associated with high power; the minimum voltage is limited mostly by soft errors, especially in memory circuits.
In general, in CMOS design the performance is proportional to the voltage, since higher voltages allow higher frequency.
- Performance ~ Frequency ~ Voltage
Power consumption is proportional to the frequency and the square of voltage:
- Power ~ Frequency × Voltage²
and, since frequency and voltage are proportional:
- Power ~ Voltage³
Energy efficiency is defined as the ratio between performance and power, and therefore:
- Energy Efficiency ~ 1/Voltage²
Therefore, from an energy-efficiency perspective, the advantage of reducing the voltage (and hence the power; see Figure 2-42) is so large that Intel® decided to pursue it.
Figure 2-42 Power vs. performance
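The proportionalities in this section can be written compactly as follows. The symbols $f$ (frequency), $V$ (voltage), $P$ (power), and $\eta$ (energy efficiency) are introduced here for illustration only.

```latex
\begin{align*}
\text{Performance} &\propto f \propto V \\
P &\propto f\,V^{2} \propto V^{3} \\
\eta = \frac{\text{Performance}}{P} &\propto \frac{V}{V^{3}} = \frac{1}{V^{2}} \\
\frac{\Delta P}{P} &\approx 3\,\frac{\Delta V}{V} \quad \text{(for small changes in } V\text{)}
\end{align*}
```

The last line follows from differentiating $P \propto V^{3}$ and is consistent with the break-even rule quoted under "CPU Architecture": a 1% change in voltage (and hence performance) corresponds to roughly a 3% change in power.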
Since the circuits most subject to soft errors are memories, Intel® deploys in Nehalem a sophisticated error-correcting code (triple detect, double correct) to compensate for these soft errors. In addition, the voltage of the caches and the voltage of the cores are decoupled, so the caches can stay at a high voltage while the cores work at a low voltage.
For the L1 and L2 caches, Intel® has replaced the traditional six-transistor SRAM design (6-T SRAM) with a new eight-transistor design (8-T SRAM) that decouples the read and write operations and allows lower voltages (see Figure 2-43).
Figure 2-43 Six- vs. eight-transistor SRAM
Also, to reduce power, Intel® went back to static CMOS, which is the CMOS technology that consumes less power (see Figure 2-44).
Figure 2-44 Power consumption of different technologies
Performance was regained by redesigning some of the key algorithms, such as instruction decoding.