HPC Tech

Written by John Martinez

 Performance Computing Get Ahead Of The HPC Curve

1. Big-Time Compute Density. Microway’s NumberSmasher cluster combines Intel Xeon processors, fast InfiniBand interconnects, and high-speed storage in a potent computing platform.

We make a clear distinction between storage servers, rack-mounted machines, and even entry-level pedestal boxes perfect for an SMB. Although those are practically very different, the VARs who build and maintain one type of business-oriented solution are often well-qualified to take care of the others. Stepping into the world of high-performance computing is not like that. Resellers who get involved not only need expertise on the hardware side, but they should also be able to address unique software applications. We're here to help get your feet wet with the tenets of HPC.

Most conversations about servers are fairly general.

You pick a couple of processors based on the customer's budget and performance needs, choose a complementary motherboard with ample expansion, drop in memory, build out storage, and maybe even add a more capable networking controller. So long as it all comes together in the right chassis, there's a good chance you have a real winner. Resellers add value by taking those platforms and doing innovative things with them: optimizing efficiency by virtualizing a bunch of old servers, remotely managing an SMB network to make maintenance more economical, or centralizing data across a network, promoting comprehensive backups. But once you step into the realm of high-performance computing, the stakes go up.

In talking to a number of different individuals intimately involved with the HPC space, it's clear that earning credibility in this growing market necessitates more than a background in server hardware. Rather, the customers buying potent blades and clusters are academics, medical labs, research facilities, and engineering firms. In every case, they're running parallelized applications with unique requirements. And when they ask a VAR to build the right infrastructure to enable unfettered throughput, identifying potential bottlenecks and crushing them definitively becomes another part of the puzzle.

Is this space impenetrable to anyone not already in it, then? Decidedly not. Intel is pushing a couple of initiatives that make it easier for hardware vendors and system integrators to build verifiable clusters using familiar components. Moreover, the company's next generation of server motherboards will boast more diversity than ever, and HPC-specific platforms have already been announced. So, while the path isn't an easy one to walk, there are a number of different resources available to guide you along the way.

What Is HPC?

2. The Future Of HPC? The 50-core Knights Corner card demonstrates Intel’s Many Integrated Core Architecture, capable of leveraging massive parallelism to operate on x86 instructions.

"More than anything, the HPC customer is concerned with computational horsepower," says Brett Newman, a member of system integrator Microway's sales team. And it really doesn't get much simpler than that. "Sometimes they do really fun and interesting things in order to achieve the highest possible performance. Early on, those users were the first to employ threaded programming, multi-socket systems, coprocessors like GPUs, and clusters."

According to Newman, Microway sells a lot of its hardware to the government and academic labs. As you can imagine, details about the applications those systems end up running are hard to come by. But, on the commercial side, they include bioinformatics, weather modeling, circuit analysis, finite element analysis, computational fluid dynamics, and finance. In an academic research environment, clusters are often tasked with simulation.

For the most part, those aren't workloads that get crunched quickly. "On the lower end, you might see interaction in real-time," Newman continues. "But when it comes to the folks working on a larger scale, you're throwing a job at the cluster and waiting hours." The compute requirements for the big jobs are so large that they really can't be completed any faster. And that's why there's this perpetual race for higher and higher performance numbers. As the clusters get faster, the workloads become more complex.

To that end, how do you compare the capabilities of one cluster against another? According to Top500.org, the site that tracks the fastest computers, its list reflects performance as indicated by LINPACK. The benchmark performs numerical linear algebra, yielding a measure of a system's floating-point compute power. Does that mean you need to quantify the potential of any cluster you build as well? "Yes, we do run LINPACK-type tests on our systems," says Newman. "Typically what happens is that a customer wants to meet a certain performance characteristic. Sometimes they'll express that in a quantity of cores. Sometimes they'll express that in a LINPACK result. And sometimes they'll even express it in a raw floating-point-operations-per-second number."
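To see how those three ways of stating a requirement relate to one another, here is a rough sizing sketch in Python. It is only an illustration, not a tool any of the integrators quoted here described using. The clock speed, the per-cycle floating-point rate, the 80% LINPACK efficiency, and the 10 TFLOPS target are all hypothetical placeholders that you would replace with real figures for the processors and benchmark runs in question.

```python
import math


def peak_flops(nodes, sockets_per_node, cores_per_socket, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires flops_per_cycle operations per clock."""
    return nodes * sockets_per_node * cores_per_socket * clock_hz * flops_per_cycle


def nodes_needed(target_flops, sockets_per_node, cores_per_socket,
                 clock_hz, flops_per_cycle, linpack_efficiency):
    """Nodes required to sustain target_flops, given the fraction of peak
    that a LINPACK-style run is expected to achieve."""
    per_node_peak = peak_flops(1, sockets_per_node, cores_per_socket,
                               clock_hz, flops_per_cycle)
    sustained_per_node = per_node_peak * linpack_efficiency
    return math.ceil(target_flops / sustained_per_node)


# Hypothetical inputs: two-socket nodes, eight cores per socket, 2.0 GHz,
# 8 double-precision operations per cycle, 80% of peak sustained.
node_peak = peak_flops(1, 2, 8, 2.0e9, 8)
count = nodes_needed(10e12, 2, 8, 2.0e9, 8, 0.80)

print(f"Per-node peak: {node_peak / 1e9:.0f} GFLOPS")
print(f"Nodes for a sustained 10 TFLOPS: {count}")
print(f"Total cores: {count * 2 * 8}")
```

The same arithmetic runs in either direction: a customer who asks for a core count can be shown the peak figure it implies, and a customer who asks for a LINPACK number can be shown roughly how many nodes that means.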

There are a couple of different paths to take in getting to those requested performance figures. Some customers want powerful four- and eight-way servers slung together, requiring Intel's Xeon E7 family, while others would prefer as many two-processor nodes as they can get. "It depends on the application," says Jon Layish, president at Red Barn Technology Group. "Of course, there's a big premium on the cost of the CPUs when you get into those quad-processor configurations. We've found that biotech, bioinformatics, and math applications seem to like the four- and eight-socket platforms. But that's usually a customer working on one box. When it comes to building clusters, we definitely do more in the two-processor space."

3. Connectivity For Days. Each of the H2000’s four nodes hosts a pair of gigabit Ethernet ports and optional InfiniBand, along with add-in expansion. Dual power supplies drive up to eight 130 W CPUs.

As scientists and professionals continue looking for ways to squeeze even more FLOPS from their supercomputers, the elephant in the room is GPU-based computing. Up until recently, CPUs were used in a parallel fashion: in a single chassis as two-, four-, and eight-processor servers, and as clusters of servers harnessed together. It turns out, though, that the same modern graphics processors typically found in high-end gaming desktops are also quite adept at floating-point math. And with hundreds of programmable cores per GPU, it's appealing to consider the possibilities of multiple graphics cards driving parallelized scientific workloads. There's one big hurdle in the way, though. In order for applications to take advantage of the GPU, they have to be written specifically for it. That's a significant software development undertaking. Moreover, the gains attributable to GPUs shrink quite a bit when you run computations that aren't inherently suited to the architecture in question.

Intel's solution to that conundrum is its Many Integrated Core Architecture, known as MIC. Using x86 technology manufactured on a 22 nm process, the company recently fit as many as 50 cores on a single piece of silicon. That product, called Knights Corner, targeted HPC segments like oil exploration, scientific research, financial analysis, and climate simulation. A development kit called Knights Ferry introduced the developer tools needed to get existing x86 code running on the MIC.

The integrators with whom we talked seem to agree that there's plenty of potential to propel HPC performance using Intel's MIC and standardized programming languages. "I think you might see a lot of people thinking that if they just get a system with Knights Ferry in it, I can buy the Intel software tools or use the tools that are provided free with whatever Intel eventually brings to market, and end up with well-accelerated code after very little extra effort," suggests Microway's Newman. "I think there will be a very low barrier to entry for the x86 folks." Abhinav Chawade, an HPC engineer with solution provider AMAX, is even more convinced. "Industries like oil and gas have tons of proprietary code written for x86, and they don't want to port that software over to run on a GPU."

Connecting Clusters

4. Four Nodes In 2U. Intel’s upcoming Server System H2000 is compact, and yet it hosts up to four compute nodes, each with up to two Xeon E5 CPUs. The cumulative compute power is immense.

At their heart, the nodes of a cluster are made up of the same hardware components found in other servers. Intel claims that its processors are used in more than 80% of the world's supercomputers, so you're talking about the same Xeon CPUs with which you're already familiar. Perhaps the most foreign concept is the fact that the nodes in a cluster operate on workloads in parallel, and consequently need to communicate at very high speeds. That makes networking technology particularly important.

"If you look at the clusters in the Top500 list, they're dominated by InfiniBand-based networking, there's some 10 gigabit Ethernet, and there's a little gigabit Ethernet in there as well," says Microway's Newman. "But InfiniBand is the leader in there, and that's a function of the fact that you need a big pipe and very low latencies." Specifically, IDC says that, in 2010, InfiniBand had 44% of the HPC networking market. Ethernet had 38%, while 18% of systems employed proprietary interconnects. The company also predicts that, by 2014, both InfiniBand and Ethernet will have 43% share. "That's reassuring," says AMAX's Chawade, "because with 10 and 40 gigabit Ethernet, plus RNICs, InfiniBand has some competition."

InfiniBand is a point-to-point, switched I/O fabric designed with scalability in mind. InfiniBand supports remote direct memory access, which allows data transfer from one machine's memory into another's, circumventing the operating system's kernel on both ends. This is what enables the very low-latency networking that clusters require. Cabling everything together is fairly easy with short runs of copper. However, configuring InfiniBand can be more challenging than Ethernet in that the driver model is different.

Getting Cluster-Ready

5. Plenty Of Density. Armed with 16 DDR3 memory slots, the S2600WP motherboard enables incredible memory capacity for HPC-oriented workloads that really depend on large pools of RAM.

Building up clusters, linking nodes together, and supporting what has the potential to be a sizeable infrastructure necessitates experience, regardless of whether you have an extensive background in other server segments. Intel takes the pressure off of picking parts and putting them together by offering its Cluster Ready program to OEMs and system integrators. Although, from a reseller's perspective, the initiative's primary focus would seem to be benefiting the channel, it really serves to make supercomputing more accessible all around. Historically, Beowulf-style clusters, composed of commodity hardware, were put together by universities or government labs to fulfill a specific purpose. Those same organizations built, provisioned, and used the cluster. Under Intel's program, an integrator can construct a turnkey solution, freeing users from the burden of designing and building. Clusters manufactured identically in volume are also more economically viable than one-off designs, so long as they still meet the needs of those customers.

And so, Cluster Ready consists of many different resources to facilitate that process, beginning with a specification that sets forth hardware and software components that define a Cluster Ready platform. It's a comprehensive document, going so far as to include the management of nodes, resources, and node provisioning. Although it's bound to Linux as an operating system kernel, the spec aims to be flexible enough that you could use SuSE, Red Hat, or Ubuntu.

As you start working your way through some of the hardware and software integral to the cluster world, Intel's guide to creating and maintaining cluster recipes relieves the effort of striking out from scratch on each and every order. In practice, Intel is compelling you to document processes and procedures well. But by starting with Intel's Cluster Ready specification and validating it with the Cluster Checker tool, you also get to pin an Intel certification onto your configuration. As hardware and software change over time, another set of guidelines makes it easy to update validated recipes. The company doesn't even leave you to start from scratch with your first recipe, either. Going several generations back, it regularly posts pre-certified reference cluster designs as a foundation for future development. Those configurations feature processors, specific motherboards, cluster management software (though you're responsible for procuring it), and an operating system. For the Xeon 5600-series, six builds let you mix and match a number of software components certified with more than 30 nodes each.

Believe it or not, there's another option aside from the Cluster Ready reference builds. We've discussed Intel's Enabled Solutions Acceleration Alliance before—the initiative that sets forth to provide VARs with pre-validated configuration guides for a number of different applications. ESAA also has a section dedicated to high-performance computing, which gets significantly more diverse. Remember the discussion about setting up InfiniBand and how it's more involved than Ethernet? Well, ESAA has recipes available for Mellanox, QLogic, and Voltaire InfiniBand solutions. Flipping through them, you find detailed instruction on how to get hardware and software running smoothly. It also has guides available for a number of the company's other boards, including (and this is a big one) its Modular Server Compute Module. If you're ready to jump into HPC today, Intel provides plenty of ways to turn its components into potent high-performance clusters. What's to come, however, is even more thrilling.

Intel's Next-Gen Platforms

6. Connectivity For Days. The back of Intel’s S2600WP boasts two gigabit Ethernet ports, USB, and an option for high-speed, low-latency InfiniBand. Additional features can be added via PCI Express.

Unveiled at Supercomputing 2011, Intel's next-generation server platforms promise to build on what the company is already doing with its ESAA program to empower the channel with even more purpose-built hardware.

Sitting front and center is the Intel Server System H2000, which emphasizes HPC's focus on density by finessing four independent, hot-pluggable nodes into a 2U enclosure. Each of those nodes boasts a pair of upcoming Xeon E5 processors. Although they haven't been formally introduced yet (that'll happen in the first half of 2012), Intel already revealed that the Xeon E5s will include four memory channels capable of supporting up to DDR3-1600, totaling more than 50 GB/s of bandwidth per processor. Hardware-based support for Advanced Vector Extensions (AVX) effectively doubles the number of floating-point operations per clock each core can execute, and lots of on-die PCI Express support translates into rich connectivity, even in the face of compact dimensions.
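The "more than 50 GB/s" figure follows directly from the channel count and data rate. As a quick check, the Python sketch below works through the arithmetic, assuming the standard 64-bit (8-byte) data path of a DDR3 memory channel. The function name is made up for illustration, and the result is a theoretical peak rather than a measured number.

```python
def peak_memory_bandwidth_gbs(channels, transfers_per_sec, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s for one processor.

    bus_width_bits defaults to 64, the data width of a standard DDR3 channel.
    """
    bytes_per_transfer = bus_width_bits // 8
    return channels * transfers_per_sec * bytes_per_transfer / 1e9


# Four channels of DDR3-1600 (1,600 million transfers per second each).
per_cpu = peak_memory_bandwidth_gbs(channels=4, transfers_per_sec=1600e6)
print(f"Per processor: {per_cpu:.1f} GB/s")

# Two processors per node, as in each Server System H2000 node.
print(f"Per two-socket node: {2 * per_cpu:.1f} GB/s")
```

Real applications will see less than this ceiling, but it is a useful yardstick when a customer's workload is known to be bound by memory bandwidth.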

The Intel Server System H2000 will be available with a couple of different motherboard options, depending on specific customer need. Both feature twin LGA 2011 processor interfaces, lots of I/O, and support for the same InfiniBand controller. However, the S2600JF specifically is optimized for memory bandwidth. "It has eight DIMMs," says Garrett McKibben of Intel's Enterprise Platforms and Services Division. "And it supports the processor family's maximum 1,600 MT/s data rate." The S2600JF also proffers up to three 16-lane PCI Express 3.0 slots, two of which are accessible in the Server System H2000. One slot is roomy enough for any low-profile add-in board. The other is intended to take one of Intel's I/O Expansion Modules, which are PCI Express-based cards in a custom form factor. Whereas most high-density configurations offer a single slot, the I/O modules are small enough to facilitate a number of additional connectivity options, including two-port gigabit, four-port gigabit, and two-port 10 gigabit Ethernet; a QDR InfiniBand port; remote access and KVM support; and a number of four-port SAS controllers with RAID.

"In addition, the board has the option for InfiniBand-down," continues McKibben. "That's either QDR or FDR InfiniBand from Mellanox, integrated on-board." With up to 12 GB/s transfers and less than 0.7 microseconds of latency, such a high-speed interface stands to benefit performance-sensitive clusters, though a pair of gigabit Ethernet controllers comes standard as well. The future Xeon E5 family's PCI Express 3.0 links will play a particularly important part in enabling FDR (or Fourteen Data Rate) InfiniBand. In a four-lane configuration, you're looking at 56 Gb/s of throughput, which necessitates a third-gen connection for vendors building eight-lane cards.

7. Taking Care Of Business. Building clusters takes more than recommending the right components. Effective cabinetry, power, and cooling also make this AMAX implementation as attractive as it is.

Customers who need more memory density can instead go the S2600WP route. Equipped with the same pair of LGA 2011 interfaces, this board also comes armed with twice as many DDR3 memory slots. It's a little wider, but it still fits into the Server System H2000. That extra space creates room for a fourth 16-lane PCI Express 3.0 slot. Then again, with the option for QDR/FDR InfiniBand on-board, you might not even need all of the board's add-in expansion.
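The earlier point that four-lane FDR InfiniBand needs a third-generation PCI Express connection on an eight-lane card can be checked with back-of-the-envelope math. The Python sketch below compares usable line rates after encoding overhead. The per-lane signaling rates and encoding schemes are taken from the published PCI Express and InfiniBand specifications rather than from anything quoted in this article, FDR is rounded to the nominal 14 Gb/s per lane that yields the 56 Gb/s figure, and packet and protocol overhead is ignored entirely, so treat the output as rough upper bounds.

```python
def usable_gbps(lanes, gbps_per_lane, payload_bits, encoded_bits):
    """Line rate left for data once the encoding overhead is removed."""
    return lanes * gbps_per_lane * payload_bits / encoded_bits


# InfiniBand, four lanes (nominal per-lane rates; assumptions noted above)
qdr_4x = usable_gbps(4, 10, 8, 10)      # QDR uses 8b/10b encoding
fdr_4x = usable_gbps(4, 14, 64, 66)     # FDR uses 64b/66b encoding

# PCI Express, eight lanes
pcie2_x8 = usable_gbps(8, 5, 8, 10)     # Gen2: 5 GT/s, 8b/10b
pcie3_x8 = usable_gbps(8, 8, 128, 130)  # Gen3: 8 GT/s, 128b/130b

for name, rate in [("QDR 4x", qdr_4x), ("FDR 4x", fdr_4x),
                   ("PCIe 2.0 x8", pcie2_x8), ("PCIe 3.0 x8", pcie3_x8)]:
    print(f"{name:12s} {rate:5.1f} Gb/s")

print("FDR fits in PCIe 2.0 x8:", fdr_4x <= pcie2_x8)
print("FDR fits in PCIe 3.0 x8:", fdr_4x <= pcie3_x8)
```

By this estimate, an eight-lane second-generation slot tops out well short of what a single FDR port can move, while an eight-lane third-generation slot has headroom to spare.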

Despite the compact dimensions in which they're forced to operate, Intel engineered its S2600JF and S2600WP motherboards to support a pair of 130 W processors. The result, of course, is incredible peak performance, as resellers are able to install any of the upcoming Xeons. When the platforms aren't being asked to run at full speed, their high-efficiency voltage regulators dynamically shut down unused power phases to cut back on consumption. And of course, the motherboards support Intel's Intelligent Power Node Manager software, which reports system-level, processor, and memory power, and limits energy use based on administrator-defined profiles. The H2000's cooling is handled on a per-node basis. So, if one of the system's fans goes out, a single node throttles down to protect itself. The other three nodes continue running at full speed, though. That's a unique differentiator. Competing systems employ larger fans tasked with cooling the entire server system. Without the more granular thermal isolation Intel enjoys, though, any loss of airflow potentially impacts the performance of all of the enclosure's nodes.

It certainly helps that the H2000's high-efficiency power supplies minimize the amount of power lost as heat. Intel gives you the option to use 1,200 or 1,600 W models, both of which are rated for greater than 92 percent efficiency and bear 80 PLUS Platinum certifications. Should you go the 1,600 W route, cold-redundant capability is built in to help maximize uptime (even if a power supply goes out). "Every 1,200- or 1,600-watt power supply is not created equal," says Intel's McKibben. "Ours, for example, you can flash its firmware and update. Cold redundancy means the second power supply is not even on until it needs to come online to back up a failure from the primary source."

McKibben points out that features like those are useful in any server environment. Naturally, what makes the Server System H2000 so ideal in an HPC application, specifically, is its density. With four nodes per 2U chassis and two processors per node, you're looking at up to eight of the fastest processors ever seen from Intel. With up to eight cores per CPU, that's a maximum of 64 physical cores, Hyper-Threaded, totaling 128 threads in flight at any given time. Then, with as many as 16 DIMM slots per node (on the S2600WP board), you have plenty of room to deploy massive memory footprints. "Our customers asked us to go do this because they want more choice, and this is a delivery of the promises we made to them a year ago when we said we'd triple our roadmap."
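Those density figures are simple multiplication, and it can help to keep them in a small model when comparing configurations for a customer. The Python sketch below uses the per-chassis numbers quoted above. The class and field names are invented for illustration, and the 20-chassis rack is a hypothetical example, not an Intel specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chassis:
    nodes: int
    sockets_per_node: int
    cores_per_socket: int
    threads_per_core: int
    dimm_slots_per_node: int

    @property
    def cores(self):
        return self.nodes * self.sockets_per_node * self.cores_per_socket

    @property
    def threads(self):
        return self.cores * self.threads_per_core

    @property
    def dimm_slots(self):
        return self.nodes * self.dimm_slots_per_node


# Server System H2000 populated with S2600WP boards and eight-core CPUs.
h2000 = Chassis(nodes=4, sockets_per_node=2, cores_per_socket=8,
                threads_per_core=2, dimm_slots_per_node=16)

print(f"Cores per 2U: {h2000.cores}")
print(f"Threads per 2U: {h2000.threads}")
print(f"DIMM slots per 2U: {h2000.dimm_slots}")

# Hypothetical rack: 20 chassis, leaving room for switches and power.
chassis_per_rack = 20
print(f"Cores per rack: {h2000.cores * chassis_per_rack}")
```

Swapping in the S2600JF means changing only `dimm_slots_per_node` to 8, which makes the memory-density trade-off between the two boards easy to show side by side.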

8. Locked And Loaded. Installed in a frame for the Server System H2000, the S2600JF motherboard’s PCI Express riser card and I/O module interposer card are both clearly visible.

Already, the results are speaking for themselves. "When the most recent Top500 list was unveiled, there were 10 systems based on the Sandy Bridge architecture that made it," McKibben says. "Six of those leverage our boards. Four of them center on S2600JF." Intel's software and solutions group even has its own cluster in DuPont, Washington, composed of 357 Xeon E5-based nodes, that takes 104th place on the prestigious list. Obviously, that one employs Intel boards too.

Soon we'll be able to go into more depth on the Xeon E5 family, along with the other server boards packing the EPSD group's portfolio (not to mention new enclosures, server systems, and add-in cards). Until then, know that the company definitely did listen to its customers, creating a broad family of platforms intended for critical markets like HPC.

Regardless of whether you jump headfirst into HPC now or wait until the next generation of targeted systems emerges, Intel is already ratcheting up its support programs to ensure more robust coverage. The Server System H2000, for instance, is covered by a three-year warranty. However, an extended warranty package can add two more years on top of that, minimizing the cost of support once the standard guarantee lapses. The same protection is available for existing EPSD boards as well, satisfying some of the academic and government contracts necessitating long-life deployments.

We've also discussed Intel's new on-site repair service, which is already available, and of course will become an integral part of backing upcoming Xeon E5-based servers. "If I'm at a doctor's office, for example, and I'm signed up for this service, when my computer goes down, I call my reseller," says Thor Mitchell, marketing operations manager at Intel. "They're going to say to get Intel involved, and a technician will diagnose down to the specific failed part and replace it." The ability to offload the support side of HPC is a big deal to resellers who are more focused on either the sales side or the software side. Moreover, utilizing Intel's reach as a service provider means extending your own. If the ability to work on hardware previously kept you constrained to a tight geographical area, on-site repair lays a foundation for work with new customers. In fact, Intel's techs blanket the entire United States, including Alaska and Hawaii.

Establishing A Foothold

9. Fundamental Building Block. Intel’s S2600JF motherboard has everything you need to build potent clusters, supporting two Xeon E5 CPUs, eight memory slots, three PCIe slots, and InfiniBand.

The world of high-performance computing is decidedly intense. "HPC isn't always an easy business to be in when it comes to the knowledge required," says Microway's Brett Newman. "You really have to prove your expertise to your customers." Scientific research and engineering applications are far more demanding than anything you'd ever find on the desktop, or even in the SMB space. "Customers call up with very technical questions, and the vendors who succeed are the ones who really know their stuff." AMAX's Chawade provides an example: "Washington State University is doing research with Apache Hadoop, a software framework borderline between the enterprise and HPC segments, to study the application of distributed graph processing methods to identify protein clusters and peptides from large-scale mass-spectrometry data. This is an application of computer science primitives to research problems."

Companies like Intel are taking the initiative to make HPC more accessible to their resellers by distilling down the configuration aspect. ESAA and Cluster Ready recipes already exist for Xeon 5600-series-based platforms. With the introduction of HPC-specific motherboards, you can bet the upcoming Xeon E5s will see even more emphatic support from the company. The hardware side of HPC is more intuitive than ever, but you can expect it to continue evolving through 2012. The successful VARs are the ones able to take the latest technology and turn it into right-sized solutions that address some of the most demanding applications around.

John Martinez is a member of the Intel Channel Board of Advisors and is the Editor-In-Chief of Retail Advocate Magazine / TechInsight.
