Despite dramatic increases in the computing power available to engineers and designers over the past 20 years, there remains a large appetite for more among those working in mechanical computer-aided engineering (MCAE). Recent technological advances, such as increases in processing speed and the packaging of processing chips in a workstation, have nevertheless made it possible for your workstation to become your low-end server.

Typically, a server is assigned to perform tasks on behalf of the user and is located in a strictly controlled data center environment, physically separated from the user. This ensures stable operating temperature, humidity, and power, as well as restricted access. Despite all the benefits of a dedicated data center infrastructure, however, a workstation located at a user’s desk can be a tremendous computing asset because it allows for individual control.

CAD and CAE have been merging over the last 15 years or so. As the cost and ease of use of MCAE software have improved, engineers who are CAD specialists have gained access to integrated analysis tools. Their jobs have grown from primarily creating CAD models to doing some preliminary analysis on their designs, perhaps before more sophisticated analyses are performed. These engineers can now perform multiple tasks on the same workstation, all within the same application framework. The data can be easily transferred back and forth between CAD and MCAE applications.

A Case in Point

A modern workstation such as the Sun Ultra 40 M2 from Sun Microsystems contains two AMD Opteron processors. Each of these processors (the 2200 series) contains two cores, or individual processing units. Current AMD technology enables two cores to be placed on a single chip. In the future, more cores will be placed on each chip (see sidebar, “Scaling to Multiple Cores,” below). Thus, a modern workstation can contain four processing units (two sockets, each with two cores).
There are certain workstation models on the market today with a total of eight cores. With each core able to perform 5.2 gigaflops (5.2 billion floating-point operations per second) as a measure of peak performance, a single four-core workstation could perform 20.8 gigaflops. Given its compact size and quiet operating environment, this type of workstation can surely be used for small to medium MCAE tasks.

[Figure: Sun Ultra 40.]

Since a workstation is designed with interactivity in mind and the return on investment (ROI) is measured by the amount of work that the user performs, it is important to examine the effect of running MCAE applications on users’ workstations that are already engaged in daily CAD work. On a Sun Ultra 40 M2 workstation running Solaris 10 OS with two AMD sockets, each configured with a dual-core chip, we first ran Olaf Corten’s OCUS (proesite.com) CAD benchmark. The OCUS benchmark runs a predefined job using Pro/Engineer from PTC. This benchmark measures CPU speed, graphics speed, memory access, and I/O, thereby generating the total wall clock time needed to run the job.

A subsequent run was then performed, running just the CAE application. The CAE component, a popular computational fluid dynamics application, initially ran solely on one of the cores on the second socket. Since the CAE application is multi-threaded, the job can be completed faster by allowing the application access to more computing resources. It first used one thread and was then allowed to use three threads, fully occupying the system while the OCUS benchmark ran on the remaining core. We monitored the CAD benchmark to ensure the analysis did not significantly affect the interactivity of the CAD user.

A number of interesting results were obtained when running this combined CAD and MCAE workload:

• Running on its own, the OCUS benchmark ran in 1800 seconds.
• The baseline for the MCAE benchmark was 4985 seconds.
This was set up to use only one core in the four-core system.
• Running the MCAE benchmark using three cores, the elapsed time was 1644 seconds, or 3.03 times as fast as the single-core time (a superlinear speed increase).

This scaling occurs with many analysis programs. Some applications scale well with up to four processors, while others show good scaling to 64 or even 128. With current technology, workstations today can contain up to four cores, which matches up well with certain application scaling limits.

In the final test we ran the OCUS benchmark and the three-core MCAE application at the same time. The OCUS time increased to 1978 seconds, a rise of about 10 percent. The MCAE time increased to 1691 seconds, a rise of about 2.9 percent.

So What Does This Mean?

These results demonstrate that workstations, even while being used for CAD design applications, can effectively be used for background CAE applications. The slowdown in performance is less than 10 percent for the CAD application. Note that this was a fully utilized system, running all out. Typically, a user will spend a significant amount of time looking at the screen and making decisions rather than running a batch input file, so any slowdown would be barely noticed.

While a workstation can be used in isolation for small analysis tasks, it is also necessary to integrate workstations into an overall analysis environment. Distributed resource management (DRM) software such as the Sun N1 Grid Engine can detect when a workstation’s resources are available and send analysis jobs to the workstation. The software matches job requirements with available compute resources to effectively integrate desktop workstations into a grid, reducing the time needed to complete analysis tasks.

[Figure: Performance Benchmarks.]

The workstation design point is different from the design point for servers.
Although the electronic design may be similar in terms of CPU clock speed, memory requirements, and expandability, a workstation is designed to sit in an office environment, not a data center or computer room. Since the size (volume) of the workstation unit is not as critical as for a server, which is designed for density, different components can be used. If the amount of heat to be removed is similar to that in a server, then a slower, larger fan can be used. This results in less noise, which is critical in a workstation and office environment. Workstations must also be designed to hold a graphics card and have easily accessible connectors for video, audio, keyboard, and mouse.

New Add-Ons Mean a New API

New developments in the area of add-on processing power for numerically (floating point) intensive calculations are also coming into play. There are three main areas where this is occurring, and there is a cost. When different computing platforms are combined in a single enclosure and on the same operating system, significant programming challenges crop up. Issues with correct versions of the OS, drivers, patch levels, and so on also become apparent.

The first of these areas of heterogeneous computing is the use of graphics processing units (GPUs) to support calculations. A GPU, originally developed to speed up graphics calculations, can now be used in certain application areas to speed up non-graphics calculations. Programmable GPUs are available on almost all products from companies such as ATI, now a part of AMD, or NVIDIA (see “CUDA Architecture” below).

A second method to speed up calculations for floating-point intensive applications is to use an add-on board specifically designed for this purpose. An example of one that is currently performing real work at the Tokyo Institute of Technology in Sun Fire X4600 servers is from ClearSpeed. This accelerator increases the performance of certain types of applications.
When combined in a system that has the ability to feed the data to it at high speeds, a ClearSpeed accelerator board can greatly contribute toward improved performance (see “ClearSpeed” below).

The third type of acceleration is the use of FPGAs (field programmable gate arrays), which enable the programmer to reconfigure the gates for specific algorithms. This can lead to impressive performance gains for a specific application.

[Figure: Sun Ultra 40 with its side removed.]

All three of these acceleration technologies require the developer either to use a new application programming interface (API) or to construct the application to use standard libraries that can be replaced with hardware-specific ones at run time. With these new technologies, there is potential for application performance to improve beyond what general-purpose CPUs will achieve. This is an emerging market.

Even though a significant number of MCAE applications will continue to be run on rack-mounted or blade-type servers sitting in data centers, it is clear that recent improvements in workstation processing power allow for medium-sized data sets to be used. This enables closer integration between CAD and CAE, and will help bring products to market faster, which is just the ROI many manufacturers are looking for.

Michael Schulman is a product line marketing manager in the High-Performance Computing group at Sun Microsystems. Michael Burke, Ph.D., is the lead engineer at the Sun Performance Lab at Sun Microsystems. Comments about this article can be sent by e-mail; please reference "Server on Every Desk" in your message.

Scaling to Multiple Cores

From the outset, AMD64 architecture was designed to scale to multi-core processors. In April 2005, two years after the AMD Opteron processor launch, AMD introduced the first multi-core technology for x86-based servers and workstations with the Dual-Core AMD Opteron processor.
Next-Generation AMD Opteron processors are designed to offer seamless upgradability to Quad-Core AMD Opteron processors. This occurs within the existing thermal infrastructure, protecting existing investments and providing improvements in application performance and system performance-per-watt.

The multi-core upgrade path built into Next-Generation AMD Opteron processors is enabled by AMD64 technology with Direct Connect Architecture, which helps eliminate the bottlenecks inherent in traditional front-side bus architecture by directly connecting processors, memory controller, and the I/O to the central processing unit (CPU). This architecture provides low-latency, high-bandwidth memory access, which is a critical feature for high-performance computing applications.

- MS & MB

CUDA Architecture

NVIDIA Compute Unified Device Architecture (CUDA) technology is a new architecture for computing on NVIDIA graphics processing units (GPUs) using the industry’s first C-compiler development environment for the GPU. GPU computing with CUDA is a new approach to computing where hundreds of on-chip processor cores simultaneously communicate and cooperate to solve complex computing problems up to 100 times faster than traditional approaches.

This breakthrough architecture is complemented by another first: the NVIDIA C-compiler for the GPU. This is a development environment that gives developers the tools they need to solve new problems in computation-intensive applications such as product design, data analysis, technical computing, and game physics.

CUDA-enabled GPUs offer dedicated features for computing, including the Parallel Data Cache, which allows the 128 processor cores (running at 1.35GHz) in the newest-generation NVIDIA GPUs to cooperate with each other while performing intricate computations. Developers access these new features through a separate computing driver that communicates with DirectX and OpenGL, and through the new NVIDIA C compiler for the GPU.
A CUDA-enabled GPU operates either as a flexible thread processor, where thousands of computing programs called threads work together to solve complex problems, or as a streaming processor in specific applications such as imaging, where threads do not communicate. CUDA-enabled applications use the GPU for fine-grained, data-intensive processing, and use the multi-core CPUs for complicated coarse-grained tasks such as control and data management.

- MS & MB

ClearSpeed

Acceleration is one of the hottest topics in the high-performance computing community today. Clusters based on “industry standard” architectures such as Sun’s servers (based on AMD Opteron processors) now dominate the high-performance computing market, yet power, cooling, and facilities issues have become a major inhibitor in the pursuit of affordable compute cycles.

ClearSpeed Technology produces PCI accelerator boards specifically designed for the needs of HPC users, unlike alternatives based upon FPGAs, GPUs, and games processors. ClearSpeed’s Advance accelerators can sustain over 75 GFLOPS of IEEE 754-compliant double-precision matrix multiplication while averaging only 25 watts of power consumption.

ClearSpeed’s accelerator board technology, which accelerated the fastest supercomputer in Asia to 47 TFLOPS and the ninth position in the TOP500 list of most powerful computer systems (a 24 percent performance boost while adding only one percent to the power consumption), can also be used in workstations.

- MS & MB

Contacts

Opteron 2200 Series Processors: AMD (Advanced Micro Devices, Inc.), Sunnyvale, CA
Advance Accelerators: ClearSpeed Technology, Inc., San Jose, CA
CUDA Technology: NVIDIA Corp., Santa Clara, CA
OCUS Benchmark: Olaf Corten’s ProESite
Pro/Engineer: PTC, Needham, MA
Fire X4600, N1 Grid Engine, Ultra 40: Sun Microsystems, Inc., Santa Clara, CA
Tokyo Institute of Technology
TOP500.Org
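
As a rough cross-check of the benchmark figures quoted in this article, the following Python sketch recomputes the peak-performance, speedup, and slowdown numbers from the reported wall-clock times. The timings and the 5.2 gigaflops-per-core figure come from the text. The function and variable names are illustrative assumptions and are not part of the OCUS benchmark or any other tool mentioned here.

```python
# Figures quoted in the article (wall-clock seconds unless noted).
CAD_ALONE = 1800.0        # OCUS CAD benchmark running on its own
CAE_ONE_CORE = 4985.0     # MCAE benchmark baseline on a single core
CAE_THREE_CORES = 1644.0  # MCAE benchmark on three cores, no CAD load
CAD_SHARED = 1978.0       # OCUS time while the three-core MCAE job runs
CAE_SHARED = 1691.0       # three-core MCAE time while OCUS runs
GFLOPS_PER_CORE = 5.2     # quoted peak per core
CORES = 4                 # two sockets, two cores each


def speedup(baseline_s: float, parallel_s: float) -> float:
    """Ratio of baseline time to parallel time (higher is faster)."""
    return baseline_s / parallel_s


def efficiency(baseline_s: float, parallel_s: float, cores: int) -> float:
    """Speedup divided by core count; above 1.0 means superlinear."""
    return speedup(baseline_s, parallel_s) / cores


def slowdown_percent(alone_s: float, shared_s: float) -> float:
    """Percentage increase in run time when the machine is shared."""
    return 100.0 * (shared_s - alone_s) / alone_s


def summarize() -> dict:
    """Collect the derived quantities discussed in the article."""
    return {
        "peak_gflops": GFLOPS_PER_CORE * CORES,
        "cae_speedup": speedup(CAE_ONE_CORE, CAE_THREE_CORES),
        "cae_efficiency": efficiency(CAE_ONE_CORE, CAE_THREE_CORES, 3),
        "cad_slowdown_pct": slowdown_percent(CAD_ALONE, CAD_SHARED),
        "cae_slowdown_pct": slowdown_percent(CAE_THREE_CORES, CAE_SHARED),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Run as a script, it prints each derived value. A parallel efficiency slightly above 1.0 corresponds to the superlinear speedup noted in the results, and both slowdown percentages stay under 10 percent.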