Examples of SS/T include stand-alone entities like compilers, operating systems, and debuggers, as well as APIs (Application Program Interfaces, or software that simply provides an interface to lower-level functionality) such as math libraries, I/O libraries, and operating system command languages. Software packages such as Matlab, Nastran, and WaveFront, on the other hand, are not SS/T but applications built using SS/T.
The line dividing applications from SS/T is not always sharply drawn, and debate over specifics will continue to occur. But SS/T, by providing essential software infrastructure for developing and executing applications, is central to all HPC endeavors.
Reliability and robustness are difficult issues for the HPC community. User organizations often require delivery of new systems at the earliest possible date. Because of the complexity of parallel systems, however, system software and tools are complicated, and early delivery may mean that they are relatively untried. As a result, the users' initial contact with the software is often quite negative, a situation that improves only slowly. Yet without certain guarantees that software will be reliable and robust, it is difficult to attract new users and new applications.
From the perspective of the user community, it may appear that vendors are unresponsive to user needs. From the vendor's perspective, all available resources are often consumed just to "keep up" as market forces require new systems, languages, and features. Each vendor has a number of outstanding high-priority requests; these often conflict with one another, whether they come from different user sites or from different sources within the vendor organization itself. Vendors must also expend effort differentiating their products from those of competitors.
Even more insidious is the problem that SS/T are typically frozen within a year or two of a machine's release. Products on that machine will not be evolving to meet new needs, while later machines are likely to have a dramatically different software base. This means that the efforts spent by application developers in learning and adapting to a particular machine will only pay off for a brief period.
Compounding the problem is the fact that few, if any, of today's users have the luxury of developing applications for a single platform. The rapid rate of change in HPC technology requires that codes be migrated from platform to platform in relatively short order. SS/T is notoriously inconsistent from one machine to another. An application developer who recognizes the need to port a key application to multiple machines is faced with a long, steep learning curve for each. Considering the short lifespan of HPC machines and their SS/T, this inconsistency is a strong deterrent against any non-governmental organization investing significantly in the development of parallel applications.
It was this problem -- the inconsistency of SS/T on HPC machines -- that the task force was convened to address. Specifically, it was charged with establishing the basic requirements for a "standard" software infrastructure that would support the development of parallel applications. At the present time, standards (whether official definitions sanctioned by a standards organization or de facto standards that evolve from grass-roots efforts) are the primary mechanism for enabling application portability. Yet it is clear that vendors alone cannot be expected to develop standards; standardization actually works against their interests, in the sense that it masks the product differentiation they need to maintain competitive position. Since a standard is useful only if it is actually implemented across a range of vendor platforms, it is the responsibility of the user community to establish the need for standards, participate in their definition, and provide vendors with the proper incentives to implement them.
Extending the notion of standards to a program development environment is not a new idea. Several recent workshops and conferences (e.g., HPCC Grand Challenges Workshop, ARPA/NSF Workshop on Parallel Tools, ACM/ONR Workshop on Parallel and Distributed Debugging) have discussed this need and proposed that task forces establish how SS/T might be specified for inclusion in Requests for Proposals (RFPs). If such a standard existed, HPC providers would be able to anticipate requirements, and hopefully deliver a robust working environment earlier in the cycle of new system releases. If multiple vendors could be induced to implement the environment quickly, users would at last have access to a robust environment that is consistent across multiple HPC platforms.
In recommending the formation of this task force, the Second Pasadena Workshop on System Software and Tools for High-Performance Computing Environments noted that there were several key issues to be addressed if such a standard were to be effective:
Participants were divided into three working groups to address three levels of support: operating system and system administration support (led by L. A. Tanner of NASA/Ames); low-level programming interface, primarily libraries (C. Pancake of Oregon State); and high-level programming environment (T. Welcome of LLNL). The users were constrained to applications of known technology; that is, they were directed not to include capabilities that would involve the development of new technology. Plenary discussions were used to identify priorities from the group as a whole. After the meeting, each of the three subgroups spent a month conferring electronically to refine their lists. The result was approximately 150 capabilities, of varying levels of detail, each associated with a very fuzzy priority level (to assist in the next step).
Those capabilities then served as input to the second step, where SS/T developers from industry gathered to review the capabilities. The meeting, which took place at Caltech in August, had the objective of clarifying the capabilities and assigning an approximate "level of effort" indicator to each, reflecting the relative cost of implementing that capability today. The ten major HPC machine providers (Cray, Convex, Digital, IBM, Intel, HP, MasPar, Meiko, SGI, and Sun) were invited to attend; each was offered three seats, which they could fill internally or delegate to their independent software vendors (ISVs). Participants included Applied Parallel Research, BBN, Convex, Cray Research, Portland Group, IBM, Intel, SGI, Sun, and Tera. Five user representatives from the previous meeting were also on hand to clarify the intent of the capabilities.
Much of the three-day period involved discussions of what was really needed vis-a-vis existing software bases for HPC platforms. In many cases, vendor representatives indicated that relatively slight modifications of user requests would make significant differences to implementation costs. The group also identified dependencies among the capabilities (e.g., "capability X will be short-term if Y is done as well, but will require twice as long if done alone"), and perceptible cost thresholds (e.g., "capability X will be short-term if users will accept version 3.2, but will be twice as costly if updates are needed for the semiannual version releases"). Many items were subdivided in this process, yielding a total of over 225 capabilities. After the meeting, vendor representatives continued the discussions by email, arriving at collective estimates of the extent to which capabilities were already implemented, industry-wide, and the investment that would be needed to complete other capabilities.
The third meeting, which took place in Santa Fe at the end of September, used those level-of-implementation and level-of-effort estimates as the basis for prioritizing user requirements. Participants were primarily user representatives, but a few of the vendors attended in order to answer questions about the cost estimates. The group reviewed all the cost-associated capabilities with the objectives of ranking them into categories and formulating them as guidelines for drafting procurement requests. Each capability was either dropped from consideration (typically because the cost was higher than its perceived usefulness warranted), identified as an essential element of the Baseline Development Environment that should be present on any HPC platform to provide a consistent program development infrastructure, or associated with one of three priority levels indicating its likely importance to significant numbers of user sites.
A subsequent period of email discussion yielded 55 baseline capabilities, plus 105 additional priority items. The numbers themselves are not particularly indicative, however, since individual capabilities vary considerably in scope.
The task force effort was endorsed by several national groups, including the Parallel Tools Consortium and the Scientific and Engineering Computing Working Group of the National Coordinating Office for HPCC. Over sixty representatives from major user sites and commercial software vendors participated in the three meetings and extended email discussions. None of the effort would have been possible without the enthusiasm and dedication shown by these individuals and their organizations.
It must be noted that user and system requirements evolve continuously. Therefore, a time limit of two years has been imposed on the guidelines laid out in these documents. The task force participants strongly urge the sponsoring agencies to establish a second task force in mid-1997. That group should thoroughly revise the capabilities, taking into account not just technological advances but also the experiences of agencies that have applied the guidelines in their procurement procedures.
To this end, the organizers encourage groups making use of the guidelines to furnish comments and constructive criticisms based on their experiences. The following contact points should be used: