Information Technology and Computation 

For Science in the 21st Century

Peter Quinn, ESO October, 2000

1. Introduction

Two revolutions have taken place in the way we do science over the past twenty years. The Computational Revolution of the 1980s and 1990s added computation as a third component to the classical research methodology of theory and observation. Because computers simultaneously became fast, cheap and easily available, a large fraction of researchers had the opportunity to include computational modeling, analysis and visualization as part of the daily pursuit of insight. The Information Revolution of the 1990s gave researchers the chance to share the effort, investment and rewards of research programs across the globe. The wide and cheap distribution of the WWW infrastructure for connecting and distributing information and software opened the flood gates for individuals and organizations to partake in global science, commerce and culture.

While both revolutions feed off each other’s success, the drivers of each remain distinct. The PC market, video games, desktop publishing and the open source UNIX/LINUX paradigm continue to drive up the performance and software capabilities of desktop machines for a fixed investment. The doubling time of computer performance (Moore’s law) is approximately 18 months (Figure 1).

Figure 1 : Moore’s Law for computer performance growth

E-commerce, the globalization of information from science, art and culture, and the outreach activities of individuals, organizations and governments have driven the WWW to grow twice as fast as computer performance (doubling time for the number of sites ~ 9 months; Figure 2).


Figure 2 : Growth of the number of Web sites, 1996-2000


If the information volume continues to grow faster than the intrinsic growth of the performance on a desktop, then clearly we require new strategies for information processing and management. This is true regardless of the speed with which we can deliver that information to the desktop. There must be some fundamental changes in the way we develop and utilize information. The pursuit of international scientific research programs highlights specific drivers for these changes as well as strategies for the unification and reconciliation of the Information and Computation Revolutions.

2. Five Drivers for Change 

2.1 The Data Explosion

Over the past thirty years, improvements in radiation detector systems and digital electronics have moved scientific instrumentation from being largely analogue to predominantly digital. The impact of this trend on the data volumes that are produced by scientific instruments has been dramatic, particularly in those sciences that depend on imaging data. In astronomy, the number of Charge Coupled Device (CCD) pixels looking at the sky has grown 100 times faster than the area of available optical telescope mirrors in the last 30 years (Figure 3). Since new facilities like the ESO VLT come online at a particular point in time after a long (~10 years) development phase, the instantaneous impact of the increased data volume on users is explosive. In the period from 2000-2003 the typical data sets delivered to users of the VLT will increase in volume by two orders of magnitude (to hundreds of gigabytes per night). This represents a doubling time of less than six months. The same is true for Earth resources research following the launch of new-generation satellites, which now deliver terabytes of images per day, or for the switch-on of a new particle physics detector (like the Compact Muon Solenoid, CMS) with instantaneous data rates in the tens of terabytes per second. In the first decade of the 21st century European scientists will encounter several of these dramatic increases in available data volume (e.g. completion of VLT in 2002). The current growth of computer power, storage and network speed (following Moore’s law or worse) is a factor of approximately three too slow to address these data volume explosions.
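The doubling time and the factor of three quoted above follow from simple arithmetic. The short Python sketch below reproduces them, assuming smooth exponential growth over the 36 months from 2000 to 2003; the function name is illustrative only.

```python
import math

def doubling_time_months(growth_factor, period_months):
    """Doubling time implied by growing by growth_factor over period_months,
    assuming smooth exponential growth."""
    return period_months * math.log(2) / math.log(growth_factor)

# Two orders of magnitude over the three years 2000-2003 (figures from the text).
data_doubling = doubling_time_months(100.0, 36.0)

# Moore's law doubling time quoted in the introduction.
MOORE_DOUBLING_MONTHS = 18.0

# How many times slower hardware doubles than the data volume does.
shortfall = MOORE_DOUBLING_MONTHS / data_doubling

print(f"data doubling time: {data_doubling:.1f} months")
print(f"hardware doubles about {shortfall:.1f} times more slowly")
```

Under these assumptions the data volume doubles roughly every five and a half months, a little over three times faster than an 18-month Moore's law.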


Figure 3 : Total area of 3m+ telescopes in the world in m², total number of CCD pixels in Megapix, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels (A. Szalay, JHU).

2.2 New Science

The increase in data volume generated by scientific instruments is not just a function of greater technological capabilities. Both astronomy and physics are engaged in data intensive research. The most energetic events in the Universe (Gamma Ray Bursters) occur at random and can brighten to naked-eye brightness for only a few seconds. Potentially Earth-destroying asteroids hide among tens of millions of faint point-like sources distributed over the entire sky. Higgs particles may be found by one single track out of petabytes of particle detector data. The search for rare events in the physical sciences means scanning billions or trillions of individual events to find the one that verifies the theory or unlocks new physics. Similarly, many new discoveries are found through the statistics of large samples. Statistical criteria for the detection of parameters at low contrast to the background drive up the data volume. Measuring the fluctuations in the mass density of the Universe, as traced by the large scale distribution of galaxies on the sky, requires millions of individual galaxy measurements to obtain the accuracy necessary to distinguish one cosmological theory from another.

The size and uniformity (integrity) of large data samples from new generation particle detectors, telescopes or satellites make it possible to accurately cross-compare data across time, wavelength (energy) and resolution. Data repositories can be mined and new types of data (meta-data) can be formed. The usefulness of data is multiplied by putting it to uses beyond its original intended experimental purposes. The ability to cross-correlate large uniform surveys of the sky in the optical with similar surveys in the infrared or radio will enable us to study the digital Universe without the bias and imposed selection effects that come from only having complete data through one wavelength window. The data management and data processing facilities to do these global explorations of databases will only be developed by breaking the doubling-curves for processing power, storage and bandwidth.

2.3 The Cost of Software

With the move of scientific instruments from the analogue to digital world, the need for scientific software has greatly increased. Data analysis, visualization, instrument control, statistical, graphical, program management and web-based software are some of the many typical development programs that comprise major research efforts. However, the resources necessary to produce and maintain this software are increasing with time in both an absolute and relative sense. Writing and maintaining large software systems requires individuals with skills that are highly marketable. In many areas (particularly database applications and the WWW) industry can offer individuals many times the salary that can be afforded by scientific research projects. Hence the cost of producing, testing and maintaining a given number of lines of code increases with time due to market forces. Furthermore, the fractional cost of the software development and maintenance in a given project has increased by a large factor (for industry : from less than 25% to over 80%) over the past 15 years.

Figure 4 : Fractional increase in the cost of project software components over time

This is due to the decrease (Moore’s law) in the cost of a given computer performance and storage solution relative to the associated manpower effort. Several international research programs have made major tactical errors in planning by not allocating a sufficiently large fraction of resources to software. If this trend continues, then an obvious strategy is to find mechanisms by which software developments can be reutilized and globalized over many different programs in different areas of science. The access to large databases, their data management and data analysis challenges, are common needs across astronomy, particle physics, biology and Earth resources management. The gathering of clearly defined, common requirements across these disciplines in key software areas would enable the definition of infrastructural software development programs that could serve multiple research areas and greatly lower the per-project software cost.

2.4 New Audiences

The growth of the WWW has enabled the world of science to be opened to new audiences. Every day, people from all walks of life surf the net for pleasure and education. People want to be involved in what science is doing and they want to have their questions answered. A large fraction of the hits on the ESO web (approximately 50%) access material prepared explicitly for education and public outreach. The SETI@home project from UC Berkeley gave 2.3 million people the chance to contribute 3.5 x 10^20 floating point operations of computation to one scientific project.

As the capabilities of desktops increase and as the data management and exploration systems necessary for research come online, we will have new opportunities to provide people with chances to learn and to be involved. Distributed, digital astronomical surveys of the entire sky could be presented as a digital Universe in which students and the public can design their own telescopes, create their own research programs and even find something new and wonderful that can only be done with the distributed effort of many keen explorers. Linking the public with the kinds of information and computation infrastructure necessary for research can only further multiply the scientific and cultural return.

2.5 Optimizing Returns

Programs like VLT, ALMA, LHC and the Cornerstone Missions consume large fractions of the total available research funding in particular disciplines for periods of order ten years. The scientific return on these investments must be maximized. The VLT and many space-based astronomy projects are investing approximately 10% of the total available science time and 20% of the project costs to ensure archives exist that are filled with data which are useful and usable for the future. In the case of the HST, this strategy has already led to a multiplication by a factor of four in the scientific usefulness of, and return on, HST program data. By linking archival resources, and providing the tools to exploit them, the total scientific return on program investments will be greatly multiplied. This will increase the marketability of any program that contributes data to the communal meta-data resource.

3. Essential Strategies and Technologies

Globally, the volume of information and the demand to access it are growing faster than current technologies can service on the desktop of the international user community. This is particularly true in the sciences, where data-explosive events (the arrival of new facilities after long development periods) punctuate the crisis that smoothly grows worse for the world at large. This situation calls for a new paradigm encompassing the evolution and joining of the Computation and Information revolutionary processes. For the past several years scientists have led the way in discussing this crisis and paradigm shift. What has naturally emerged is the concept of a computation/information GRID in which resources (hardware and software), as well as information, are available in a distributed system. In this way, compute resources can be developed at specific locations to service the needs of users without the need to enhance the individual desktop resources. Furthermore, this implies that large data volumes need not be shipped to desktops but to these resource centres, where data refinement and volume reduction take place before movement to the desktop. The concept of specialized resource and data nodes also makes it natural to share software developments by providing large sections of the user community with a common group of tools and systems supported at a specialized location. At the same time, the commercial sector has found its own solutions to the same problems and the implementation of the GRID paradigm will certainly utilize commercial developments. The GRID concept allows us to address all five of the specific growth drivers outlined in section 2. The following topics highlight the essential ideas and technologies that will form major components of the deployment strategy for GRID programs.

3.1 Interoperability

The rise of the WWW was made possible by the development of a common software infrastructure for locating (URL) and storing information (HTML) on a distributed set of machines. These systems allowed information providers and information distributors to interoperate. It gave anyone with information to distribute a common, bi-directional interface to the internet. The GRID concept requires this infrastructure to be expanded to include resource providers and users as well as information providers and users. We need to be able to specify what kinds of resources are available at some location (storage volume, compute power, specialized software, specialized hardware etc.), what kinds of services they provide, and how they accept and return information. This has to occur with the same transparency as the WWW and with the ease of current browser technology. A large fraction of the initial investment in the GRID paradigm must go into the development of a resource interoperability layer that melds with the current WWW information infrastructure.

3.2 Data and Service Hierarchies

As mentioned above, the ability to move data to the desktop is not growing at the same rate as the volume of data nor at the same rate as computer resources on the desktop. The doubling time of connectivity to the desktop (Nielsen’s Law) is approximately 20 months (Figure 5). Compounded over ten years, the difference with Moore’s Law results in a relative growth in computer power that is roughly 1.6 times the relative growth of connectivity bandwidth.
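The compound effect of the two doubling times can be checked directly. The Python sketch below uses the 18-month and 20-month figures quoted in the text and assumes steady exponential growth over the decade; the function name is illustrative only.

```python
def growth_factor(doubling_months, years):
    """Total growth after `years` for a quantity that doubles every
    `doubling_months`, assuming steady exponential growth."""
    return 2.0 ** (12.0 * years / doubling_months)

YEARS = 10
compute = growth_factor(18.0, YEARS)    # Moore's law, doubling time from the text
bandwidth = growth_factor(20.0, YEARS)  # Nielsen's law, doubling time from the text

# Relative advantage of desktop compute power over desktop connectivity.
ratio = compute / bandwidth

print(f"compute grows by a factor of   {compute:.0f}")
print(f"bandwidth grows by a factor of {bandwidth:.0f}")
print(f"compute / bandwidth = {ratio:.2f}")
```

With these doubling times, computer power grows about a hundredfold while bandwidth grows by a factor of 64, so the ratio comes out near 1.6: computer power outgrows connectivity by a factor of between one and a half and two over the decade.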


Figure 5 : Nielsen’s Law for connectivity growth

The paradigm shift necessary to address this problem has two parts. First, move the data to distributed resources so that the content can be served over a multitude of channels in parallel rather than from one central source. Second, do not try to move all the data but rather those subsets or refined data products that are of particular interest to the end user. This paradigm shift has already occurred in the commercial market place. Large organizations like CNN cannot meet the download needs of all their customers from their central site in Atlanta. The congestion on internal US networks and, more importantly, the limited international network bandwidths would reduce the flow of information from Atlanta to a trickle and send customers to other, probably local, providers. Currently CNN uses content management companies (e.g. Digital Island, Akamai) to copy and manage their content on tens of thousands of servers distributed over the globe. When you download a page from CNN it comes from the nearest, fastest available, local content provider in a transparent manner. Furthermore, using the browser on your PC, you can tailor the delivered content (e.g. My CNN.com). This tailoring is done via distributed resource engines that predetermine the locally delivered content, again in a transparent manner. The local CNN content and tailoring servers are distributed data and resource nodes in the spirit of the GRID paradigm.
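The second part of this shift, refining data at the node and shipping only the subset of interest, can be illustrated with a toy sketch in Python. Everything here is hypothetical: the Source record, the select_on_node function and the synthetic catalogue are stand-ins for illustration, not part of any existing archive interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    """One catalogue entry held on a (hypothetical) data node."""
    ra_deg: float
    dec_deg: float
    magnitude: float

def select_on_node(catalogue, ra_range, dec_range, max_magnitude):
    """Runs where the data lives: keep only the sources the user asked for,
    so that only this subset has to travel to the desktop."""
    ra_lo, ra_hi = ra_range
    dec_lo, dec_hi = dec_range
    return [
        s for s in catalogue
        if ra_lo <= s.ra_deg <= ra_hi
        and dec_lo <= s.dec_deg <= dec_hi
        and s.magnitude <= max_magnitude
    ]

# Synthetic all-sky grid standing in for a large survey catalogue.
catalogue = [
    Source(float(ra), float(dec), 10.0 + (ra + dec) % 15)
    for ra in range(0, 360, 2)
    for dec in range(-90, 91, 2)
]

# The user is interested in one small, bright patch of sky.
subset = select_on_node(catalogue, (10.0, 20.0), (-5.0, 5.0), 18.0)

print(f"held on node: {len(catalogue)} sources")
print(f"shipped to desktop: {len(subset)} sources")
```

The point of the sketch is only the division of labour: the selection runs next to the data, and the network carries the result rather than the full catalogue.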

Figure 6 : The GRID paradigm in action : CNN content and tailoring services distributed globally by specialized nodes coordinated from a central organization.

Data processing and storage capabilities of computers both follow a Moore’s law with the same doubling time. This implies data volume growth will saturate the resources of any given single processor or storage device. The increase in performance at a fixed cost and the economies of scale in the PC market place have led to a natural rise in the utilization of parallelism for computation and storage to beat the data volume problem. Since the early 1980s, scientists have been active in using connected systems of identical computing nodes (workstations or PCs) to solve particular kinds of problems in an efficient manner. The current records for computational and bandwidth performance (Gordon Bell awards) are held by massively parallel machines, either from the commercial sector (SGI Power series) or following the Beowulf paradigm of utilizing commercial off the shelf (COTS) PCs. By distributing the computational load and the data volume across networked computers and storage nodes, we achieve a scalable system that can be grown to meet the challenge presented by rapid data volume growth. The development and exploitation of parallel storage and processing architectures will form an essential part of GRID development as outlined in existing GRID-oriented proposals from CERN and ESO. This development will lean on current efforts in the commercial sector (e.g. Microsoft storage server arrays) and initiatives in the research community following the Beowulf paradigm.
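Distributing the data volume and the computational load across nodes follows a scatter/gather pattern, sketched below in Python. This is a minimal illustration under stated assumptions: worker threads stand in for networked nodes (on a real cluster each chunk would live on a separate machine), and the function names, the synthetic pixel values and the threshold are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n_nodes):
    """Split items into n_nodes nearly equal contiguous chunks."""
    size, extra = divmod(len(items), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def count_bright(chunk, threshold):
    """Work done locally on one node: count values above the threshold."""
    return sum(1 for value in chunk if value > threshold)

def scatter_gather(items, n_nodes, threshold):
    """Scatter chunks to the nodes, reduce locally, gather the small results."""
    chunks = partition(items, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partial_counts = list(pool.map(lambda c: count_bright(c, threshold), chunks))
    return sum(partial_counts)

# Synthetic pixel values standing in for a large image.
pixels = [(i * 37) % 1000 for i in range(100_000)]
total = scatter_gather(pixels, n_nodes=8, threshold=990)

print(f"bright pixels found across all nodes: {total}")
```

Each node touches only its own chunk and returns a single number, so adding nodes increases both the storage and the processing capacity of the system, which is the scalability property the text relies on.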

The distribution of data to specialized nodes and the utilization of parallelism to provide node resources and storage capabilities are essential concepts within the GRID paradigm. The design of data and resource hierarchies will be among the first steps in studies of GRID systems for physics and astronomy (e.g. Computational GRID Testbed for Particle Physics and the Astrophysical Virtual Observatory).

4. Conclusions

The growth rates of computer power, data volume and data bandwidth define a three parameter universe in which international scientific programs will have to be born and flourish. Given the relative sizes of these three “cosmological parameters”, the information/computation universe will collapse around us, smothering our ability to conduct large projects, unless we adopt particular strategies for development which avoid the big crunch. The particular nature of international scientific research programs (“punctuated” evolution of resources) highlights the need for a new paradigm.

The data volume problem generated by new instruments and new science, the cost of software development, the need for outreach and the need to maximize return on investment can all be addressed by a common strategy. The development of a GRID of storage and processing resource servers which melds with the WWW will enable data volumes to be managed and processed with COTS-based parallelism. It will further allow software costs to be shared by serving needs from central nodes, enable the public to reach new resources through the WWW, and provide the channels to support distributed archives for long-term data mining.

The construction of a virtual astronomical observatory is an ideal example of a GRID initiative for astronomy. The observatory is a centralized resource for doing science; it provides coordination and balance of resource and infrastructure growth and the opportunity to aggregate resources (telescopes) for a particular program. The “telescopes” are the distributed and specialized data and resource nodes (e.g. data centres for particular missions) and the “astronomical instruments” are the software tools determined by scientific requirements and configured by users at their desktops to conduct research programs. This scalable hierarchy of capabilities mirrors developments in the commercial sector and addresses the long term data management, analysis and archival challenges facing astronomers. Many of these astronomical needs are shared by programs in particle physics and in the biological, social and Earth sciences.