SMPTE 595 West Hartsdale Avenue White Plains, NY 10607-1824 USA
SMPTE Publication Report of the Task Force on Digital Image Architecture September 1992
SMPTE 595 West Hartsdale Avenue White Plains, NY 10607-1824 USA Phone: (+1) 914 761 1100 Fax: (+1) 914 761 3115
The emergence of digital coding as the common language of visual communications may fundamentally change our view of the world. The extent to which this common language will affect life in the 21st century may be even more profound than the effect that the medium of television has had on life in the 20th century. Television has provided a window to the world - often real-time - for many of the 5.4 billion inhabitants of this planet. This medium of cultural and information exchange has enabled previously isolated populations to join an emerging global village - one increasingly free of barriers. The common digital language offers a unique opportunity to leverage converging technologies, such as television, computers and telecommunications, into a global communications network. Such a network would have the potential to offer a vastly augmented range of services to all system users, thus opening up new markets to all of the affected equipment and service providers.
Worldwide, there is a growing consensus that the time has come to develop standards for television systems based on a new paradigm - appropriate for today - with forethought to future requirements. The introduction of digital technology into imaging industries, together with the widespread introduction of digital communications, creates a window of opportunity to establish a digital image architecture with unprecedented freedom of application and interconnection.
This report examines some of the fundamental issues that must be addressed in achieving a compatible set of standards enabling a globally interconnected and interoperable visual communications network. The essential concepts for this family of standards include: an open (non-proprietary) system architecture, interoperability, scalability and extensibility. It is hoped that this Report will stimulate the interest of many groups and organizations involved in the establishment of imaging standards, today and in the future, and lead to agreement on a single system, flexible enough to accommodate a wide variety of needs, while enabling worldwide interoperability.
The report was prepared by the SMPTE Task Force on Digital Image Architecture and is responsive to the Work Assignment, dated April 1991, which established the following objective:
The Report is, in essence, the outcome of a feasibility study concerning the creation of standards for digital image systems that are scalable and extensible, effecting a high level of interoperability between a diverse range of industries and applications. The work is, as yet, incomplete; however, it has already established an important though preliminary basis for a family of digital imaging standards. The Report raises many new questions and identifies additional work required to refine the concepts that form the basis of a digital image architecture. Of particular importance will be the selection of source and display refresh rates to provide performance and economic compatibility with today's television systems.
The concepts outlined can provide a basis for a modular open system architecture, in which the parameters and characteristics for each module, and the interfaces between these modules, are clearly defined and in the public domain.
Such a system should use common standard components to serve diverse needs across all affected industries. It should enable the movement of image data across application and industry boundaries without degradation and with minimum complication. This is interoperability.
Such a system should also provide the ability to adjust image parameters - temporal and spatial resolution, colorimetry and dynamic range - by varying the amount of data that is stored, transmitted, received, or displayed. This is scalability.
A digital image architecture must give forethought to evolution - to incorporate advances in technology within any module, without changes to any other module. It must be backward compatible with today's systems, and forward enabled to accommodate the technology explosions of the 21st century. This is extensibility.
The Report was prepared by a Task Force chaired initially by David Trzcinski (PictureTel) and latterly by Dr. Will Stackhouse (Jet Propulsion Laboratory), with wide participation from the computer, television, post production and telecommunications industries. A detailed list of the membership follows. The Report was considered by the SMPTE Standards Committee at its meeting of August 13th, 1992 and subsequently adopted after an in-depth review.
Will Stackhouse (Chair), JPL; Walter Bender, MIT; Craig Birkmaier (Editor), PCUBED; Rita Brennan, Apple Computer; Wayne Bretl, Zenith; Barry Bronson (Co-chair), Hewlett-Packard; Ken Davies (Ex Officio), CBC, SMPTE; Gary Demos (Co-chair), DemoGraFX; Hugo Gaggioni, Sony; Bill Glenn, Florida Atlantic University; Bob Keeler, AT&T, Bell Labs.; Thomas Leedy, NIST; Peiya Liu, Siemens; Lee McKnight, MIT; Robert Powers, MCI Telecomms; Tom Meyer, Duir Assoc.; Alan Reekie, European Community (CCE); Richard Solomon, MIT; Arpad Toth, Kodak; David Trzcinski, PictureTel; Mitchell Wade, DemoGraFX; Ken Yang, Ampex
Stan Barron, NBC, SMPTE; Si Becker, SMPTE; Rex Buddenberg, Consultant; Robert Burroughs, Panasonic; David Carver, MIT; Peter Dare, Sony; Phil Dodds, IMA; Charles Fenimore, NIST; David Fibush, Tektronix, SMPTE; Paul Fleischer, Bellcore; Branko Gerovac, DEC; Barry Gilbert, Mayo Foundation; Christopher Hamlin, Apple Computer; David Herbine, NADC; Clark Johnson, Consultant; Thomas Leeder, NIST; Bijoy Khandheria, Mayo Foundation; Edward Krause, General Instrument; Arvid Larson, IEEE-USA; Derrick Lattibeaudiere, Panasonic; Richard Lau, Bellcore; Bernard Lechner, Consultant; Michael Liebhold, Apple Computer; Henry Meadows, Center for Telecomm. Research; Francois Michaud, CBC; Marvin Mitchell, Mayo Clinic; Robert Morrow, USAF Academy; Robert Myers, Hewlett-Packard; Suzanne Neil, MIT; Bruce Penney, Tektronix; Ken Phillips, Citicorp; Ed Post, Quark; Charles Poynton, Sun; Glenn Reitmeier, DSRC; Robert Sanderson, Kodak; William Schreiber, MIT; Scott Silver, Tektronix; John Sprung, Viacom; David Staelin, MIT; Peter Symes, Grass Valley Group; David Tennenhouse, MIT; Greg Thagard, CST; Mark Urdahl, IBM; John Weaver, Liberty Television; Merrill Weiss, Consultant
Norbert Gerfelder, Fraunhofer Computer Graphics, ISO/IEC; Rainer Hofmann, Fraunhofer Computer Graphics, ISO/IEC; Detlef Kroemker, Fraunhofer Computer Graphics, ISO/IEC
The Task Force, formed from representatives of the affected industries and applications, has examined the issues, setting out those that are believed critical at this time, and has modelled, for discussion, further refinement and testing, one possible approach that meets the basic requirements. It has also produced extensive tutorial information concerning the matters under consideration.
The Key Concepts of the approach are defined in Section 2, setting the conditions for image systems that are:
Section 4.0 details the critical issues in the development of a suitable image architecture meeting the stated objectives:
A model of an open architecture approach to image standards is developed in Section 5.0, one that is both compatible with the present and extensible to the future. It is based on a low order hierarchical approach, using image tiles. The model defines four levels of resolution and takes account of a number of possible aspect ratios currently in use. Additional analysis is provided regarding the selection of an appropriate family of image acquisition rates and display refresh rates. Finally a scalable coding approach is proposed that offers the ability to produce image data in packages that can be combined to produce images at a variety of spatial and temporal resolutions.
The work of the Task Force is expected to be of interest across a wide range of industries and applications. Section 6.0 examines the industries likely to be most affected, their specific imaging needs and the possible impacts of a defined digital image architecture.
In Section 7.0 the Task Force suggests additional work that must be completed to move towards a full implementation of the digital image architecture. The list of suggestions included in Section 7.0 is not exhaustive; it is recognized that in the process of validating the architectural concepts, additional areas for further analysis will be identified. An extensive list of questions is included which should be considered in the process of establishing standards for an architecture.
The suggestions include the following items of high priority:
Two reference documents were utilized in the process of creating the definitions which follow:
One of the major objectives of this Report is to define a system architecture which promotes sharing of images and equipment across applications and industry boundaries. To achieve this goal, the digital image architecture must be highly flexible to deal with a variety of diverse requirements, including the evolution of technology.
A Digital Image Architecture should be an open system, that is, one made up of functional modules with standard, public interfaces which can be assembled into a functional system: "a set of interconnected elements constituted to achieve a given objective by performing specified functions." Explicit objectives of the architecture include:
This requires careful attention to the definition of the interfaces -- the shared boundaries -- between the functional modules.
The key interface definitions are
Scalability deals with the ability of an imaging system to adjust the level of performance by varying the amount of data that is stored, transmitted, received, or displayed -- up to the maximum resolution that was originally acquired. A number of specific definitions are implied:
Extensibility implies designing evolution into the system. The transmission and display modules of the system should be cast as building blocks. The building blocks, because of their inherent modularity, may freely evolve over time.
However, in the past few years the evolutionary view of imaging systems has been challenged. At the 26th Annual SMPTE Advanced Television and Electronic Imaging Conference, John Watkinson suggested that we analyze the impact of digital technologies from another perspective: "To think that digital technology only impacts the underlying equipment and that otherwise it's business as usual is to miss the larger transformation that is occurring in each of the affected industries."
From Watkinson's perspective, the transition to a new digital imaging architecture represents the opportunity for a new paradigm. Proponents of this position have encouraged system designers to step back and take a global view of the impact that digital technologies are having on every industry that deals with electronic imaging; to think not just in terms of delivering ever-improving levels of image quality, but to consider what being digital really means.
John Naisbitt, in his 1982 best seller Megatrends: Ten New Directions Transforming Our Lives, stated that new technologies go through three phases as they become part of our daily lives. Applying Naisbitt's model to the evolution of electronic imaging systems leads to the following three paradigms:
A major factor has been the geometric progression in computer processing capabilities - doubling computational power every two years, with little change in cost or size. This progression is projected to continue well into the next century. As a result, high resolution still image processing capabilities are now within reach of every computer user. Techniques once reserved for high-end workstations are now commonly applied in desktop computing, including the recent addition of full motion video as a data type.
Video has also been a major beneficiary of the technology progression. Production systems that only a decade ago required a six foot rack of electronics can now be implemented in a few rack units - or on a few cards that plug into a personal computer.
The tremendous increase in computational power has enabled another critical aspect of being digital - video encoding based on the use of digital compression techniques to reduce the required data rate. A variety of compression technologies have evolved that remove image redundancy within and between video frames. The required data rate may also be significantly reduced by more efficient coding of the image at the source. Developments of such techniques are progressing rapidly and may become useful in the near future.
While compression technology has existed for many years, and continues to evolve, practical implementations for video have only become possible in the past few years due to the rapid evolution of digital processing technologies. This in turn has stimulated new research into scalable video encoding techniques that will allow multiple levels of image quality to be extracted from a single image data stream. Some observers predict that the processing power required for the decoding of scalable digital video streams will be universal and inexpensive before the end of this decade.
Improvements in data compression perform the same function as increases in bit carrying capacity in the communications system - delivery of more bits to the user. In the past decade, increases in communications capacity of several orders of magnitude have occurred.
In such an environment, the longevity of new equipment purchases may be dependent upon a digital image architecture that is designed with adequate provisions for extensibility. To meet this objective the Task Force has focused its attention on three areas:
As research has revealed more about the physiology of vision, prevailing theory has evolved, placing major emphasis on the computational and cognitive role played by the brain and local image receptors. In turn, this research is providing potentially valuable input to the designers of digital imaging systems.
The eye contains approximately two million cones and 120 million rods. The cones are organized into three broad groups of receptors that are sensitive to light in specific spectral bands; while these bands have significant overlaps, they roughly conform to the red, green, and blue portions of the spectrum. Red and green receptors each outnumber blue receptors by a factor of two to one. The dispersion of these receptors is not uniform, thus spatial perception deals with a complex matrix of receptor types and cognitive processing by the brain.
The center of the visual field, an area called the fovea, contains 30,000 to 40,000 cones and no rods. Outside the fovea the density of cones diminishes, interspersed among the high density rods. The cones within the fovea are responsible for high spatial detail perception while the extrafoveal cones and rods play an important role in visual search and influence directed eye movement. Central vision enables us to see detail, while peripheral vision is attuned to change.
Although high spatial resolution vision is restricted to the fovea, the visual system acquires high resolution images over a wide portion of the field of view. This is achieved through involuntary eye movements: high frequency tremor, slow drift, and rapid saccade.
Research has determined that it takes several hundred milliseconds for the eye to acquire a high spatial resolution image, synthesized from a number of overlapping views. Slow drift and rapid saccade are the mechanisms used for repositioning the fovea to acquire these multiple impressions. The tremor appears to be a mechanism to remove high frequency spatial noise. The tremor's oscillation occurs at a frequency range of 40 to 80 Hz over an area approximately equal to the size of a single cone.
Since human vision is binocular, involuntary eye movements also contribute to depth perception: the brain processes these overlapping views to obtain differences from which depth and spatial properties are inferred.
The spatial resolution of moving objects is also linked to eye movement:
There is evidence that the brain directs the activity of the image receptors for processes such as establishing white balance and light sensitivity levels. Simple localized analyzers are used to enhance the data transmitted back to the brain. Some of these analyzers are sensitive to a particular edge orientation; there are sufficient analyzers at each location to represent a full set of edge orientations. Additional tuned analyzers cover portions of the range of human sensitivity for spatial frequency, spatial position, temporal frequency, direction of motion, and binocular disparity.
The data processed by these analyzers moves to the brain through two types of channels: a set of fast responding channels with relatively transient responses to stimuli, and a set of slower channels with relatively sustained responses to stimuli. Transient channels process the output of analyzers that are tuned for low spatial and high temporal frequency stimuli. Sustained channels process the output of analyzers that are tuned for high spatial and low temporal frequency stimuli.
Transient channels are sensitive to flickering light sources with low spatial resolution; this type of stimulation appears as wide-area flicker and is most noticeable in peripheral vision. At low levels of illumination (where rod vision is used) flicker fusion occurs at frequencies of only a few Hz; as the level of illumination increases and cone vision is triggered the fusion frequency increases.
Flicker from low light level sources such as a television or movie screen typically disappears in the range of 20 to 60 Hz. As screen size increases, taking up a larger portion of the field of vision, or if screen brightness increases, the frequency for flicker fusion increases.
Sustained channels are sensitive to flickering light sources with high spatial resolution; this type of stimulation appears as small-area flicker, often associated with moving objects. In this case the flicker fusion frequency can be much higher than for wide-area flicker; this form of flicker manifests itself as strobing of the object.
An excellent example is found in the single pixel horizontal lines often used in computer graphics. These lines do not appear to flicker on a progressive scan computer display which is refreshed at rates above 60 Hz; but if the same image is presented on an interlaced video display the single pixel lines are presented in every other field (at 30 Hz) and they flicker. This is due to the fact that the persistence of the display phosphor is shorter than the refresh interval; higher scanning rates (either progressive or interlaced) eliminate the flicker.
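The arithmetic behind this example can be sketched as follows. This is only an illustration: the function name and the interlace_factor parameter are assumptions introduced here, not terms from the Report.

```python
def line_refresh_hz(vertical_rate_hz, interlace_factor=1):
    """Rate at which any one scan line is redrawn.

    vertical_rate_hz: vertical scans (frames or fields) per second.
    interlace_factor: 1 for progressive scan, 2 for 2:1 interlace
                      (assumed parameter, for illustration only).
    """
    if interlace_factor < 1:
        raise ValueError("interlace_factor must be >= 1")
    return vertical_rate_hz / interlace_factor


# A single-pixel line on a 60 Hz progressive display is redrawn on every scan,
# but on a 60 Hz 2:1 interlaced display it appears only in every other field.
progressive = line_refresh_hz(60, 1)
interlaced = line_refresh_hz(60, 2)
```

Under these assumptions the interlaced case yields a 30 Hz line refresh, which falls within the 20 to 60 Hz range where flicker can still be visible.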
In order for a new digital image architecture to be interoperable it must deal with existing imaging technologies. This requirement can place many constraints on the design of the architecture. It is important to understand the reasons that these constraints exist to determine if the new architecture must be similarly constrained.
As the display covers a wider field of view at higher levels of brightness, the refresh rate must be increased to eliminate wide-area flicker. If information with high frequency edges, such as computer generated text and graphics, is presented on the display, it must also be refreshed at a higher rate. The computer industry uses progressive scanning with refresh frequencies above 60 Hz to eliminate flicker; larger displays (16 inches diagonal or more) are typically refreshed at 72 or 75 Hz.
The same requirements for the elimination of wide-area flicker are now starting to influence the development of display systems for home entertainment. At the higher end of the home entertainment market it would be desirable for displays to provide a 50 degree field of view, and be viewable at normal room ambient light levels. Such a display has resolution and refresh requirements nearly identical to a large personal computer display.
The choice of a Digital Image Architecture has implications that reach far beyond the normal realm of standards-setting activities. Telecommunications, television, and computing have made major impacts on life in the 20th century -- their integration is likely to have a profound effect on the way that the world communicates, is educated, works, plays and relaxes in the next century.
In addition to holding perceived resolution constant under varying viewing distances, it is considered desirable to provide even greater resolution in some applications, as discussed below and as implemented in current proposals for advanced television systems.
While it would be desirable to design an imaging architecture in which resolution could be scaled in a continuous fashion, a hierarchy based on a progression of related image resolution levels can provide similar benefits to system designers and simplify the process of interoperation. Section 5.2 and Section 5.3 provide a detailed analysis of the variables that affect the perceived resolution of a display and illustrate the principles of a hierarchical digital image architecture with a progression of four image resolution levels.
Throughout this report, the concept of a multi-resolution hierarchy will be discussed and refined. The Task Force has constructed a model to facilitate this discussion. It is recognized that many different sets of numbers can be used within this model. Four levels of resolution have been identified and defined; additional levels can be added to the progression, as enabling technologies allow support for higher levels of resolution. The four levels in the model are:
The evolution of electronic image acquisition systems has been driven primarily by the mass market transmission standards -- NTSC, PAL and SECAM. New applications for video such as professional and personal video systems have been enabled through the economies of scale associated with these standards.
Thus, applications which require higher resolutions than those offered by NTSC, PAL and SECAM have either been forced to bear the expense of system development and low volume manufacturing - a luxury primarily reserved for the military - or to wait for the next imaging standard to evolve. It is interesting to note that the equipment developed for the various analog HDTV systems has seen extensive use in professional applications that need the added resolution afforded by these systems.
The first two steps are
Delivering the imagery to the consumer typically involves the third step,
Finally, the imagery must be decoded for display, requiring
Some of these steps tend to be grouped with a specific level of storage quality, as illustrated in Figure 3.1. This allows a further simplification of the model based on three major system components - ACQUISITION, TRANSMISSION, and DISPLAY.
The advent of video recording provided a degree of decoupling of acquisition from the other components, allowing program producers to create program content without real-time constraints; however, transmission and display remain tightly coupled. Recording media for program content have typically been coupled to the transmission standard to take advantage of the bandwidth reduction techniques applied in the system. The design of consumer VCRs is based on compatibility with the transmission standard; packaged media played by the VCR must therefore conform to the same standard.
While interoperability between the various analog composite video systems has had to overcome differences in frame and line rates, these systems have been remarkably extensible. The acquisition, transmission and display components and the associated services of the system have evolved continuously over the past fifty years.
With the introduction of analog component video recording and processing systems in the '80s the video industry took a major step toward completely decoupling acquisition from transmission and display. The production community soon discovered the advantages of this decoupling.
By using analog component equipment for both acquisition and production, it became possible to edit video without concern for the multi-field color framing sequences that exist in subcarrier encoded composite video systems. Producers also discovered that fewer artifacts were introduced when layering video using component vision mixers and digital video effect systems. Decoupling of acquisition and production equipment from the encoded transmission standard produced far better results than could be achieved with composite video acquisition and production equipment - and the same video recorders also produced encoded outputs for transmission of the program.
To a large extent, the transition from the analog representations of printed media - type, line art, halftones, and color separations - to their digital counterparts, has been enabled by the use of scalable hierarchies for the acquisition, transmission, and display of printed materials. The tools for acquisition and production of print media have been separated from the display hierarchy, allowing output at the desired level of resolution.
Electronic transmission is also beginning to play a major role in the publishing of documents. Compact representations of printed media, using page description languages, have allowed high quality print representations to be moved efficiently through the telecommunications network using low data rate modems. Remote printing of documents on fax machines or networked printers is commonplace.
The desktop publishing metaphor has been used as a model to predict similar transitions in other media industries, most notably Desktop Video. However, the transition has not occurred at the pace that many industry pundits have predicted. This is due, in large part, to the difficult task of breaking the problem up into manageable components. That is, to create separate hierarchies for acquisition, transmission, and display of motion imagery.
Interoperability of video systems with other media is facilitated by a complete decoupling of acquisition, transmission and display into separate hierarchies for each component. Such an architecture is depicted in Figure 3.3. Scalable representations of video will be enabled by this decoupling, and technological advances in one hierarchy can take place without upsetting the apple cart in the other two.
If a hierarchical digital imaging architecture is used as the model, a Digital Advanced Television System can be implemented that is equally adept in delivering low cost solutions that conform to a single point in the hierarchy, as well as more expensive scalable solutions that support multiple points in the hierarchies.
The acquisition hierarchy can provide image capture solutions at various price/performance points that are appropriate for the application. Production systems can evolve that deal with single image formats, or multiple formats within the hierarchy. This is of particular importance to producers of program content with significant archival value. Imagery can be captured at a higher level in the acquisition hierarchy with an eye toward distribution at one or more of the lower levels of the transmission hierarchy; the archival value of the program is protected as it can be released at higher quality levels in the future as consumers purchase products at a higher level in the display hierarchy.
Viewing transmission as a hierarchy is critical to the concept of interoperability. A hierarchical imaging architecture would support a progression of image quality levels that are interoperable and extensible, and allow for incremental improvements in image quality within a single transmission standard. This requires the use of a scalable encoding structure; a core image would be encoded at the first level of the hierarchy, and enhancement information would be encoded for each of the higher resolution levels supported by the transmission standard.
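One way to picture a core image plus enhancement information is a simple residual pyramid. The sketch below is a minimal illustration under stated assumptions, not the coding approach proposed by the Task Force: it assumes a greyscale image held as a list of rows of integers, halves resolution by 2x2 averaging, and predicts each higher level by pixel replication. All function names are hypothetical.

```python
def downsample(img):
    """Halve resolution by averaging each 2x2 block (integer floor)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample(img):
    """Double resolution by pixel replication."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def encode(img, levels):
    """Return (core, enhancements); enhancements[0] refines the core.

    Assumes both image dimensions are divisible by 2 ** (levels - 1).
    """
    enhancements = []
    current = img
    for _ in range(levels - 1):
        smaller = downsample(current)
        predicted = upsample(smaller)
        residual = [[c - p for c, p in zip(cur_row, pred_row)]
                    for cur_row, pred_row in zip(current, predicted)]
        enhancements.insert(0, residual)  # keep lowest-resolution layer first
        current = smaller
    return current, enhancements


def decode(core, enhancements, level):
    """Reconstruct at `level` (1 = core only) using only the layers needed."""
    img = core
    for residual in enhancements[:level - 1]:
        predicted = upsample(img)
        img = [[p + r for p, r in zip(pred_row, res_row)]
               for pred_row, res_row in zip(predicted, residual)]
    return img


image = [[row * 4 + col for col in range(4)] for row in range(4)]
core, layers = encode(image, 3)
full = decode(core, layers, 3)    # uses the core and both enhancement layers
medium = decode(core, layers, 2)  # uses the core and one enhancement layer
```

In this sketch a receiver that stops at a lower level simply ignores the remaining enhancement layers, which is the property the paragraph above attributes to a scalable encoding structure.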
A scalable encoding structure may be more difficult to design and possibly less efficient for a given quality level than an encoding designed specifically for that level. It has, however, several advantages that will accrue over time:
The display hierarchy allows for a variety of products to evolve at various price/performance points that are appropriate for the application. Some display systems will evolve to single performance levels while others will offer multiple levels of performance within the transmission and display hierarchies.
Scalability plays a major role in the design of decoder and display components. If the transmission system delivers a scalable payload, only that portion of the information which is required for the display system need be decoded. A small personal information system may only need the low resolution component while a high-end home entertainment system can utilize all of the resolution components.
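As a rough sketch of this idea, a decoder could filter a scalable payload by a level indicator carried with each packet. The Packet type and its level field are hypothetical and introduced here for illustration only; the Report does not define a packet format.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    level: int      # hypothetical header field: 1 = core, higher = enhancement
    payload: bytes


def select_for_display(packets, display_level):
    """Keep only the packets a display at `display_level` needs to decode."""
    return [p for p in packets if p.level <= display_level]


def payload_bits(packets):
    """Total payload size in bits for the given packets."""
    return 8 * sum(len(p.payload) for p in packets)


stream = [Packet(1, b"ab"), Packet(2, b"cdef"), Packet(3, b"ghijklmn")]
personal = select_for_display(stream, 1)       # low resolution component only
home_theatre = select_for_display(stream, 3)   # all resolution components
```

Counting the selected payload bits shows how much less data the small display has to process than the high-end system.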
The current pricing structure for broad band telecommunications is typically based on channel bandwidth - the purchaser uses and pays for the entire channel regardless of the amount of information moved through it. In the future, greatly increased channel bandwidth and packetized encoding schemes using headers/descriptors for packet identification will cause a shift in pricing structure - the purchaser will pay only for the information content that moves through the channel. This concept when applied to video services has been described as pay-per-view-per-bit.
This shift in pricing structure is likely to act as a catalyst for the rapid evolution of video compression techniques and transmission standards, with an emphasis on two areas:
Programmable decoders will be the key component in providing extensibility to the digital imaging architecture. Because of the diversity of image compression standards (Group 3 fax, H.261, JPEG, MPEG, DVI, etc.), these decoders will play an important role in the integration of video and high resolution imaging with desktop computer workstations. This same diversity, with the addition of a digital television standard (or standards) will lead toward the use of programmable decoders in home entertainment and information delivery systems. Essentially fixed solutions will drive the low end of the market, providing inexpensive mass market consumer products, while programmable solutions will dominate at middle and upper levels of the transmission and display hierarchies.
The characteristics of LCD displays are significantly different from flying spot scanning CRT displays. Flying-spot systems must operate at refresh rates above the critical frequency for flicker fusion; display brightness is limited since the spot is the only source of illumination (most of the display is decaying at any point in time).
Every pixel in an LCD display receives constant illumination. LCDs can be characterized as having long persistence; in fact, a significant design challenge has been to provide faster pixel response to deal with full motion video. This has been accomplished through the use of a transistor at each pixel location (an active matrix display), providing rapid response for pixel replenishment.
The nature of the active matrix circuit also allows a pixel value to be held for at least one second without replenishment, giving the display characteristics similar to a frame buffer. Direct addressing of each pixel location would make it possible to update only those pixels which change from one refresh period to the next. Transmission systems that utilize digital compression techniques to eliminate interframe image redundancies may take advantage of these aspects of LCD displays to implement conditional replenishment.
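Conditional replenishment can be sketched as sending only the pixels that changed since the previous refresh and writing them into a held frame. This is a minimal illustration assuming greyscale frames stored as lists of rows; the function names and the threshold parameter are assumptions, not part of any system described in the Report.

```python
def changed_pixels(previous, current, threshold=0):
    """List (row, col, value) for pixels that differ by more than threshold."""
    updates = []
    for y, (prev_row, cur_row) in enumerate(zip(previous, current)):
        for x, (p, c) in enumerate(zip(prev_row, cur_row)):
            if abs(c - p) > threshold:
                updates.append((y, x, c))
    return updates


def apply_updates(frame_buffer, updates):
    """Write the updates into a held frame in place; other pixels are untouched."""
    for y, x, value in updates:
        frame_buffer[y][x] = value
    return frame_buffer


previous_frame = [[0, 0], [0, 0]]
current_frame = [[0, 5], [0, 0]]
updates = changed_pixels(previous_frame, current_frame)

held = [row[:] for row in previous_frame]  # the display keeps its last values
apply_updates(held, updates)
```

Only one of the four pixels is transmitted in this example; the display's held values supply the rest.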
Over the next 10 to 15 years image acquisition and display technologies are likely to move to conditional replenishment. Image acquisition systems may evolve with on-board digital processing to implement conditional image acquisition. These cameras will be programmable, offering several advantages over scanning cameras that continuously update the entire image raster, including the ability to:
Backward compatibility with existing systems and extensibility to future systems present many technical challenges. The greatest challenge lies in preserving the value of existing infrastructures while enabling an orderly transition to the new architecture. For example, immense investments have been made in the acquisition and transmission infrastructures of our existing NTSC, PAL and SECAM television systems. Likewise, billions of consumers have invested in receivers and video recorders that support these systems. It is equally critical that investment in the vast archives of information and entertainment programming that exist today on film and video be protected, and that the new architecture unlock the economic potential of these archives.
In deliberating on these critical issues, every effort has been made to balance the interests arising from those investments with the future benefit to all of a single global standard. These deliberations have also taken into consideration the installed base of computer, medical, engineering and scientific imaging systems, and the diverse applications for still imaging in electronic publishing, visual databases and communications. Existing systems that demonstrate interoperability and extensibility - including some which have in fact been extended - were considered. Examples include the French Minitel system and the family of international facsimile standards.
The seven critical issues are:
Scalable and interoperable hierarchies offer many benefits when communications channel issues are considered. Such an approach promotes effective utilization of existing communications channels and the development of new broadband communication services. The lower levels of the hierarchy provide solutions for the capacity constrained channels that exist today. The introduction of new broadband communications services will enable the use of higher data rates to support the improved performance available at higher levels in each hierarchy.
A digital image architecture that provides interoperability across applications with different spatial resolution requirements must be scalable in terms of resolution as discussed in Section 3.3. Interoperability also requires a family of related image acquisition and display rates. The greatest benefit, in terms of cost and simplicity, is gained when the display operates at the same rate as, or an integer multiple of, the image acquisition rate. Though more expensive to implement, the greatest performance benefit is gained when motion compensation techniques are used in encoders/decoders to create in-between frames for display. Section 5.4 discusses the requirements for such a family.
To facilitate this hierarchical approach to a digital image architecture a scalable approach to image coding is required. Furthermore, improved techniques for video compression are likely to be enabled by the geometric progression in computational hardware. The design of the architecture must make provisions for this progression. Section 5.5 discusses the use of scalable coding algorithms.
No topic generated as much discussion in the Task Force as image acquisition and display refresh rates. This is due in part to the diversity of rates that exist in the standards and resulting practices within each of the affected industries. The issue is further complicated by the evolution of television down parallel paths with respect to field rates. Their harmonization will require solutions that lie in the realm of digital technology as well as the realm of politics and negotiation.
The choice of an image acquisition rate is a tradeoff between motion rendition and the resulting data rate. The following considerations are important in establishing a family of acquisition rates.
Experience has shown that for wide-screen CRT displays of high brightness, a refresh rate in the region of 72 to 75 Hz is required to achieve tolerable levels of wide-area flicker (see Section 3.2.5). In some situations refresh rates in excess of 100 Hz may be desirable. Receivers which operate at 100 Hz (double the normal 50 Hz interlaced scan rate) are being introduced in the 50 Hz market; rate doubling receivers operating at 120 Hz are also being developed for the 60 Hz market.
The relationship of display refresh and image update rates should be based on a progression that permits non-interpolative transformations between the acquisition and display rates in the new architecture (i.e., display at integer multiples of the image update rate). As an example, theatrical display of film is usually double or triple shuttered to minimize wide-area flicker of the display.
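A non-interpolative transformation of this kind amounts to showing each acquired frame a whole number of times, much as a double or triple shutter does for film. The following sketch illustrates the idea; the function name `repeat_for_display` and the use of integer rates are assumptions made for the example.

```python
def repeat_for_display(frames, acquisition_hz, display_hz):
    """Repeat each frame so the display runs at an integer multiple of the acquisition rate."""
    if display_hz % acquisition_hz != 0:
        # A non-integer ratio would need an uneven cadence or interpolated frames.
        raise ValueError("display rate is not an integer multiple of the acquisition rate")
    repeats = display_hz // acquisition_hz
    return [frame for frame in frames for _ in range(repeats)]


# Two frames acquired at 24 fps, shown on a 72 Hz display (the triple-shutter case).
shown = repeat_for_display(["A", "B"], acquisition_hz=24, display_hz=72)
```

With 24 fps material, a 48 Hz or 72 Hz display needs no new frames to be synthesized, whereas a 60 Hz display is rejected by this sketch because 60 is not an integer multiple of 24.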
Further research into the choice of a single family of acquisition rates and display rates is required. An appropriate interoperable family should include a 24 or 25 fps image acquisition rate which would enable a 72 or 75 Hz display refresh rate. This is the subject of further discussion in Section 5.0, Section 7.0 and
4.4 The Use of Square Sampling Grids (Square Pixels)
The computer graphics, image processing, and publishing industries have adopted the use of geometrically square pixel sampling grids (frequently simply referred to as square pixels). The use of square pixels facilitates:
Instead, computer graphics gravitated towards a common display technology based on square pixels. This simplified system design, which led to lower cost and better performance, enabled equipment and services to be used as commodities across a broad set of industries. Today the computer industry is a major consumer of displays, second only to consumer television receivers.
The use of a common pixel geometry eliminates the need for interpolative resampling when sharing imagery among all users. Resampling has two costs:
In the future it may be possible and desirable to extend the colorimetry representation to include a wider range of colors, possibly even including those of self-luminous objects, as one example. A close examination of this issue is needed to establish the range of colors to be represented within the colorimetry of the digital image architecture.
A similar situation to that of colorimetry exists for the representation of dynamic range transfer function. Current systems are individually optimized for the current technology and application and are not easily amenable to an increase in dynamic range. Mechanisms to effectively handle a much wider dynamic range need to be identified.
The situation is somewhat similar to that of motion picture film in which the latitude of the negative film enables exposure and color adjustment after the image capture and the S-curve of the film characteristic provides effective compression of the highlights and dark regions. Similar provisions may be required in digital image systems to provide reasonable representations for both small and large numbers of bits. A further consideration may concern the optimal distribution of any necessary compression/expansion in respect of overall image quality.
It is also important that images of differing colorimetry and dynamic range at the acquisition device should be able to be combined effectively into a single image, when appropriately scaled.
The color space and dynamic range representations that could meet these objectives require extensive consideration. Section 7.7 includes a number of questions that should be considered in the analysis of these and other colorimetry issues.
In Section 3.2.5 it was established that higher scanning rates are required with displays that cover a wider field of view and/or operate at higher levels of brightness than today's television systems. Decoupling the refresh rate of the display from the image update rate provides a mechanism to deal with wide-area flicker - this is discussed in
4.7 Identification of the Characteristics of a Digital Image Stream (Header/Descriptors)
A fundamental prerequisite for interoperability in digital systems is a mechanism for identifying and describing digital image data. For this information to be shared, decoders must be capable of identifying and conforming to the incoming data. Even simple decoders - those that only recognize a single standard - must identify data streams which they can decode. This is one of the primary functions of the header. Decoders must also ignore unrecognized data, to allow for extensions to the data stream.
Descriptors provide application oriented information, such as image and coding parameters, processing history, identification of program content, copyright, and scrambling. They also enable extensibility; the descriptor may also contain the coding algorithm or language representation necessary to interpret the encapsulated data. This provides a mechanism whereby expert groups can create and standardize the transmission of messages to meet their needs.
Descriptors may be used to identify and describe data at different levels of an image hierarchy, thus allowing a display system to decode only that part of a stream necessary for its function or capability. Descriptors might also contain information about the preferred display characteristics for imagery.
Thus information such as the colorimetry of the original acquisition system, and the transfer characteristics of the process used to move images from one medium to another, can be included with the data. Decoders would use this information to optimize display of the image.
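The two decoder obligations described above, identifying a stream it can decode and ignoring descriptors it does not recognize, can be illustrated with a small tag-length-value parser. The byte layout below is entirely hypothetical and is not the SMPTE Header/Descriptor format: the `DIMG` identifier, the tag numbers, and the function names are assumptions invented for the sketch.

```python
import struct

MAGIC = b"DIMG"                                     # hypothetical stream identifier
KNOWN_TAGS = {0x01: "image_size", 0x02: "copyright"}  # hypothetical descriptor tags


def descriptor(tag, payload):
    """Build one record: 1-byte tag, 2-byte big-endian length, then the payload."""
    return struct.pack(">BH", tag, len(payload)) + payload


def parse_stream(data):
    """Return the recognized descriptors, or None if the header is not recognized."""
    if data[:4] != MAGIC:
        return None                                 # not a stream this decoder handles
    descriptors = {}
    pos = 4
    while pos + 3 <= len(data):
        tag, length = struct.unpack_from(">BH", data, pos)
        pos += 3
        payload = data[pos:pos + length]
        pos += length                               # the length lets us skip any record
        name = KNOWN_TAGS.get(tag)
        if name is None:
            continue                                # ignore unrecognized descriptors
        descriptors[name] = payload
    return descriptors


stream = (MAGIC
          + descriptor(0x01, struct.pack(">HH", 1024, 1024))
          + descriptor(0x7F, b"future extension")   # unknown to this decoder
          + descriptor(0x02, b"(c) example"))
parsed = parse_stream(stream)
width, height = struct.unpack(">HH", parsed["image_size"])
```

Because every record carries its own length, a simple decoder can step over the unknown `0x7F` record and still recover the descriptors it understands, which is the property that allows the data stream to be extended later.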
The SMPTE Task Force on Header/Descriptor in their Final Report dated January 3, 1992, and approved by the SMPTE Standards Committee on February 6, outlined the criteria for the use of Header/Descriptors. Work is now progressing on the development of proposed SMPTE Standards, Recommended Practices and Engineering Guidelines.
This is by far the most critical issue of all, so much so that its impact is clear in the discussion of many of the previous issues. Only the last of them, the use of headers/descriptors, is without precedent in existing entertainment industry practice. It is precisely where a dichotomy exists in current practice that the greatest controversy arises - on the issue of temporal rates.
Convergence on digital technology may provide the solutions that resolve the temporal rate issue: convergence around the common language of digital coding, the progression in CPU performance, and the ability to design inexpensive modular interfaces in the form of mass-produced microchips.
It is likely that a number of solutions will evolve to facilitate interoperability between the existing world of film and analog television, and the new digital image architecture. These solutions should provide a variety of price/performance options appropriate to the application requirements.
To illustrate the model, specific numbers have been chosen that take advantage of the mathematical relationships discussed in Section 4.0, as well as the architectures of digital memory and processing components. These numbers are not intended as the basis for a standard, but rather, provide a starting point, from which the validity of the architectural concepts can be verified. Further work is required for verification of the model and determination of the exact numbers, upon which a standard can be based (see Section 7.0).
The following parameters of a hierarchical digital image architecture are discussed in this section:
For a digital image architecture to be cast as an open system, two steps are required:
It can be argued that there is no need for rigid architectural standards in a digital world; that programmability in the transmission and display hierarchies provides a sufficient basis for interoperability. Perhaps some day this will be true. If the goal of longevity for the first digital image architecture is achieved, it is likely that the designers of the next imaging architecture will be less constrained than we are today.
The first digital image architecture however, must provide a bridge from the closed systems of the past to the open systems of the future. The fundamental structure of the digital building blocks and economies of scale associated with standardization suggest that the organizations charged with establishing these standards work in harmony.
If the resolution of a display is held constant and the viewing distance is a variable, the resolution perceived by the viewer - measured in cycles per degree - will increase as the viewer moves away from the display. Therefore, all displays can be considered to be high resolution if viewed from an appropriate distance.
At a distance that varies with the visual acuity of each individual, the actual resolution of the display equals the limit of that viewer's ability to resolve image detail. Beyond this viewing distance additional image detail cannot be perceived; that is, the display has more resolution than is required for this viewer and set of viewing conditions.
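The dependence of perceived resolution on viewing distance can be checked numerically. The sketch below assumes two pixels per cycle and measures the angle subtended by one pixel at the centre of a flat display; the function name `cycles_per_degree` and the 0.01 inch pixel pitch are illustrative assumptions, not figures from this report.

```python
import math


def cycles_per_degree(pixel_pitch, viewing_distance):
    """Perceived resolution at the screen centre; both arguments in the same unit."""
    degrees_per_pixel = math.degrees(2 * math.atan(pixel_pitch / (2 * viewing_distance)))
    return 1 / (2 * degrees_per_pixel)      # assumes two pixels per cycle


pitch = 0.01                                # inches, hypothetical display
for distance in (15, 30, 60):               # inches
    print(distance, round(cycles_per_degree(pitch, distance), 1))
```

For a fixed pixel pitch the result grows almost in proportion to distance, so the same display moves from one resolution class to the next as the viewer steps back.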
In some cases excess resolution may be desirable. For example, the operator of a personal computer can typically reduce the viewing distance to a high resolution desktop display by one-half, simply by leaning forward, thus taking advantage of additional resolution. In this case the change in viewing distance is large enough to be significant, while moving 15 inches in a movie theatre would have little effect on perceived resolution.
The NTSC transmission standard was designed to provide a resolution of approximately 21 cycles per degree over a viewing field of just under 11 degrees. Display size can be variable in today's television, ranging from a diagonal of a few inches (a personal display) to more than 30 feet (direct view displays in stadiums and projection displays in controlled lighting environments). These displays differ only in the size of their pixels. At the appropriate viewing distance, the perceived resolution of the personal display and the stadium display will equal the design goal of 21 cycles per degree, and both displays will cover 11 degrees of the observer's field of view.
Many display applications require higher levels of perceived resolution. To increase the level of perceived resolution, while holding viewing distance constant, additional samples of the same image must be added, increasing pixel density. To cover a wider field of view, as in wide-screen displays, holding the same viewing distance and perceived resolution, new information, at the same pixel density, must be added to extend the picture.
With personal, home entertainment and theatre displays, the viewer can vary the distance from the display, and thus vary the perceived resolution, over a significant range (see Figure 5.1). Taking into account the variations in acuity in the population, and variations in viewing distance for each application, it is common practice to design a display system for the average viewing conditions in each application. The overlaps in cycles per degree between low, normal and high resolutions are shown in the table to account for these variations.
Resolution    Cycles per Degree
Low           1 - 15
Normal        10 - 25
High          20 - 30
Ultra High    30 - 40
A special case exists for head mounted displays which provide a fixed viewing distance; here the display manufacturer must select the level of resolution appropriate for the application and then design for a specific perceived resolution.
Using these guidelines, a high resolution display designed for a 35 degree field of view would require about two thousand pixels per line at 30 cycles per degree. In a desktop computing application where the viewer is 30 inches from the display, the length of an active line (display width) would be about 19 inches. In an entertainment application, such as a consumer television receiver viewed from a distance of 108 inches (9 feet), the length of an active line would be about 68 inches.
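The figures in this example can be reproduced with a short calculation. The sketch assumes two pixels per cycle, a uniform number of pixels per degree across the field of view, and a flat display viewed on its axis; the function names are chosen for the illustration only.

```python
import math


def pixels_per_line(field_of_view_deg, cycles_per_degree):
    """Pixels needed across the field of view, assuming two pixels per cycle."""
    return round(field_of_view_deg * cycles_per_degree * 2)


def display_width(field_of_view_deg, viewing_distance):
    """Width of a flat display that fills the field of view at the given distance."""
    return 2 * viewing_distance * math.tan(math.radians(field_of_view_deg / 2))


pixels = pixels_per_line(35, 30)            # 35 degrees at 30 cycles per degree
desktop_width = display_width(35, 30)       # viewer 30 inches away
home_width = display_width(35, 108)         # viewer 108 inches (9 feet) away
```

Under these assumptions the pixel count comes to 2100 per line, and the widths round to 19 and 68 inches, in line with the values quoted above. The pixel count depends only on the field of view and the perceived resolution, which is why the desktop and home entertainment cases share it.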
These examples are illustrated in Figure 5.2. In this figure the principles described in this section are used to illustrate the relationships between the four resolution levels of the model hierarchy and a variety of display applications. The numbers, especially as they relate to image size (in pixels) are entirely relative; they serve only as examples of the pixel count required, at average viewing distances and fields of view, to achieve the specified perceived resolution.
It is important to note that seemingly diverse applications such as personal computer and home entertainment displays have similar resolution requirements as the size of the home entertainment display increases beyond the narrow field of view of today's television receivers. It is also important to note that direct view CRT displays (which are currently limited to around 40 inch diagonals) require resolution in the normal range for home entertainment applications.
It is noteworthy that such sequences also appear in the computer processor and memory component industry. This approach takes full advantage of the generic building blocks that are the driving force in the transition to a digital world.
In order to provide continuity between the various resolution levels of the hierarchy the model is based on the concept of an image tile. For the purposes of this discussion, a tile can be considered to be a constant portion of an image, representing the same part of the image regardless of the resolution level or image size. Thus, at each higher level in the hierarchy, the resolution within a tile doubles in each axis. This is illustrated in Figure 5.3.
The power of two progression may now be applied to determine the resolution, in pixels, for each level in the hierarchy.
Level   Name         Resolution in        Pixels in    Pixels in 32 x 32
                     Cycles per Degree    One Tile     Tile Superset
1       Low          1 - 15               16 x 16      512 x 512
2       Normal       10 - 25              32 x 32      1024 x 1024
3       High         20 - 30              64 x 64      2048 x 2048
4       Ultra High   30 - 40              128 x 128    4096 x 4096
In this model a tile spans 1/32nd of the image in each axis at any level of the hierarchy. Thus each level consists of a 32 x 32 set of tiles (see Figure 5.3). The selection of this fraction for a tile is arbitrary; it was chosen because it is a convenient building block - integer multiples can be used to construct displays at all of the aspect ratios and spatial resolutions discussed in the model.
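The power of two progression in the table can be generated from two constants, the 16 pixel tile edge at level 1 and the 32 tiles per axis of the superset. The sketch below does so; the function names and the 32 x 18 tile example are assumptions made for illustration and do not represent a proposed format.

```python
BASE_TILE_EDGE = 16      # pixels per tile edge at level 1 (from the table)
TILES_PER_AXIS = 32      # tiles per axis in the square superset


def level_geometry(level):
    """Return (tile edge, superset edge) in pixels for hierarchy levels 1 to 4."""
    tile_edge = BASE_TILE_EDGE * 2 ** (level - 1)
    return tile_edge, tile_edge * TILES_PER_AXIS


def image_size(level, tiles_wide, tiles_high):
    """Pixel dimensions of an image built from a whole number of tiles."""
    tile_edge, _ = level_geometry(level)
    return tiles_wide * tile_edge, tiles_high * tile_edge


table = [level_geometry(level) for level in (1, 2, 3, 4)]
wide_image = image_size(3, 32, 18)   # hypothetical 16:9 image at the High level
```

Because the same tile covers the same part of the picture at every level, a wider or taller image is described by its tile counts alone, and its pixel dimensions at any level follow from the tile edge for that level.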
The diagram in Figure 5.3 establishes several important relationships that provide a bridge to the past and illustrate how interoperability can be achieved:
Thus, using tiles and only four resolution levels, it is possible to construct a display for virtually every possible application; furthermore this display can also be used to show imagery from other levels of the hierarchy. This is especially practical if a scalable coding architecture is implemented that conforms to the same resolution progression.
Since significant archives of high resolution program material exist on film, which was acquired at 24 or 25 fps, one of these rates should be included in the progression. A progression based on integer multiples of 12 would include 12, 24, 36, 48, 60, 72, 96, 120 Hz, etc. A progression based on integer multiples of 12,5 would include 12,5, 25, 50, 75, 100, 125 Hz, etc. These progressions might also include integer fractions of 12 or 12,5 (e.g., 1/2 or 1/4 of the base frame rate) for applications such as videoconferencing and searching of video databases.
It has been common practice in Europe to display 24 fps film at 25 fps for compatibility with PAL and SECAM; this results in a 4% speed increase. Many European programs produced for television distribution are acquired at 25 fps; if the family of rates is based on 24 fps, these programs would be played 4% slower. As indicated in Section 4.8, further research is required to determine the impact of choosing one of these rates, on those industries that utilize film for image acquisition.
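The two candidate progressions, and the test of whether a display rate is an integer multiple of an acquisition rate, can be expressed with exact rational arithmetic so that 12,5 Hz is handled without rounding. The function names `rate_family` and `is_integer_multiple`, and the choice of the first ten multiples, are assumptions made for this sketch.

```python
from fractions import Fraction


def rate_family(base, multipliers=range(1, 11)):
    """Integer multiples of a base rate, kept as exact fractions."""
    return [base * m for m in multipliers]


def is_integer_multiple(display_rate, acquisition_rate):
    """True when the display rate is a whole multiple of the acquisition rate."""
    ratio = Fraction(display_rate) / Fraction(acquisition_rate)
    return ratio.denominator == 1


family_12 = rate_family(Fraction(12))        # 12, 24, 36, ... 120
family_12_5 = rate_family(Fraction(25, 2))   # 12.5, 25, 37.5, ... 125
half_rate = Fraction(12) / 2                 # an integer fraction of the base rate
```

In this sketch 72 Hz falls in the family based on 12 and 75 Hz in the family based on 12,5, matching the 24 fps and 25 fps acquisition cases, while a pairing such as 24 fps material on a 60 Hz display fails the integer multiple test.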
Ideally, compatibility with existing electronic imaging systems should be accommodated in the design of the standard modules that will interface these systems with the digital image architecture. By design, this would place the burden of compatibility on the systems that are being replaced rather than products that conform to the new architecture; thus the future will not be constrained by today's limitations.
In the process of developing the existing analog and digital high resolution television systems, the designers of these systems have demonstrated the practicality of such a modular approach to interoperability. A variety of translation devices have been demonstrated that allow interoperation between PAL, NTSC, HD-MAC and MUSE. The interface modules that will be required to transform the signals from these systems (especially NTSC and PAL) into the new architecture, offer the potential for large volumes. It is likely that the market for these modules will be characterized by intense competition, leading to a range of solutions at various price/performance levels.
In the near term the choice of a family of rates based on 12 or 12,5 Hz would provide optimally low cost and high performance, for both advanced television and computer uses, as well as providing global interoperability. In the longer term decoupling of acquisition, transmission and display is likely to lead to entirely new approaches to pixel replenishment that may render the current concept of image acquisition rates and display refresh rates meaningless.
This approach enables extensibility. For example, the coding of low resolution imagery might remain unchanged to provide compatibility with existing decoders, while new coding methods, made possible by the geometric progression in computational hardware, can be introduced to support more advanced imagery. Increasingly powerful (and affordable) programmable decoders can provide compatibility with the standards that form the foundation of the digital image architecture, and the additional processing power required for future enhancements to the architecture.
It is becoming difficult to draw the line, even today, between consumer electronics and computers. Today's video game machines, already in millions of homes, are marketed as consumer accessories to televisions, but are in fact, more computationally competent than personal computers of only a few years ago. Similarly, personal computers are being marketed to the home market through traditional consumer electronics channels.
Traditional business factors should always be considered. These include equipment replacement costs, amortization, benefits, competition, market needs, and access to material.
Successful industry participants will both pay close attention to emerging trends and help to bring them about. Sometimes, deep pockets may be required to create a market. (It took years of major losses in both equipment and programming efforts before color television became profitable.) In contrast, agreement on a common architecture across a wide range of industries and applications would spread the costs and encourage early adoption.
The groupings used for this report help to relate application requirements to industries. It is well understood that there is already much overlap between industry groups and applications.
The industry groupings are as follows:
The technologies used in these fields are highly dependent on downstream profits. It can be difficult to justify large investments (e.g., an HDTV production facility) in new technologies that can only be utilized by a small portion of their market. Smaller investments that require minimal infrastructure changes (e.g., MTS stereo, VHS-HQ) can be more easily justified, particularly when end-users can benefit with existing equipment or rapid upgrade is anticipated. Backward compatibility and extensibility are key issues here and can only be successfully violated when there are substantive benefits to the end user (e.g., audio compact disc).
Revenue streams can often be anticipated to flow well beyond the initial release of the product. Residuals from syndication, rentals, and sales require that providers anticipate future trends in end-user viewing equipment capabilities. This is one reason why most prime time television is shot on 35mm film and not video.
There is some effort to establish a video dialtone similar, in concept, to today's voice telephone dialtone. As communication networks increase bandwidth, and compression technologies improve, an increased use of remote real-time visual communications can be expected.
These same advancements also facilitate rapid downloading of video information from media servers. At a 100:1 compression ratio, the data for a typical motion picture could be transmitted in a few minutes over a video capable network.
Because of the universal proliferation and conversion standards for the telephone, it is likely that we will soon see extensions of current fax standards including: voice fax (voice mail), high resolution color image fax, and video fax (video mail). One of the driving forces behind the development of the JPEG image compression standard was the need for an efficient data reduction technique for the transmission of still images.
The telecommunications industry is well down the road in the establishment of digital imaging standards. The CCITT, which controls fax standards, worked with the IEEE on the JPEG standard and the videoconferencing standard, known as P.64 or H.261. These groups are also responsible for the MPEG family of moving picture standards. JPEG, MPEG I and P.64 form the basis for the first generation of image telecommunications products that are already starting to reach the market.
These standards were designed with a high degree of flexibility to deal with a variety of imaging applications; they have served as excellent examples for the Task Force in the area of interoperability, and scalability. Currently the MPEG group is working on extensibility; MPEG II is targeted for the delivery of higher quality motion image data streams in the range from two to forty megabits per second. The MPEG working group is investigating scalability as a requirement for this extension of MPEG. It would be beneficial for these new standards to relate harmoniously to other digital imaging architectures.
The merging of both broadcast and interactive voice, image (including graphics and video), text, and data across diverse transport media will create challenges in properly matching the information with the delivery mechanism. Current efforts to implement interactive television, for example, use differing transmission media for each direction (e.g., broadband in; telephone or cellular radio out).
Factors such as existing infrastructure, projected time and cost to deploy, bandwidth cost, regulatory issues, nature of the signal, target viewer, compression, error sources, localization, security, latency, etc., need to be considered.
The communications infrastructure deployed for the entertainment market could provide a profound leverage for the information domain. For example, a broad consumer demand for access to high bandwidth entertainment (and other) services could accelerate the national installation of fiber-optic cables. Once in place, these high bandwidth networks could also be used as high performance links to super-computers and very large data bases, and broadly distribute real-time business, engineering, and scientific data.
While installation of fiber-optic cable to a major user base can take many years, new or existing satellites can cover huge population areas very quickly. A variation of direct broadcast satellite (DBS) transmission is spot-beam satellite technology. In this approach, as few as three satellites could be used to provide localized high quality (HDTV) signals to small inexpensive receiving devices in as many as 150 geographic areas within a country the size of the continental United States.
The computer, medical, and graphics industries could similarly benefit from harmonious formats that would allow them to produce image generating, manipulating, managing, storing, and viewing applications and devices at reduced cost and increased interoperability.
Some specific industrial application areas include security equipment for surveillance and identification and product and process inspection.
This will create opportunities in the receiving devices, the electronic components that go into them (e.g., semiconductors, light sources and modulators) and the subsystems (e.g., displays, tuners, and signal processors). The likely emergence of new product categories can both heighten and personalize the entertainment experience.
Ancillary devices (e.g., tape and disc recorder/players, camcorders, editing, processing, sound systems, printers, scanners, interactive peripherals) will be additional sources of added value products.
It is likely that computer control technologies will play an ever increasing role in home entertainment and information systems. The integration of all of the equipment listed in the preceding paragraph in the home entertainment environment has proven to be a major problem - and a significant opportunity. We have seen programmable remote control devices evolve to replace the profusion of separate infrared controllers (TV tuner, cable tuner, VCR, laserdisc, audio CD, radio tuner, etc.). The integration of the graphical user interface from the world of desktop computing with the home entertainment/information system has begun.
Collaborative cross-industry efforts will merge computers into home entertainment networks, dealing with the issues of component integration, connection to multiple sources of entertainment and information, user interface, and "user friendly" programming of the system. Various flavors of "personal computers" in the home will be able to connect to this network as well as intelligent appliances and remote control devices. Inexpensive networkable cameras will allow remote visual monitoring of the front door, the baby's room, etc.
To provide specific types of information to users, new classes of specially tuned information appliances will likely develop. These appliances will rely on information providers to collect, generate, and organize information. In the education market, for example, an information appliance might be tuned toward providing everything a student needs to progress through a particular class. Besides basic course content, texts, lecture notes, assignments, etc., it could make extensive use of imagery to provide interactive multimedia tutorials, remedial help, lab simulations, extensive reference material, electronic messaging, and smart links to classmates.
In the information age, a critical challenge is the productive management of the overwhelming amount of information produced each year. Unfortunately, images and video tend to make this problem even greater. While database search engines deal reasonably well with keyword searches and inverted indexes on textual data, corresponding tools for other media have tremendous opportunities for improvement.
Museums and libraries could use electronic file systems to catalog and view very high resolution images of the masters. Sculptures and other three dimensional objects could be shown on stereographic or holographic displays, or printed on very high quality large format printers.
The role of the artist and graphics designer has changed dramatically as the quality and flexibility of the "electronic canvas" has come to emulate the various forms of traditional media. Just as the camcorder has allowed many budding cinematographers to explore their art, high resolution drawing tools with interactive training are revolutionizing electronic publishing and winning over graphic artists. Many artists are expanding into new markets such as videographics and animation from this electronic base.
Traditional printed and published forms of information delivery will continue to exist alongside newer media. Electronic billboards could change messages by day of week or time of day. Electronic books, magazines, catalogs, and advertisements can integrate interactive video and other media to tell a story, make a point, or sell a product. They can also elicit information from the user that can provide useful information to the publisher (e.g., "hard to understand this concept," "would like product in green").
Institutional training represents the high end of the educational market. An economic return on investment can often justify the use of expensive technology to maximize training "productivity" since the employee students are being paid wages while not working. Increased use of sophisticated interactive multimedia tools developed and used in these environments could find derivative use in public classrooms and the home.
This community has often utilized high-end versions of consumer technologies (e.g., TV CRT/Workstation CRT). Their role in leading versus leveraging the next generation of imaging systems is not clear. The existence of a proper digital image architecture will reduce barriers across applications, platforms, and markets.
High resolution imaging can be useful in radiology, microscopy, patient monitoring (especially during surgery), and consultation with specialists in a remote location.
Image requirements can be very stringen