Oracle I/O: Supply and Demand

Bob Sneed (Bob.Sneed@Sun.Com)
SMI Performance & Availability Engineering (PAE)
SUPerG @ Amsterdam, October, 2001
Rev. 1.1

Abstract

Storage performance topics can be meaningfully abstracted in terms of the economic notions of "supply" and "demand" factors. The analogy can be extended into characterizing I/O problems as "shortages", and to the notion of addressing shortages with both supply-side and demand-side strategies.

While there is often much that can be done with SQL tuning and schema design to get a given task done with the minimal amount of I/O, this paper focuses primarily on platform-level configuration and tuning factors. A wide variety of such factors are discussed, including disk layouts, filesystem options, and key Oracle instance parameters.

Practical advice is offered on configuration strategies, tuning techniques, and measurement challenges.

Introduction

The field of Economics offers a wealth of opportunities for phrasing metaphors and analogies relevant to Oracle database I/O. While it is easy to get carried away with excessive creativity along these lines, the basic concept of distinguishing supply issues from demand issues can actually be quite useful for explaining many database I/O options and tradeoffs. This frame of reference is no substitute for proven analysis and tuning methodologies, but it is hoped that it will prove useful in the various processes of architecting, configuring, and tuning Oracle to the Solaris™ Operating Environment (OE) and underlying storage equipment.

While Economics has created a wide variety of modeling strategies and computational techniques, the intention here is to steer away from formalism and merely borrow some economic terminology at an abstract conceptual level. When I/O is involved in a problem at all, it may be possible to improve results with strategies and tactics of either decreasing I/O demand or increasing I/O supply.

After a few introductory metaphors, we will present several key Oracle demand topics, then survey many supply factors in the I/O stack of the Solaris OE − from the bottom up.

1. Economic Metaphors

Much of the language of Economics is easily adaptable to discussion of I/O topics.

− Equilibrium

In Economics, the intersection of supply and demand curves indicates a market equilibrium point. With I/O, statistics observed at the disk driver level represent an empirical equilibrium of sorts, based on the interaction of numerous supply and demand factors. While a great deal of constructive tuning can be accomplished by studying iostat data, these numbers do not really illuminate demand characteristics much at all past the point of observing channel or disk utilization or saturation.
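For example, a minimal way to watch this driver-level equilibrium on a Solaris system is iostat's extended device statistics (flags as in Solaris 8; the interval shown is arbitrary):

    # Extended per-device statistics every 10 seconds; watch the wait/actv
    # queue depths, service times (svc_t), and %b (busy) columns.
    iostat -xn 10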
When business throughput is meeting requirements and expectations, equilibrium measurements are interesting for capacity planning and for spotting operational anomalies. However, when business throughput is not satisfactory, equilibrium measurements may not help clearly identify options for improvement.

− Measuring Supply & Demand

In Economics, demand is said to vary with price, and it is usually very difficult to characterize without conducting experiments. In this discussion, the notion of demand is being used absent the notion of economic cost, but the difficulty of characterization is similar to the economic case.

With I/O, the term I/O Operations Per Second (IOPS) is frequently employed to characterize both empirical and theoretical throughput levels. This statistic is often of limited utility, as theoretical and marketing IOPS numbers are often difficult to match in practice, and the circumstances under which they occur are not cited with any consistency.

Oracle STATSPACK reports (the ancestor of STATSPACK prior to Oracle 8 is BSTAT/ESTAT reporting) are most useful for determining actual I/O statistics per file, since its measurements incorporate queuing delays and throttles from the filesystem and volume manager layers, as well as possible read hits from the OS page cache. In contrast, iostat or Trace Normal Form (TNF) data (see prex(1) et al.) will only show what passes to the physical I/O layer of the OS.

Oracle's measurements are per file, while iostat measurements are per device. Untangling the mapping of files to volumes, channels, and devices can be a laborious task. It may add to the confusion that an Oracle PHYSICAL I/O is a LOGICAL I/O to the OS, and thus may reflect cache hits in the OS page cache. Indeed, to the degree that Oracle's I/O statistics do not correlate with OS I/O statistics, the OS page cache provides the obvious explanation.

Veritas offers statistics at the Volume Manager level for various objects, using its own means of data capture and presentation. With the Veritas Filesystem's Quick I/O (QIO) option, a means is provided to observe per-file OS page cache hits, once again using a unique mechanism.

There exists no comprehensive instrumentation which spans all layers of the I/O stack. Some interesting statistics, such as range of variance at various levels, are not generally observable. There is certainly room for progress in instrumenting the I/O stack to get a clearer picture of what's going on.

Oracle's wait event statistics are the principal tool for observing what Oracle is waiting for, and how much stands to be gained by improvements in specific areas. Guidance in interpreting these statistics is offered in several texts, including [YAPP Method, 1999] and [ORA_9i_PER]. While these tools and techniques undoubtedly provide the best holistic approach to Oracle I/O optimization, close cooperation between DBAs, System Administrators, and Storage Administrators is required. It is hoped that perspectives offered here will be useful in the furtherance of such collaborations.
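As a hedged illustration, the cumulative wait picture can be summarized from the v$system_event dynamic view (a standard Oracle view; connect syntax and time units vary by release):

    -- Which events has the instance spent the most time waiting on?
    SELECT event, total_waits, time_waited
      FROM v$system_event
     ORDER BY time_waited DESC;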
− Quotas & Tariffs

The I/O stack contains a wide assortment of throttle points and fixed limits. Knowing where these are is the key to working with them or working around them, just as everyone looks forward to shopping at duty-free stores.

− I/O Surplus

In Economics, surpluses are associated with lower prices and modern phenomena like inventory write-offs. With I/O, it would appear that the main consequence of excess supply is wasted capital.

In practical terms, an excess I/O supply may have tangible advantages, such as allowing certain application optimizations, or providing the comfort of 'headroom' from a configuration management perspective. Certainly, surplus supply is better than a shortage, and properly configured systems should possess some degree of surplus for the sake of headroom alone.

− I/O Shortage

In Economics, the consequences of shortages include higher prices, longer lines, and maybe even famine or despair. With I/O, the simple analogue to higher prices is increased latency due to queuing, and famine and despair might be terms with which a user could relate when system response is especially poor.
The only effect one normally expects from an I/O shortage is degraded response and longer run times. However, worse things are possible. With Oracle, for example, when using Asynchronous I/O (AIO), any request noted to take longer than 10 minutes usually results in a fatal ORA−27062 incident. (As of Oracle 8.1.7.2, or with patches to selected prior releases, the message "aiowait timed out" may occur 100 times before becoming a fatal condition.) Failure to update a control file can cause a fatal error after 15 minutes.

It is rare that timeouts like these occur purely due to I/O shortages. They are normally expected only in the event of hardware failures. However, it would appear that the vast majority of Oracle timeout events actually result with no hardware error present, and usually at load levels where one would not predict an I/O could possibly take 10 minutes.

− Supply Chain Failures

The I/O stack is like a supply chain, and failures at lower levels cause a domino effect of problems through the higher layers. Here again, there is room for improvement in instrumenting the I/O stack to simplify isolation of problem areas.

− Death and Taxes

Hardware failures and error recovery can seriously skew performance, or even lead to ORA−27062 timeout incidents. Measures taken to insure against such failures, like RAID data protection or Oracle log file multiplexing, can each take their toll − much like a tax or life insurance premium. Hardware failures and compromises to insure against them are just as certain as death and taxes.

Certain bugs have been associated with ORA−27062 incidents, but perhaps the most common cause arises from AIO to filesystem files, due to an issue recently identified with the AIO library and the timesharing (TS) scheduler class (see priocntl(1) and priocntl(2)) by Sun BugID 4468181. In the interest of minimizing deaths from ORA−27062 incidents, some subsequent points of discussion will include their bearing on ORA−27062 exposure.

2. Oracle I/O Demand

Many design and tuning techniques are aimed at reducing I/O demands. Conventional wisdom says that 80% of the gains possible for an application lie in these areas. This conventional wisdom often leads to neglect of many important factors at the Oracle and platform levels. The discussion offered here largely avoids tuning topics found in many books, such as database schema design, application coding techniques, data indexing, and SQL tuning.

Oracle Demand Characteristics

Different Oracle operations have different I/O demand characteristics, and understanding these is important to formulating useful strategies and tactics for tuning. Some vocabulary is introduced here for use in subsequent discussion of controlling Oracle I/O demand.

− Synchronous Writes

All Oracle writes to data files are data synchronous. (This is due to Oracle's use of the O_DSYNC flag in its open(2) operations, but it is inherent with RAW disk access.) That is, regardless of filesystem buffering, writes must be fully committed to persistent storage before Oracle considers them complete. This synchronous completion criterion is quite distinct and separate from whether or not writes are managed as AIO requests.

− Latency, Bandwidth & IOPS

These three common and closely related measures of I/O performance each have their place.

The most latency-sensitive I/O from an Oracle transaction perspective are physical I/O operations on the critical path to transaction completion. This includes all necessary read operations and writes to rollback and log areas. In contrast, checkpoint writes are not latency sensitive per se, but merely need to keep pace with the aggregate rate of Oracle demand.
Decision Support Systems (DSS) and Data Warehouse (DW) loads typically depend a great deal on pure bandwidth maximization. Also, operations like large sorts can depend heavily on effective bandwidth exploitation. In contrast, Online Transaction Processing (OLTP) workload performance frequently hinges more on random IOPS.

− Aggregate Demand

Different programming and tuning factors can significantly impact the total amount of I/O demanded to execute a given set of business tasks. The sum total of I/O required can significantly impact run times and equipment utilization.

− Temporal Dependency

Some things have to happen before other things can happen. This simple notion of time-ordered dependencies is all that temporal dependency means. For example, log file writes must complete before an INSERT or UPDATE transaction can complete, and before the corresponding checkpoint writes can be issued.

− Temporal Distribution

In this context, the term temporal distribution means "distribution of demand over time". The terms "bursty" and "sustained" are common characterizations of temporal distribution. In some cases, tuning to distribute "bursty" demand over time will result in improved service times and throughput.

As an analogy, consider the arrival of a bus at a fast food store. The average service time for each customer will be seriously impacted by the length of the line. If the same customers strolled in individually over a period of time, they would likely enjoy a better average service time. With demand evenly distributed over time, servers might keep more consistently busy delivering good average service times, rather than idly waiting for the next bus.

− Physical Distribution

The mapping of Oracle data files across available channels and storage can be extremely important to database performance. The diagnosis and correction of "hot spots" or "skew" is a common activity for DBAs, system administrators, and storage administrators.

The art of database layout includes application of principles such as I/O segregation, I/O interleaving, and radial optimization. Layout decisions can be significantly constrained by available hardware and project budgets.

Disoptimal layout can decrease I/O supply by causing disk accesses to compete excessively among themselves; by disoptimal use of available cache; or by placing bottlenecks in the critical path of temporally dependent operations.

− Localities of Reference

The concept of localities of reference is that certain areas of a larger set of data may be the focus of frequent accesses. For example, certain indexes might be traversed quite frequently, yet represent a small subset of the total data.

Access within good localities of reference stands to gain from optimal cache retention strategies. Cache efficiency will vary depending on how data retention strategies align with actual localities of reference. Opportunities for caching occur in the Oracle SGA, the OS page cache, and external storage caches. Each offers different controls over retention strategies, and these may be configurable to achieve compounded beneficial effects.

Controlling Oracle I/O Demand

We'll skip the obvious topics of proper database design and index utilization, and focus on the areas typically controlled by an Oracle DBA. Topics discussed here represent several of the I/O issues customers commonly wrestle with in Oracle, and a few of the "big knobs" used to control these factors. Some of these topics pertain to the "Top Ten Mistakes Made by Oracle Users" presented in [ORA_9i_TUN].

− The Critical Path

As mentioned earlier, some I/O operations, such as log writes and rollback I/O, are on the critical path to transaction completion. Others, such as log archive writes and checkpoint writes, are not, but must keep pace with sustained demands. This difference should be kept in mind during physical database design.

For example, the most demanding INSERT/UPDATE workloads may require dedicated disks or low-latency cached subsystems to allow the online logs to keep pace with transactional demand.
− Reduce, Reuse, Recycle

This modern ecological mantra succinctly encapsulates a great deal of the philosophy of I/O tuning and cache management.

   Reduce − the best I/O is one that never occurs. Many programming and tuning techniques have this as the underlying strategy.

   Reuse − this includes a broad variety of topics including pinning objects with DBMS_SHARED_POOL.KEEP, using a separate KEEP pool, and exploiting filesystem buffering options.

   Recycle − this is the common concept underlying priority_paging (prior to OE 8) and use of a separate RECYCLE pool in Oracle.

− System Global Area (SGA) Size

Historically, 32-bit Oracle has allowed SGA sizes up to about 3.5 GB in the Solaris OE. 64-bit Oracle brings the possibility of extremely large SGAs. Indeed, expanded SGA capability is the single distinguishing feature of 64-bit Oracle.

Enlarging db_block_buffers is the principal means of exploiting large system memory. Proper sizing and use of the second largest component of the SGA, the shared pool, is key to performance for many workloads.

At any rate, SGA size must not exceed a reasonable proportion of system memory in relation to other uses. There are tradeoffs to be made regarding the comparative advantages of using memory for SGA versus OS page cache, especially in consolidation scenarios. Regardless of how employed, system memory provides the potential for reducing physical I/O demand by servicing logical reads from memory.

Incidentally, Oracle 9i now allows dynamic online resizing of the SGA in conjunction with the new Dynamic Intimate Shared Memory (DISM) feature of Solaris OE 8.

− Concurrency

I/O concurrency with Oracle can arise from several sources, including:

   The client population.
   Use of PARALLEL features.
   Use of log writer or DB writer I/O slaves.
   Use of AIO.
   Multiple DB writer processes.

Of these, the most common factors impacting demand for write concurrency are AIO and the use of multiple DB writers. Simply put, AIO allows greater I/O demand creation by a smaller number of Oracle processes with the least overhead. Alternate schemes of achieving concurrency, such as using Oracle I/O slaves, are in a different league from the I/O demand facilitated by AIO.

Prior to Oracle 8, AIO and multiple DB writers were mutually exclusive. With earlier versions of Oracle 8, AIO was not used. There have been various issues with AIO over the years which have led many users to disable AIO, often without firm reasoning. Over time, many users have increased DB writers to its maximum setting of 10 to achieve write concurrency without AIO. Some problems have arisen from users re-enabling AIO with maximal DB writer settings already in place.

In theory, each DB writer can issue up to 4096 I/O requests. In practice, we often see DB writer processes queuing about 200 requests each. In either case, these are large numbers relative to the ability of most systems to concurrently service requests. Pushing an I/O backlog from Oracle to the OS increases queuing and correlates with increased rates of ORA−27062 incidents.

Therefore, in short, when using AIO, db_writer_processes should not be increased beyond one (1) without demonstrable business benefit, and tuning upwards from one should be incremental. If AIO is switched off for any reason, increasing DB writers is usually required to attain adequate write concurrency.
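To make the advice concrete, an illustrative init.ora sketch following this guidance might read as follows (parameter names as in Oracle 8i; values are starting points, not recommendations for any particular system):

    # init.ora sketch -- illustrative values only
    disk_asynch_io      = TRUE   # keep AIO enabled
    db_writer_processes = 1      # raise incrementally, and only with
                                 # demonstrable business benefit
    # If AIO must be disabled, write concurrency usually must come
    # from multiple DB writers instead, e.g.:
    #   disk_asynch_io      = FALSE
    #   db_writer_processes = 4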
− Checkpoints & Log File Sizing

In the event of an abnormal shutdown, recovery time will vary in proportion to the number of dirty database blocks in the SGA. The longer a data block is retained in the SGA prior to checkpointing, the more chance it has of absorbing INSERT or UPDATE data, and the greater the efficiency of the checkpoint write when it finally occurs. Deferral of checkpoint writes using large logs leads to bursty checkpoint activity, which in turn makes log archiving a relatively more bursty process, and further implies that log shipping strategies will be very bursty.
Thus, both the temporal distribution of checkpoint writes and aggregate writes over time depend on the size of the logs and the setting of various parameters which control checkpointing. Tuning these factors represents a tradeoff between performance and recovery time.

This has been an area of significant development in Oracle. Historically, Oracle parameters log_checkpoint_interval and log_checkpoint_timeout have impacted this balance. Oracle 8i variously offered db_block_max_dirty_target and fast_start_io_target in this arena. Oracle 9i goes a step further in offering fast_start_mttr_target and related parameters to explicitly control the time expected for recovery. Regardless of the method of control, the tradeoffs remain essentially the same.

One can observe checkpoint activity in the Oracle alert file by setting log_checkpoints_to_alert=TRUE, but interpretation of the resulting messages varies between Oracle releases.
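For instance, a hedged init.ora sketch combining these controls might look like the following (availability of each parameter varies by release, as noted above; values are illustrative only):

    # init.ora sketch -- illustrative values only
    log_checkpoints_to_alert = TRUE    # record checkpoints in the alert file
    log_checkpoint_interval  = 100000  # redo blocks between checkpoints
    log_checkpoint_timeout   = 1800    # maximum seconds between checkpoints
    # Oracle 9i alternative: bound expected crash recovery time directly
    #   fast_start_mttr_target = 300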
− Sorting and Hash Joins

Disk sorts are best avoided or optimized. Hash joins have a great deal in common with sorts, but here we will comment only on sorting.

The obvious and traditional means of reducing disk sorts is to set large values for sort_area_size. Many users are reticent to do this, however, fearing that aggregate memory demand from clients will be excessive. However, only clients that do large sorts will demand a maximal sort_area_size from the OS, so this prospect should not be feared as much as it seems to be.

It is worth noting here that since client shadow processes can write directly into TEMP areas, there is some potential for client-side ORA−27062 incidents arising from intense write contention for these areas. Also, recommendations regarding use of the TEMPORARY tablespace attribute for TEMP areas may vary between Oracle releases.

There are many other parameters pertaining to sort and hash join optimization, and [Alomari, 2001] gives a good discussion of these.
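An illustrative init.ora sketch (sizes are placeholders, to be weighed against available memory and the client population):

    # init.ora sketch -- illustrative values only
    sort_area_size          = 4194304   # 4 MB work area before spilling to disk
    sort_area_retained_size = 1048576   # memory retained after a sort completes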
− Large I/O

The Solaris OE default maximum low-level I/O size is 128 KB (controlled by maxphys in /etc/system, which defaults to 131072), but this can easily be tuned upwards to 1 MB in OE 2.6, or as high as 8 MB in OE 8. Regardless of kernel settings, any process can issue large I/O requests, but large requests may be broken up into smaller requests depending upon kernel settings and volume management parameters. Regardless of whether I/O requests are broken up at the OS level, using fewer larger operations for sustained sequential I/O is typically advantageous due solely to decreased overhead from process context switches.

Factors affecting the supply aspects of large I/O do not fit neatly in the subsequent discussion of I/O supply topics, so we will mention them briefly here. In short, supply-side optimization of I/O requests involves many factors, including disk layout, volume management parameters, and the kernel's maxphys setting. Filesystem factors such as clustering, the UFS maxcontig setting, or data prefetching logic can also have significant impact. The topic of bandwidth optimization is the focus of many I/O whitepapers.
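For example, a hedged /etc/system sketch raising the kernel limit (values are illustrative; the vxio line applies only where VxVM is installed, and its units are 512-byte sectors):

    * /etc/system sketch -- illustrative values only
    * Raise the maximum low-level I/O size from 128 KB to 1 MB
    set maxphys=1048576
    * Corresponding limit at the VxVM layer, in 512-byte sectors
    * set vxio:vol_maxio=2048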
In the case of sustained reading, large reads might not always represent an ideal solution. Factors such as pre-fetching at the storage subsystem and filesystem levels, competition for resources, and 'think time' per unit data may result in optimal throughput with intermediate read sizes. In other words, effective pipelining can conceivably reduce 'dead time' that can occur while waiting for large read requests to complete. In general, however, use of large I/O is often key to attaining maximum sequential throughput.

Most I/O demand from Oracle occurs in db_block_size units, and large I/O is used only in limited circumstances. Parameters affecting large I/O have been the subject of much revision between Oracle releases, but some examples include:

   Full table or index scan operations − reads are (db_file_multiblock_read_count * db_block_size) bytes.

   Disk sorts − sort_write_buffer_size bytes in Oracle 7, dynamically sized in Oracle 8i.

   Datafile creation − writes are issued in ccf_io_size bytes in Oracle 7, or db_file_direct_io_count (DB blocks) in Oracle 8.

Some Oracle versions will revert the setting of large I/O parameters to 128 KB defaults without warning. The best way to confirm the actual I/O size being used by an Oracle process is to use the Solaris truss(1) command.
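For example (a hedged sketch; the process ID is a placeholder, and the exact system call names seen will vary with release and word size):

    # Show the actual request sizes issued by one Oracle process;
    # the byte count appears as an argument of each read/write call.
    truss -t read,write,pread,pwrite -p 12345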
Setting large values of Oracle's db_file_multiblock_read_count has the side effect of biasing the optimizer's preference towards using full scans. Because of this, there may be tradeoffs between different queries. One workaround to this is to implement large reads at the session level (i.e., ALTER SESSION) so that the impact is not global.
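A hedged illustration (with an 8 KB db_block_size, the value shown yields 512 KB scan reads, which also presumes the kernel and volume manager permit transfers that large):

    -- Enable large scan reads for this session only, leaving the
    -- instance-wide optimizer bias unchanged
    ALTER SESSION SET db_file_multiblock_read_count = 64;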
Memory-constrained systems may suffer from having excessive amounts of memory locked down due to active large I/O operations. For these reasons, the desirability of exploiting large I/O will vary.

As a guideline, DSS and DW systems will tend to enjoy large reads, while OLTP systems may not. Using large reads in conjunction with wide disk striping is a common formula for optimizing sequential performance. Tuning read size to exploit filesystem and hardware pre-fetch characteristics is another means of improving realized bandwidth.
− Data Checksumming

Oracle allows round-trip data integrity validation by generating checksum data for each data block written, and validating the checksum each time data is read. While the algorithm used is extremely efficient, it does impose some additional latency on every read and write, and therefore has a slight throttling effect on demand. Also, for each read satisfied from the OS page cache, the data must be re-validated, thus creating some feature bias in favor of a larger SGA and unbuffered filesystem I/O.

While this feature was default OFF in earlier Oracle releases, db_block_checksum is TRUE by default in Oracle 9i.
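An illustrative init.ora line (the latency tradeoff discussed above applies; defaults vary by release):

    # init.ora sketch -- round-trip block validation
    db_block_checksum = TRUE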
− Backup Schemes

Time spent in HOT BACKUP impacts aggregate demand from log and log archive writes, because entire DB blocks must be logged for tablespaces that are in BACKUP mode.

Data snapshot technologies, such as Sun StorEdge™ Instant Image, Veritas Volume Replicator (VVR), or Sun StorEdge™ 9900 ShadowImage, may be directed at reducing this impact.

3. I/O Supply Topics

This section is an annotated survey of various I/O supply factors in the I/O stack from the bottom up. High-impact topics spring up at all layers of the stack. Much as a chain is only as strong as its weakest link, I/O supply can only be as good as the worst bottleneck will allow.

Supply Characteristics

Some additional terminology is useful for discussing I/O supply factors.

− Latency vs. Bandwidth

Defined simply, latency is the time required for a single operation to occur, while bandwidth characterizes the realized rate of data transfer. Both latency and bandwidth will vary in part with the size of the requested operations.

In latency terms, there are three main approaches to I/O performance issues:

1. Reduce baseline latency, such as by adding cache, improving cache retention strategies, or by a baseline technology upgrade.
2. Improve work per operation, such as by using a larger I/O size.
3. Increase concurrency, such as by adding channels or disks, spreading demand across available resources, or placing more demands upon existing resources.

Opportunities for these manipulations may be presented at various layers of the I/O stack.

− Variance

In the case of a single simple disk, variance in service times is explained primarily by mechanical factors, such as rotational speed and seek time. In the absence of hardware failure, one could say that with such a simple system, the variance is actually bounded by these well-known mechanical factors. In real systems, sources of variance are diverse, and include:

   Process prioritization (both for LWP threads and simple UNIX processes).
   Geometry-based wait queue sorting.
   Cross-platform competition in SANs.
   Cache utilization.
   Bugs in the I/O stack.

Variance is infrequently measured and characterized, but no advanced math or statistics is needed to characterize the most basic measure of variance − range. It is one thing to observe a range of I/O service times in the range of, say, 5 to 500 milliseconds, but quite another to observe variance on the range of 5 milliseconds to 5 minutes!

Concern over variance might be summarized like this:

1. Sources of variance should be identified and scrutinized.
2. Sources of unbounded variance are bad and beg to be eliminated.
3. Sources of bounded variance beg to be purposefully controlled.

I/O statistics are most often cited as averages and sums, and most folks expect variance to be 'reasonable' about the mean. Certainly, ORA−27062 incidents are categorically examples of when this fails to occur.

− Competition

I/O supply to Oracle is net of any competition. In other words, one can say that the supply of I/O to Oracle is diminished by competing I/O demands involving the same resources.

Possible sources of competition are numerous, and some are easily overlooked. Consider these:

   Backup and recovery operations
   Volume reconstruction, mirror re-silvering
   Physical data reorganization
   Transient utility operations
   Other hosts in a SAN
   Data services (eg: volume copies)
   Committal of previously cached write data

Even when these activities occur 'off host', they can compete for head movements and decrease the net I/O supply to Oracle, both in terms of IOPS and latencies.

− Throttles

Anything that extends the code path for an I/O implicitly adds to I/O latency. Compared to other factors, marginal code path costs are relatively minor in practice. Some components in the I/O stack explicitly enforce certain limitations.

Supply Factors − Bottom Up

Each layer in the I/O supply chain can contribute significant complexity. This survey of the I/O stack discusses a limited set of issues at each layer.

− The Disk Spindle

The ultimate limiting factors to sustainable IOPS in a complex system are the number of disks and their operational characteristics. Effective cache utilization at the system and storage subsystem layers may increase the supply of logical IOPS, but when caches and buses are saturated, it is the humble disk spindle that sets the speed limit.

The primary disk characteristics are rotational speed and seek time. Disks vary in many other ways though, including number and type of access paths; quantity of onboard cache and intelligence of onboard cache management; use of tagged commands; and sophistication of error recovery logic. From time to time, significant disk performance issues may be discovered in disk firmware.

− Storage Subsystems

The range of features available from storage subsystems has mushroomed in recent years.
The spectrum of subsystems ranges from the basic JBOD array (JBOD means 'Just a Bunch Of Disks', and has come to be common terminology in the industry) to sophisticated RAID subsystems and Storage Area Networks (SANs). Even humble JBOD arrays have developed sophisticated features such as environmental monitoring and fault isolation capabilities.

Intelligent arrays can take complexity off the host and facilitate new data management paradigms, such as off-host backups, contingency copies, and sophisticated disaster recovery strategies, in addition to providing basic RAID features. For this discussion of I/O supply, only a small subset of these features will be covered, including RAID, caching, and data services.

− Storage Subsystems: RAID

When implemented in an integrated subsystem, RAID features are variously called 'hardware RAID', 'box-based RAID', or 'external RAID'.

In reviewing various RAID levels, an essential point to note is that 'striping' across disks is essential to achieving sustainable bandwidth greater than that of a single spindle. RAID 0 is merely striping across disks with no data protection features. RAID 5 and similar schemes also involve striping across disks.

In the case of RAID 5, since parity must be generated and saved to disk, write activity will have the effect of causing additional mechanical activity that may compete with read requests. Sophisticated integrated storage subsystems like the Sun StorEdge™ T3 or the Sun StorEdge™ 9900 series arrays are designed to minimize this impact by intelligent I/O scheduling.

In the event of a complete disk failure in a RAID 5 set, all disks in the set must be read to reconstruct lost data, and that process can decrease the net supply of IOPS quite seriously. Once a replacement disk is installed, the process of rebuilding a failed spindle involves reading all of the surviving disks completely, and that can produce significant competition with user demand. RAID 5 implementations typically feature some means of tuning the rebuild rate to avoid unacceptable competition with live workloads. Some more sophisticated RAID 5 implementations, such as in the Sun StorEdge™ 9900 series arrays, may predict the likely failure of a particular disk, and evacuate its data far more efficiently than by RAID 5 reconstruction.

It is sometimes overlooked that in the absence of writes, RAID 5 read performance should be expected to be comparable with RAID 0 performance, since RAID 5 parity data is usually never referenced unless a disk fails. (An exception would be 'parity checking' operations implemented in some subsystems.)

− Storage Subsystems: Caching

Common to hardware RAID implementations is the use of persistent cache memory. Cache management in external RAID offers the possibility of enhancing I/O supply to attached systems variously by caching writes or by facilitating pre-fetching or other intelligent data retention schemes.

When external cache is disabled or saturated, subsystem performance as seen by attached hosts will degrade. External cache will typically cease to be used for caching writes if its battery backup becomes dubious, but read caching is typically not impacted when cache persistence becomes questionable. Control over caching policies, such as pre-fetching or random read data retention, varies tremendously between different subsystems, and such policies are often controllable on a per-LUN basis. (A LUN, or Logical Unit Number, is the term most often used to designate a virtual disk managed by an intelligent storage subsystem.)

One simple fact that is often overlooked in evaluating empirical I/O statistics is that writes which appear very low-latency to a host are not truly complete until they are flushed from the external cache. This flushing activity can compete with new I/O requests for head movement and therefore increase latency, especially for new reads. Correlation of current I/O activity with recent writes may be difficult with host-based tools, but one should keep in mind that "what goes up, must come down".

It is widely considered that RAID 5 performs less well on writes than alternatives such as RAID 1. While this may be true for host-based RAID 5, it is not categorically true for hardware-based RAID solutions. So long as write demand remains at levels that do not saturate the cache, all RAID levels look essentially the same to attached hosts so far as writes are concerned. That is, writes to cache will be very low latency until and unless the cache becomes saturated.
− Storage Subsystems: Data Services

Persistent cache in external arrays can also be leveraged in implementing efficient box-based data services, such as Sun StorEdge™ 9900 ShadowImage and Sun StorEdge™ 9900 TrueCopy. As with host-based data services, these can contribute some competition and variance to the I/O stack. Any synchronous data forwarding product can inherit tremendous variance from the Quality of Service (QOS) of the underlying data link.

− Layout Strategies

Whether or not an intelligent storage subsystem is involved, placement of data across available disks can have significant impact on the actual supply of I/O. Regardless of how it comes to be, if two 'hot spots' come to reside at opposite ends of the same disk spindle, performance will suffer. With modern SANs, competition for head movement may even originate from different hosts.

The trend toward higher disk densities and lower IOPS per disk gigabyte has led to substantially different layout strategies from just a few short years ago, when use of many smaller spindles was the norm. Right or wrong, the most popular strategy a few years ago was to segregate I/O by class, whereas the more popular strategy now is far more biased towards interleaving I/O classes.

Oracle has promoted a "Stripe and Mirror Everything" or "SAME" approach, while other writers have presented similar concepts variously as "Wide-Thin Striping" or "Plaid" layout strategies. The common underlying notion is to distribute all (or most) Oracle I/O uniformly across a large spindle population, thereby distributing demand across available spindles and minimizing the likelihood of excessive queuing for any single disk. These strategies are often implemented using a combination of hardware-based RAID and host-based RAID.

Another consideration commonly made in layout strategies is the fact that the outer cylinders of disks contain more data (due to variable formatting known as Zone Bit Recording, or 'ZBR'), and therefore provide slightly better performance than the innermost cylinders. Exploitation of this characteristic is sometimes called 'radial optimization'.

In practice, disk layout is something of an art, though its underpinnings are all scientific. It can be helpful, if not essential, to construct drawings of available disks, channels, and their usage to be able to comprehend any given disk layout. Especially when a great deal of the complexity may be housed in an intelligent storage subsystem, effective visualization of the issues may otherwise be impossible.

− Disk Interconnect

The most conspicuous characteristics of a disk interconnect are its signaling rate and its protocol overhead. 100 MB/sec (1 gigabit per second) Fibre Channel Arbitrated Loop (FC/AL) interconnects are now the commonplace building blocks for modern SAN architectures. Speeds of 2 Gbit/sec and beyond are on the horizon, as are alternate protocols such as iSCSI and InfiniBand.

Besides the obvious bandwidth limitations imposed by the native signaling rate, there are two aspects of FC/AL that deserve special mention. First, as a loop-based protocol, some scaling issues can arise if too many targets share the same loop. Second, due to the 'arbitrated loop' aspect of FC/AL, its performance can degrade rapidly with distance. For this reason, SAN switches employing more efficient Inter-Switch Link (ISL) protocols may be required to effectively extend FC/AL over distances.

Alternate interconnect paths can variously serve to add resilience to a system or be exploited for performance. Management of alternate paths is implemented in several software products above the interconnect layer. Multiplexed I/O (MPXIO) provides OS-level awareness of multiple paths as an option for Solaris OE 8. Veritas Volume Manager provides a Dynamic Multipathing (DMP) capability. Alternate paths to a Sun StorEdge™ 9900 series array can be managed with Sun StorEdge™ 9900 Dynamic Link Manager. Each of these products has its own feature set and capabilities with respect to I/O supply and error handling.
− Interface Drivers

A step above the host bus adaptor (HBA) hardware is a device driver software stack. This stack begins with HBA-specific drivers (eg: esp, isp, fas, glm, pln, soc, socal), which are integrated using the Device Driver Interface (DDI) framework. Higher-level functions are provided by target drivers like sd (for SCSI devices) and ssd (for FC/AL devices). Some key quotas and controls for disk I/O are implemented in the sd and ssd drivers. Despite the close relationship between the sd and ssd drivers, their controls are distinct. Generally speaking, variables discussed here for the sd driver are the same in the ssd driver, but are named with ssd as the initial letters.

Error recovery parameters, such as sd_retry_count and sd_io_time, control the time required for the driver stack to return control to higher layers of the I/O stack in the event of hardware failures. By default, these allow some failures to take up to five minutes to be reported to upper layers. Varying these controls can be tricky business, as they also pertain to the resilience of systems during power-up operations. This is a topic that begs for further development.

The maximum number of concurrent requests that can be passed to the HBA driver is controlled by sd_max_throttle. While certain HBA hardware or attached storage may be limited with respect to its capacity for concurrency, this setting has systemwide impact which can have adverse performance implications. By default, this value is 256, which maximally exploits SCSI tagged command queuing capabilities.
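As a hedged illustration, such driver quotas are set in /etc/system (the value shown is a common vendor-recommended starting point for external RAID, not a universal recommendation; the ssd form applies to FC/AL devices):

    * /etc/system sketch -- illustrative values only
    * Limit requests queued per target; many storage vendors
    * recommend far less than the default of 256
    set sd:sd_max_throttle=64
    * set ssd:ssd_max_throttle=64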
                                                    introduce some variance, throttling, and
Several systemwide SCSI features can be
                                                    temporal dependencies in maintaining data
controlled    by     the     bitmask     variable
                                                    structures such as bitmaps.
scsi_options, including tagged command
queuing and control over the maximum transfer       As with their externally−based counterparts,
rate which can be negotiated for SCSI transfers.    synchronous remote replication products can
Control over specific interfaces and targets may    introduce significant or pathological throttling
be available at the HBA driver level, such as       and variance depending on the Quality of
described in glm(7D). It would appear that the      Service (QOS) of the network interconnects.
scsi_options parameter is frequently                Nevertheless, as a class, these products
manipulated by third party storage vendors,         effectively facilitate a wide range of cost
perhaps often when it should not be. These          effective business solutions. System designs


    11
should incorporate appropriate considerations          One VxVM issue to which ORA−27062
to assure success with these products. For             incidents were explicitly tied pertained to
example, cached and/or dedicated storage               systems using Dirty Region Logging (DRL’s)
might be allocated for bitmap data structures.         with mirrored volumes11 under heavy load. An
                                                       I/O could get ’hung’ awaiting a DRL update,
− Volume Management                                    thus leading to extreme variance.
Volume managers, such as the Veritas Volume            Another recent issue was noted in VxVM 3.0.1
Manager (VxVM) or Solstice DiskSuite™ (SDS)            where the default read policy for mirrored data
provide myriad techniques for manipulating             was changed from a prior default that tended to
I/O supply. The code path overhead of Volume           favor one side of the mirror for a series of
Management is usually considered to be so              sustained reads12. Under the prior default with
slight as to be insignificant, but several topics in   JBOD arrays, sustained reading would benefit
volume management can have extreme                     from revisiting the spindle cache of the chosen
significance. Volume Managers provide host−            disk. This issue surfaced in terms of reduced
based RAID capabilities, which is where many           supply of sequential performance.
layout decisions are typically implemented.
                                                       − Filesystem Cache
Host−based RAID 0, RAID 1, and hybrids of the
two actually impose very little overhead               Simply put, filesystem caching is bad for
relative to the native capabilities of the attached    synchronous writes, but good for reads that are
storage.      Host−based RAID 5, which is              satisfied from it. That being said, it may not be
implemented at the Volume Manager layer, is            simple to determine the optimal filesystem
entirely another matter. Since host memory is          caching scheme for a given Oracle workload.
not persistent like the caches of hardware RAID        Filesystem cache performance and               its
subsystems, RAID 5 parity data must be flushed         contribution to Oracle performance can vary
along with each synchronous write in realtime,         greatly depending upon caching options used
and this is typically disastrous relative to           and the amount of memory available. With 32−
database write requirements. On the other              bit Oracle, use of the filesystem cache is a
hand, some very large DSS databases might              principal means of exploiting large server
rationally employ economical host−based                memories.
RAID 5 for predominantly read−oriented tasks           In the Solaris OE, the filesystem cache is not a
where the write penalty is not too severe with         fixed size area, but rather lives in the OS page
respect to the rate of data updating. Host−            cache, so the terms are used interchangeably in
based RAID 5 exhibits very good read                   this discussion. See [Mauro & McDougall,
bandwidth.                                             2001] for more information on Solaris memory
Due to the popularity of VxVM on Sun™                  management.
systems, VxVM issues tend to have a high rate          The ’extra copy penalty’ often cited against
of impact when they occur. Therefore, a few            using filesystem cache is not highly significant
details specific to VxVM are offered here.             in the overall scheme of things. Copying data in
With mirrored volumes implemented in VxVM,             memory is an operation at which modern Sun™
recovery time after an abnormal system                 systems are particularly adept.
shutdown can be greatly reduced by
implementing Dirty Region Logging (DRL’s). It
                                                       − Filesystem Block Size
is generally recommended to group all DRL              Since    Oracle    I/O     is   typically   in
areas for a system on a few disks dedicated to         db_block_size increments and 8 KB blocks
that purpose. When properly implemented,               or larger are becoming the norm, it is usually
DRL      maintenance    overhead     is    often       advised to use a VxFS filesystem block size of
characterized as reducing write supply by about        8 KB. Smaller block sizes will decrease supply
10%. Without DRL’s, recovery after a crash can         and cost in terms of CPU usage when used on
require complete resilvering of disrupted
mirrors.                                                    11
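As a sketch only (the disk group and volume names are hypothetical, and options vary by VxVM release; see vxassist(1M)), a mirrored volume with a DRL might be created as:

    # Create a 2 GB two-way mirrored volume with a dirty region log.
    # "oradg" and "datavol" are hypothetical names.
    vxassist -g oradg make datavol 2g layout=mirror,log nmirror=2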
One VxVM issue to which ORA−27062 incidents were explicitly tied pertained to systems using Dirty Region Logging with mirrored volumes under heavy load (see Bug Report #4333530, fixed at VxVM 3.1 and by current patch to version 3.0.4). An I/O could get ’hung’ awaiting a DRL update, thus leading to extreme variance.

Another recent issue was noted in VxVM 3.0.1, where the default read policy for mirrored data was changed from a prior default that tended to favor one side of the mirror for a series of sustained reads (see Bug Report #4255085, fixed at VxVM 3.1). Under the prior default with JBOD arrays, sustained reading would benefit from revisiting the spindle cache of the chosen disk. This issue surfaced as a reduced supply of sequential performance.

− Filesystem Cache

Simply put, filesystem caching is bad for synchronous writes, but good for reads that are satisfied from it. That being said, it may not be simple to determine the optimal filesystem caching scheme for a given Oracle workload. Filesystem cache performance and its contribution to Oracle performance can vary greatly depending upon the caching options used and the amount of memory available. With 32−bit Oracle, use of the filesystem cache is a principal means of exploiting large server memories.

In the Solaris OE, the filesystem cache is not a fixed−size area, but rather lives in the OS page cache, so the terms are used interchangeably in this discussion. See [Mauro & McDougall, 2001] for more information on Solaris memory management.

The ’extra copy penalty’ often cited against using the filesystem cache is not highly significant in the overall scheme of things. Copying data in memory is an operation at which modern Sun™ systems are particularly adept.

− Filesystem Block Size

Since Oracle I/O is typically in db_block_size increments, and 8 KB blocks or larger are becoming the norm, it is usually advised to use a VxFS filesystem block size of 8 KB. Smaller block sizes will decrease supply and add CPU cost when used on Oracle data filesystems that never see I/O requests smaller than 8 KB.

By default, VxFS selects the filesystem block size at the time of filesystem creation based on filesystem size, without regard for its intended use. Therefore, one must explicitly specify a block size option to mkfs (see mkfs_vxfs(1M)) to assure consistent results.
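For example (the device path is hypothetical; see mkfs_vxfs(1M)):

    # Request an 8 KB VxFS block size explicitly at creation time.
    # The device path is hypothetical.
    mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/oradg/datavol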
On UFS, 8 KB is the only blocksize permitted on modern Sun™ systems.

− Filesystem Throttles

UFS has tunable thresholds at which it will suspend and resume processes with uncommitted deferred write data, and thus prevent a single process from overrunning the page cache or using an inequitable share. A ’high−water mark’ (UFS:UFS_HW) and ’low−water mark’ (UFS:UFS_LW) implement this throttle.
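A sketch of raising these marks in /etc/system follows; the values are illustrative only, and defaults and suitability vary by Solaris OE release:

    * /etc/system entries -- illustrative values only.
    * A process is suspended when its deferred UFS write backlog
    * reaches ufs_HW bytes, and resumed at ufs_LW bytes.
    set ufs:ufs_HW=16777216
    set ufs:ufs_LW=8388608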
                                                        code path.
A similar throttle is introduced with VxFS Version 3.4. Absent such a throttle, the importance of aggressive Virtual Memory Management tuning with deferred buffered I/O was magnified (see [Sneed, 2000]).

The mere presence of a filesystem code path has some I/O throttling effect relative to RAW disk, but this turns out to be a very minor factor in most cases.

− Filesystem Single−Writer Lock

Perhaps the single most significant factor that limits Oracle write throughput with filesystems is the "single writer lock" constraint required by POSIX standards to assure proper ordering of synchronous writes. Increased write concurrency at the Oracle level is for naught when the filesystem serializes the writes. The impact of the single writer lock is reduced with externally cached storage, which provides low write latency and correspondingly lowered lock dwell times.

Since Oracle handles its own write ordering considerations, the single writer lock common to standards−conforming filesystems is both unnecessary and injurious to Oracle I/O performance. There are several ways to avoid the single writer lock altogether (an example mount command follows this list), including:

    Using RAW disk volumes, LUNs, or slices.
    Using VxFS Quick I/O (QIO).
    Using the UFS concurrent forced direct I/O feature (forcedirectio) available in Solaris OE 8, Update 3 or later.
    Using the Sun™ QFS filesystem Q−Write feature.
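For example, the UFS option can be applied at mount time (the device and mount point are hypothetical; see mount_ufs(1M)):

    # Mount an Oracle data filesystem with forced direct I/O.
    # Device and mount point are hypothetical.
    mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /u01/oradata

The same option can be made persistent in /etc/vfstab.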
For users on filesystems, implementing VxFS QIO, UFS forcedirectio, or Sun™ QFS Q−Write requires no movement of data. In contrast, conversion from filesystem to RAW would require completely offloading and reloading the data, as would conversion between filesystems (although in some cases, tools may exist to allow in−situ conversion of filesystems).

Other points to ponder about these options include:

    QIO and RAW share the advantage of exploiting the Kernel Asynchronous I/O (KAIO) code path, which is capable of supplying more IOPS than the LWP AIO code path.
    QIO offers per−file control of filesystem caching (’Cached QIO’) and the ability to observe read hit rates in the OS page cache, while UFS forcedirectio categorically disables OS buffering (a vxtunefs sketch follows this list).
    QIO introduces some operational complexity relative to UFS forcedirectio.
    QIO is a separately licensed feature, while UFS forcedirectio comes with the Solaris OE.
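As an illustration of the Cached QIO control (the mount point is hypothetical, and exact syntax varies by VxFS release; see vxtunefs(1M)):

    # Enable cached QIO for Quick I/O files on one filesystem.
    # /u02/oradata is a hypothetical mount point.
    vxtunefs -o qio_cache_enable=1 /u02/oradata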
Avoidance of the single writer lock and use of the KAIO code path are known techniques for reducing the probability of ORA−27062 incidents. Each option for doing this has its own pros and cons.

− Asynchronous I/O (AIO)

The Solaris OE offers two alternate AIO Application Programming Interfaces (API’s). One conforms to the POSIX Real Time (RT) specifications (eg: aio_read(3RT), aio_suspend(3RT), et al), implemented in librt(3LIB); the other conforms to SUNW private specifications (eg: aioread(3aio), aiowait(3aio), et al), implemented in libaio(3LIB).

Oracle uses the SUNW API.
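One way to confirm which AIO code path a given Oracle process is actually exercising is to watch it briefly with truss(1); kaio activity indicates the KAIO path, while user−level AIO appears as pread/pwrite activity from LWPs. The process ID below is hypothetical:

    # Trace AIO-related activity of a DB writer or shadow process.
    # 1234 is a hypothetical process ID; tracing adds overhead.
    truss -f -t kaio,pread,pwrite -p 1234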
The SUNW implementation in libaio includes a hard−coded throttle called _max_workers, which limits the number of active requests per process. This value is set to 256 (it was 50 prior to Solaris OE 2.6). A process can queue requests past this limit subject to available memory, but queuing far past this limit offers no tangible benefit.

For more information on SUNW AIO implementation and usage, see [Mauro & McDougall, 2001] and [McDougall & Gummuluru, 2001].

Kernel Asynchronous I/O (KAIO), which is used on RAW and VxFS QIO files, is often cited as yielding 10−12% gains on extremely write−intensive benchmark workloads. However, for many workloads its impact may be negligible, especially considering that many database workloads are either extremely read−biased or more constrained by bandwidth than by write concurrency.

As mentioned earlier, it is not advisable to use too many DB writer processes when AIO is used.

− Process Scheduling

Last but not least, Oracle processes cannot generate I/O demand unless they are in memory, ready to run, and not deprived of needed CPU and memory resources. Aside from the topics of having the CPU and memory appropriately sized for the workload, this relates to several system management topics, including Virtual Memory Management (VMM) configuration (see [Sneed, 2000] for more information on VMM tuning and Intimate Shared Memory (ISM) usage with Oracle) and manipulation of process priorities.

Process priority manipulation is a rather complex topic in the Solaris OE, and a wide assortment of tools and techniques is aimed at it. dispadmin(1M) and related topics from the ’SEE ALSO’ section of the dispadmin man page are the main tools for manipulating priority schemes. Other facilities include the venerable nice(1) command, processor sets (psrset(1M)), and the separately licensed Solaris Resource Manager™.
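For example, the configured scheduling classes and the class and parameters of a given process can be inspected non−invasively (the process ID is hypothetical):

    # List the scheduling classes configured on this system.
    dispadmin -l

    # Display the scheduling parameters of hypothetical process 1234.
    priocntl -d -i pid 1234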
It is noteworthy that each AIO request to filesystem files involves a lightweight process (also called an ’LWP’ or ’thread’), and that these threads have their priorities shuffled in the same manner as other threads in the TS scheduler class. This can give rise to variance in response times seen by Oracle, and with extreme loads, this variance can be pathological.

Applied incorrectly, priority manipulations can lead to seriously adverse outcomes.

Conclusion

Hopefully, some of the perspective and details offered here will be found useful in applied terms.

Each layer of the I/O stack can be meaningfully discussed in terms of its impact on the supply of I/O to the layers above it. I/O demand results in an equilibrium that depends on numerous supply factors. This perspective can be useful in comprehending the complexity of the I/O stack, and in understanding I/O options and tradeoffs.

References

Oracle publications are generally available online via the Oracle Technology Network (OTN) Web site at http://otn.oracle.com. Membership is free.

[ORA_9i_PER] − "Oracle 9i Database Performance Guide and Reference", Oracle Corporation, Part No. A87503−02, 2001.

[ORA_9i_TUN] − "Oracle 9i Database Performance Methods", Oracle Corporation, Part No. A87504−02, 2001.

[Alomari, 2000] − "Oracle 8i and UNIX Performance Tuning", Ahmed Alomari, Prentice Hall, September 2000, ISBN 0130187062.

[Mauro & McDougall, 2001] − "Solaris Internals", Jim Mauro & Richard McDougall, Prentice Hall, 2001, ISBN 0−13−022496−0.

[YAPP Method, 1999] − "Yet Another Performance Profiling Method (YAPP Method)", a whitepaper by Anjo Kolk, Shari Yamaguchi & Jim Viscusi of Oracle Corporation, 1999.

[McDougall & Gummuluru, 2001] − "Oracle Filesystem Integration and Performance", a whitepaper by Richard McDougall and Sriram Gummuluru of Sun Microsystems, January 2001.

[Sneed, 2001] − "Sun StorEdge™ Fast Write Cache Application Notes", a whitepaper presented at the April 2001 Sun Users and Performance Group (SUPerG) conference.

[Sneed, 2000] − "Sun/Oracle Best Practices", a Sun Blueprints™ Online article available at http://www.sun.com/blueprints/0101/SunOracle.pdf.

Acknowledgments

Special thanks to Geetha Rao (Veritas), Vance Ray (Veritas), Yuriy Granat (Oracle), and Jamie Vuong (Oracle) from the Veritas/Oracle/Sun Joint Escalation Center (VOS JEC) for all their contributions. Thanks to Jim Viscusi (Oracle), Rey Perez (Sun), and Elizabeth Purcell (Sun) for their review and feedback. Thanks also to Janet, my wife and volunteer editor. This document was created using StarOffice™ software.

© 2001 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Solaris, Solaris Resource Manager, StarOffice, Sun Blueprints, Sun QFS, Sun StorEdge, Sun StorEdge Instant Image, Sun StorEdge Network Data Replicator, Sun StorEdge T3, Sun StorEdge 9900, Sun StorEdge 9900 ShadowImage, Sun StorEdge 9900 TrueCopy, Sun StorEdge Fast Write Cache, Sun StorEdge SRC/P Intelligent SCSI RAID Controller System, and Solstice DiskSuite are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States of America. All SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
Más contenido relacionado

Similar a Oracle I/O Supply and Demand

Keynote by Mario Derba at Oracle Extreme Performance Tour
Keynote by Mario Derba at Oracle Extreme Performance TourKeynote by Mario Derba at Oracle Extreme Performance Tour
Keynote by Mario Derba at Oracle Extreme Performance TourMario Derba
 
Lecture about SAP HANA and Enterprise Comupting at University of Halle
Lecture about SAP HANA and Enterprise Comupting at University of HalleLecture about SAP HANA and Enterprise Comupting at University of Halle
Lecture about SAP HANA and Enterprise Comupting at University of HalleTobias Trapp
 
Operation research ppt
Operation research pptOperation research ppt
Operation research pptLakshmiPriyaM6
 
Enterprise application characteristics
Enterprise application characteristicsEnterprise application characteristics
Enterprise application characteristicsSalegram Padhee
 
Oracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethodsOracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethodsAjith Narayanan
 
Capital Investment Industrial Modeling Framework - IMPRESS
Capital Investment Industrial Modeling Framework - IMPRESSCapital Investment Industrial Modeling Framework - IMPRESS
Capital Investment Industrial Modeling Framework - IMPRESSAlkis Vazacopoulos
 
Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...
Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...
Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...IRJET Journal
 
Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...
Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...
Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...IBM India Smarter Computing
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Datalink oracle backup_recovery_white_paper
Datalink oracle backup_recovery_white_paperDatalink oracle backup_recovery_white_paper
Datalink oracle backup_recovery_white_paperMazen Orabi
 
Migration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyMigration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyDonna Guazzaloca-Zehl
 
EA & SOA AFCEA -Infotech2007
EA & SOA AFCEA  -Infotech2007EA & SOA AFCEA  -Infotech2007
EA & SOA AFCEA -Infotech2007Stephen Lahanas
 
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINAN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINcscpconf
 
PwC Tech Forecast Winter 2009
PwC Tech Forecast Winter 2009PwC Tech Forecast Winter 2009
PwC Tech Forecast Winter 2009Kurt J. Bilafer
 
big-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdfbig-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdfssuserd397dd
 
Oracle R12 OBIEE, Better Decisions Faster with Advanced Analytics
Oracle R12 OBIEE, Better Decisions Faster with Advanced AnalyticsOracle R12 OBIEE, Better Decisions Faster with Advanced Analytics
Oracle R12 OBIEE, Better Decisions Faster with Advanced AnalyticsSpiro (Stuart) Patsos
 
Thoughts on a research platform architecture: Simplify your application portf...
Thoughts on a research platform architecture: Simplify your application portf...Thoughts on a research platform architecture: Simplify your application portf...
Thoughts on a research platform architecture: Simplify your application portf...Pistoia Alliance
 

Similar a Oracle I/O Supply and Demand (20)

Keynote by Mario Derba at Oracle Extreme Performance Tour
Keynote by Mario Derba at Oracle Extreme Performance TourKeynote by Mario Derba at Oracle Extreme Performance Tour
Keynote by Mario Derba at Oracle Extreme Performance Tour
 
Lecture about SAP HANA and Enterprise Comupting at University of Halle
Lecture about SAP HANA and Enterprise Comupting at University of HalleLecture about SAP HANA and Enterprise Comupting at University of Halle
Lecture about SAP HANA and Enterprise Comupting at University of Halle
 
Operation research ppt
Operation research pptOperation research ppt
Operation research ppt
 
Enterprise application characteristics
Enterprise application characteristicsEnterprise application characteristics
Enterprise application characteristics
 
Oracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethodsOracle databasecapacityanalysisusingstatisticalmethods
Oracle databasecapacityanalysisusingstatisticalmethods
 
Capital Investment Industrial Modeling Framework - IMPRESS
Capital Investment Industrial Modeling Framework - IMPRESSCapital Investment Industrial Modeling Framework - IMPRESS
Capital Investment Industrial Modeling Framework - IMPRESS
 
Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...
Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...
Autonomous Platform with AIML Document Intelligence Capabilities to Handle Se...
 
Business Impacts on SAP Deployments
Business Impacts on SAP DeploymentsBusiness Impacts on SAP Deployments
Business Impacts on SAP Deployments
 
Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...
Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...
Storage for Oracle 11g IBM Storwize V7000 Unified Provides Enterprise-Class V...
 
Planuling & Phasing
Planuling & PhasingPlanuling & Phasing
Planuling & Phasing
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Datalink oracle backup_recovery_white_paper
Datalink oracle backup_recovery_white_paperDatalink oracle backup_recovery_white_paper
Datalink oracle backup_recovery_white_paper
 
Migration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication TechnologyMigration to Oracle 12c Made Easy Using Replication Technology
Migration to Oracle 12c Made Easy Using Replication Technology
 
EA & SOA AFCEA -Infotech2007
EA & SOA AFCEA  -Infotech2007EA & SOA AFCEA  -Infotech2007
EA & SOA AFCEA -Infotech2007
 
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINAN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
 
PwC Tech Forecast Winter 2009
PwC Tech Forecast Winter 2009PwC Tech Forecast Winter 2009
PwC Tech Forecast Winter 2009
 
big-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdfbig-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdf
 
Oracle R12 OBIEE, Better Decisions Faster with Advanced Analytics
Oracle R12 OBIEE, Better Decisions Faster with Advanced AnalyticsOracle R12 OBIEE, Better Decisions Faster with Advanced Analytics
Oracle R12 OBIEE, Better Decisions Faster with Advanced Analytics
 
Thoughts on a research platform architecture: Simplify your application portf...
Thoughts on a research platform architecture: Simplify your application portf...Thoughts on a research platform architecture: Simplify your application portf...
Thoughts on a research platform architecture: Simplify your application portf...
 
Why OEE?
Why OEE?Why OEE?
Why OEE?
 

Último

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Último (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Oracle I/O Supply and Demand

  • 1. Oracle I/O: Supply and Demand Bob Sneed (Bob.Sneed@Sun.Com) SMI Performance & Availability Engineering (PAE) SUPerG @ Amsterdam, October, 2001 Rev. 1.1 Abstract Storage performance topics can be meaningfully Operating Environment (OE) and underlying abstracted in terms of the economic notions of storage equipment. "supply" and "demand" factors. The analogy While Economics has created a wide variety of can be extended into characterizing I/O modeling strategies and computational problems as "shortages", and to the notion of techniques, the intention here is to steer away addressing shortages with both supply−side from formalism and merely borrow some and demand−side strategies. economic terminology at an abstract conceptual While there is often much that can be done with level. When I/O is involved in a problem at all, SQL tuning and schema design to get a given it may be possible to improve results with task done with the minimal amount of I/O, this strategies and tactics of either decreasing I/O paper focuses primarily on platform−level demand or increasing I/O supply. configuration and tuning factors. A wide After a few introductory metaphors, we will variety of such factors are discussed, including present several key Oracle demand topics, then disk layouts, filesystem options, and key Oracle survey many supply factors in the I/O stack of instance parameters. the Solaris OE − from the bottom up. Practical advice is offered on configuration strategies, tuning techniques, and measurement 1. Economic Metaphors challenges. Much of the language of Economics is easily Introduction adaptable to discussion of I/O topics. The field of Economics offers a wealth of − Equilibrium opportunities for phrasing metaphors and In Economics, the intersection of supply and analogies relevant to Oracle database I/O. demand curves indicates a market equilibrium While it is easy to get carried away with point. With I/O, statistics observed at the disk excessive creativity along these lines, the basic driver level represent an empirical equilibrium concept of distinguishing supply issues from of sorts based on the interaction of numerous demand issues can actually be quite useful for supply and demand factors. While a great deal explaining many database I/O options and of constructive tuning can be accomplished by tradeoffs. This frame of reference is no studying iostat data, these numbers do not substitute for proven analysis and tuning really illuminate demand characteristics much methodologies, but it is hoped that it will prove at all past the point of observing channel or disk useful in the various processes of architecting, utilization or saturation. configuring, and tuning Oracle to the Solaris™ When business throughput is meeting requirements and expectations, equilibrium 1
  • 2. measurements are interesting for capacity cache hits, once again using a unique planning and for spotting operational mechanism. anomalies. However, when business There exists no comprehensive instrumentation throughput is not satisfactory, equilibrium which spans all layers of the I/O stack. Some measurements may not help clearly identify interesting statistics, such as range of variance options for improvement. at various levels, are not generally observable. − Measuring Supply & Demand There is certainly room for progress in instrumenting the I/O stack to get a clearer In Economics, demand is said to vary with picture of what’s going on. price, and it is usually very difficult to characterize without conducting experiments. Oracle’s wait event statistics are the principal In this discussion, the notion of demand is being tool for observing what Oracle is waiting for, used absent the notion of economic cost, but the and how much stands to be gained by difficulty of characterization is similar to the improvements in specific areas. Guidance in economic case. interpreting these statistics is offered in several texts, including [YAPP Method, 1999] and With I/O, the term I/O operations Per Second [ORA_9i_PER]. While these tools and (IOPS) is frequently employed to characterize techniques undoubtedly provide the best both empirical and theoretical throughput holistic approach to Oracle I/O optimization, levels. This statistic is often of limited utility, as close cooperation between DBA’s, System theoretical and marketing IOPS numbers are Administrators and Storage Administrators is often difficult to match in practice, and the required. It is hoped that perspectives offered circumstances under which they occur is not here will be useful in the furtherance of such cited with any consistency. collaborations. Oracle STATSPACK1 reports are most useful for − Quotas & Tariffs determining actual I/O statistics per file since its measurements incorporate queuing delays The I/O stack contains a wide assortment of and throttles from the filesystem and volume throttle points and fixed limits. Knowing where manager layers, as well as possible read hits these are is the key to working with them or from the OS page cache. In contrast, iostat or working around them, just as everyone looks Trace Normal Form2 (TNF) data will only show forward to shopping at duty−free stores. what passes to the physical I/O layer of the OS. − I/O Surplus Oracle’s measurements are per file, while iostat measurements are per device. In Economics, surpluses are associated with Untangling the mapping of files to volumes, lower prices and modern phenomena like channels, and devices can be a laborious task. It inventory write offs. With I/O, it would appear may add to the confusion that an Oracle that the main consequence of excess supply is PHYSICAL I/O is a LOGICAL I/O to the OS, wasted capital. and thus may reflect cache hits in the OS page In practical terms, an excess I/O supply may cache. Indeed, to the degree that Oracle’s I/O have tangible advantages, such as allowing statistics do not correlate with OS I/O statistics, certain application optimizations, or providing the OS page cache provides the obvious the comfort of ’headroom’ from a configuration explanation. management perspective. 
Certainly, surplus Veritas offer statistics at the Volume Manager supply is better than a shortage, and properly level for various objects, using its own means configured systems should possess some degree of data capture and presentation. With the of surplus for the sake of headroom alone. Veritas Filesystem’s Quick I/O (QIO) option, a − I/O Shortage means is provided to observe per file OS page In Economics, the consequences of shortages include higher prices, longer lines, and maybe 1 The ancestor of STATSPACK prior to Oracle 8 is even famine or despair. With I/O, the simple BSTAT/ESTAT reporting. analogue to higher prices is increased latency 2 See prex(1) et al. 2
  • 3. due to queuing, and famine and despair might some subsequent points of discussion will be terms with which a user could relate when include their bearing on ORA−27062 exposure. system response is especially poor. The only effect one normally expects from an 2. Oracle I/O Demand I/O shortage is degraded response and longer Many design and tuning techniques are aimed run times. However, worse things are possible. at reducing I/O demands. Conventional With Oracle, for example, when using wisdom says that 80% of the gains possible for Asynchronous I/O (AIO), any request noted to an application lie in these areas. This take longer than 10 minutes usually3 results in a conventional wisdom often leads to neglect of fatal ORA−27062 incident. Failure to update a many important factors at the Oracle and control file can cause a fatal error after 15 platform levels. The discussion offered here minutes. largely avoids tuning topics found in many It is rare that timeouts like these occur purely books, such as database schema design, due to I/O shortages. They are normally application coding techniques, data indexing, expected only in the event of hardware failures. and SQL tuning. However, it would appear that the vast majority of Oracle timeout events actually result Oracle Demand Characteristics with no hardware error present, and usually at load levels where one would not predict an I/O Different Oracle operations have different I/O could possibly take 10 minutes. demand characteristics, and understanding these is important to formulating useful − Supply Chain Failures strategies and tactics for tuning. Some The I/O stack is like a supply chain, and vocabulary is introduced here for use in failures at lower levels cause a domino effect of subsequent discussion of controlling Oracle I/O problems through the higher layers. Here demand. again, there is room for improvement in instrumenting the I/O stack to simplify − Synchronous Writes isolation of problem areas. All Oracle writes to data files are data synchronous5 . That is, regardless of filesystem − Death and Taxes buffering, writes must be fully committed to Hardware failures and error recovery can persistent storage before Oracle considers them seriously skew performance, or even lead to complete. This synchronous completion criteria ORA−27062 timeout incidents. Measures taken is quite distinct and separate from whether or to insure against such failures, like RAID data not writes are managed as AIO requests. protection or Oracle log file multiplexing, can each take their toll − much like a tax or life − Latency, Bandwidth & IOPS insurance premium. Hardware failures and These three common and closely related compromises to insure against them are just as measures of I/O performance each have their certain as death and taxes. place. Certain bugs have been associated with ORA− The most latency sensitive I/O from an Oracle 27062 incidents, but perhaps the most common transaction perspective are physical I/O cause arises from AIO to filesystem files due to operations on the critical path to transaction an issue recently identified with the AIO library completion. This includes all necessary read and the timesharing (TS) scheduler class4 by operations and writes to rollback and log areas. Sun BugID 4468181. In the interest of In contrast, checkpoint writes are not latency minimizing deaths from ORA−27062 incidents, sensitive per se, but merely need to keep pace with the aggregate rate of Oracle demand. 
3 As of Oracle 8.1.7.2, or with patches to selected 5 prior releases, the message "aiowait timed out" may This is due to Oracle’s use of the O_DSYNC occur 100 times before becoming fatal condition. flag in its open(2) operations, but it is inherent with 4 See priocntl(1) and priocntl(2). RAW disk access. 3
  • 4. Decision Support Systems (DSS) and Data The art of database layout includes application Warehouse (DW) loads typically depend a great of principles such as I/O segregation, I/O deal on pure bandwidth maximization. Also, interleaving, and radial optimization. Layout operations like large sorts can depend heavily decisions can be significantly constrained by on effective bandwidth exploitation. In available hardware and project budgets. contrast, Online Transaction Processing (OLTP) Disoptimal layout can decrease I/O supply by workload performance frequently hinges more causing disk accesses to compete excessively on random IOPS. among themselves; by disoptimal use of − Aggregate Demand available cache; or by placing bottlenecks in the critical path of temporally dependent Different programming and tuning factors can operations. significantly impact the total amount of I/O demanded to execute a given set of business − Localities of Reference tasks. The sum total of I/O required can The concept of localities of reference is that certain significantly impact run times and equipment areas of a larger set of data may be the focus of utilization. frequent accesses. For example, certain indexes − Temporal Dependency might be traversed quite frequently, yet represent a small subset of the total data. Some things have to happen before other things can happen. This simple notion of time ordered Access within good localities of reference stands dependencies is all that temporal dependency to gain from optimal cache retention strategies. means. For example, log file writes must Cache efficiency will vary depending on how complete before an INSERT or UPDATE data retention strategies align with actual transaction can complete, and before the localities of reference. Opportunities for corresponding checkpoint writes can be issued. caching occur in the Oracle SGA, the OS page cache, and external storage caches. Each offers − Temporal Distribution different controls over retention strategies, and In this context, the term temporal distribution these may be configurable to achieve means "distribution of demand over time". The compounded beneficial effects. terms "bursty" and "sustained" are common characterizations of temporal distribution. In Controlling Oracle I/O Demand some cases, tuning to distribute "bursty" demand over time will result in improved We’ll skip the obvious topics of proper database service times and throughput. design and index utilization, and focus on the areas typically controlled by an Oracle DBA. As an analogy, consider the arrival of a bus at a Topics discussed here represent several of the fast food store. The average service time for I/O issues customers commonly wrestle with each customer will be seriously impacted by the Oracle, and a few of the "big knobs" used to length of the line. If the same customers control these factors. Some of these topics strolled in individually over a period of time, pertain to the "Top Ten Mistakes Made by they would likely enjoy a better average service Oracle Users" presented in [ORA_9i_TUN]. time. With demand evenly distributed over time, servers might keep more consistently busy − The Critical Path delivering good average service times, rather As mentioned earlier, some I/O operations, than idly waiting for the next bus. such as log writes and rollback I/O, are on the − Physical Distribution critical path to transaction completion. 
Others, such as log archive writes and checkpoint The mapping of Oracle data files across writes, are not, but must keep pace with available channels and storage can be extremely sustained demands. This difference should be important to database performance. The kept in mind during physical database design. diagnosis and correction of "hot spots" or "skew" is a common activity for DBA’s, system For example, the most demanding administrators, and storage administrators. INSERT/UPDATE workloads may require dedicated disks or low latency cached 4
  • 5. subsystems to allow the online logs to keep pace − Concurrency with transactional demand. I/O concurrency with Oracle can arise from − Reduce, Reuse, Recycle several sources, including: This modern ecological mantra succinctly The client population. encapsulates a great deal of the philosophy of Use of PARALLEL features. I/O tuning and cache management. Use of log writer or DB writer I/O slaves . Reduce − the best I/O is one that never Use of AIO. occurs. Many programming and tuning Multiple DB writer processes. techniques have this as the underlying Of these, the most common factors impacting strategy. demand for write concurrency are AIO and the Reuse − this includes a broad variety of use of multiple DB writers. Simply put, AIO topics including pinning objects with allows greater I/O demand creation by a DBMS_SHARED_POOL.KEEP, using a smaller number of Oracle processes with the separate KEEP pool, and exploiting least overhead. Alternate schemes of achieving filesystem buffering options. concurrency, such as using Oracle I/O slaves, are in a different league from the I/O demand Recycle − this is the common concept facilitated by AIO. underlying priority_paging (prior to OE 8) and use of a separate RECYCLE pool in Prior to Oracle 8, AIO and multiple DB writers Oracle. were mutually exclusive. With earlier versions of Oracle 8, AIO was not used. There have been − System Global Area (SGA) Size various issues with AIO over the years which Historically, 32−bit Oracle has allowed SGA have led many users to disable AIO, often sizes up to about 3.5 GB in the Solaris OE. 64− without firm reasoning. Over time, many users bit Oracle brings the possibility of extremely have increased DB writers to its maximum large SGA’s. Indeed, expanded SGA capability setting of 10 to achieve write concurrency is the single distinguishing feature of 64−bit without AIO. Some problems have arisen from Oracle. users re−enabling AIO with maximal DB writer settings already in place. Enlarging db_block_buffers is the principle means of exploiting large system memory. In theory, each DB writer can issue up to 4096 Proper sizing and use of the second largest I/O requests. In practice, we often see DB component of the SGA, the shared pool, is key writer processes queuing about 200 requests to performance for many workloads. each. In either case, these are large numbers relative to the ability of most systems to At any rate, SGA size must not exceed a concurrently service requests. Pushing an I/O reasonable proportion of system memory in backlog from Oracle to the OS increases relation to other uses. There are tradeoffs to be queuing and correlates with increased rates of made regarding the comparative advantages of ORA−27062 incidents. using memory for SGA versus OS page cache, especially in consolidation scenarios. Therefore, in short, when using AIO, Regardless of how employed, system memory db_writer_processes should not be provides the potential for reducing physical I/O increased beyond one (1) without demonstrable demand by servicing logical reads from business benefit, and tuning upwards from one memory. should be incremental. If AIO is switched off for any reason, increasing DB writers is usually Incidentally, Oracle 9i now allows dynamic required to attain adequate write concurrency. online resizing of the SGA in conjunction with the new Dynamic Intimate Shared Memory − Checkpoints & Log File Sizing (DISM) feature of Solaris OE 8. 
In the event of an abnormal shutdown, recovery time will vary in proportion to the number of dirty database blocks in the SGA. The longer a data block is retained in the SGA prior to 5
  • 6. checkpointing, the more chance it has of 27062 incidents arising from intense write absorbing INSERT or UPDATE data, and the contention for these areas. Also, greater the efficiency of the checkpoint write recommendations regarding use of the when it finally occurs. Deferral of checkpoint TEMPORARY tablespace attribute for TEMP writes using large logs leads to bursty areas may vary between Oracle releases. checkpoint activity, which in turn makes log There are many other parameters pertaining to archiving a relatively more bursty process, and sort and hash join optimization, and [Alomari, further implies that log shipping strategies will 2001] gives a good discussion of these. be very bursty. Thus, both the temporal distribution of − Large I/O checkpoint writes and aggregate writes over The Solaris OE default maximum low−level I/O time depend on the size of the logs and the size is 128 KB6, but this can easily be tuned setting of various parameters which control upwards to 1 MB in OE 2.6, or as high as 8 MB checkpointing. Tuning these factors represents in OE 8. Regardless of kernel settings, any tradeoff between performance and recovery process can issue large I/O requests, but large time. requests may be broken up into smaller This has been an area of significant requests depending upon kernel settings and development in Oracle. Historically, Oracle volume management parameters. Regardless of parameters log_checkpoint_interval and whether I/O requests are broken up at the OS log_checkpoint_timeout have impacted level, using fewer larger operations for this balance. Oracle 8i variously offered sustained sequential I/O is typically db_block_max_dirty_target and advantageous due solely to decreased overhead fast_start_io_target in this arena. from process context switches. Oracle 9i goes a step further in offering Factors effecting the supply aspects of Large fast_start_mttr_target and related I/O do not fit neatly in the subsequent parameters to explicitly control the time discussion of I/O supply topics, so we will expected for recovery. Regardless of the mention them briefly here. In short, supply− method of control, the tradeoffs remain side optimization of I/O requests involves essentially the same. many factors, including disk layout, volume One can observe checkpoint activity in the management parameters, and the kernel’s Oracle alert file by setting maxphys setting. Filesystem factors such as log_checkpoints_to_alert=TRUE, but clustering, the UFS maxcontig setting, or data interpretation of the resulting messages varies prefetching logic can also have significant between Oracle releases. impact. The topic of bandwidth optimization is the focus of many I/O whitepapers. − Sorting and Hash Joins In the case of sustained reading, large reads Disk sorts are best avoided or optimized. Hash might not always represent an ideal solution. joins have a great deal in common with sorts, Factors such as pre−fetching at the storage but here we will comment only on sorting. subsystem and filesystem levels, competition for resources, and ’think time’ per unit data The obvious and traditional means of reducing may result in optimal throughput with disk sorts is to set large values for intermediate read sizes. In other words, sort_area_size. Many users are reticent to effective pipelining can conceivably reduce do this, however, fearing that aggregate ’dead time’ that can occur while waiting for memory demand from clients will be excessive. large read requests to complete. 
In general, However, only clients that do large sorts will however, use of large I/O is often key to demand a maximal sort_area_size from the attaining maximum sequential throughput. OS, so this prospect should not be feared as much as it seems to be. Most I/O demand from Oracle occurs in db_block_size units, and large I/O is used It is worth noting here that since client shadow processes can write directly into TEMP areas, 6 Controlled by maxphys in /etc/system, there is some potential for client−side ORA− which defaults to 131072. 6
  • 7. only in limited circumstances. Parameters from the OS page cache, the data must be re− effecting large I/O have been the subject of validated, thus creating some feature bias in much revision between Oracle releases, but favor of a larger SGA and unbuffered filesystem some examples include: I/O. Full table or index scan operations − reads While this feature was default OFF in earlier are (db_file_multiblock_read_count * Oracle releases, db_block_checksum is TRUE db_block_size) bytes. by default in Oracle 9i. Disk sorts − sort_write_buffer_size − Backup Schemes bytes in Oracle 7, dynamically sized in Oracle 8i. Time spent in HOT BACKUP impacts aggregate demand from log and log archive writes, Datafile creation − writes are issued in because entire DB blocks must be logged for ccf_io_size bytes in Oracle 7, or tablespaces that are in BACKUP mode. db_file_direct_io_count (DB blocks) in Oracle 8. Data snapshot technologies, such as Sun StorEdge™ Instant Image, Veritas Volume Some Oracle versions will revert the setting of Replicator (VVR), or Sun StorEdge™ 9900 large I/O parameters to 128 KB defaults ShadowImage may be directed at reducing this without warning. The best way to confirm the impact. actual I/O size being used by an Oracle process is to use the Solaris truss(1) command. 3. I/O Supply Topics Setting large values of Oracle’s db_file_multiblock_read_count has the This section is an annotated survey of various side effect of biasing the optimizer’s preference I/O supply factors in the I/O stack from the towards using full scans. Because of this, there bottom−up. High impact topics spring up at all may be tradeoffs between different queries. layers of the stack. Much as a chain is only as One workaround to this is to implement large strong as its weakest link, I/O supply can only reads at the session level (ie: ALTER SESSION) be as good as the worst bottleneck will allow. so that the impact is not global. Supply Characteristics Memory constrained systems may suffer from having excessive amounts of memory locked Some additional terminology is useful for down due to active large I/O operations. For discussing I/O supply factors. these reasons, the desirability of exploiting large I/O will vary. − Latency vs. Bandwidth As a guideline, DSS and DW systems will tend Defined simply, latency is the time required for to enjoy large reads, while OLTP systems may a single operation to occur, while bandwidth not. Using large reads in conjunction with wide characterizes the realized rate of data transfer. disk striping is a common formula for Both latency and bandwidth will vary in part optimizing sequential performance. Tuning with the size of the requested operations. read size to exploit filesystem and hardware pre−fetch characteristics is another means of In latency terms, there are three main improving realized bandwidth. approaches to I/O performance issues: 1. Reduce baseline latency, such as by adding − Data Checksumming cache, improving cache retention strategies, Oracle allows round−trip data integrity or by a baseline technology upgrade. validation by generating checksum data for 2. Improve work per operation, such as by each data block written, and validating the using a larger I/O size. checksum each time data is read. While the 3. 
Memory constrained systems may suffer from having excessive amounts of memory locked down due to active large I/O operations. For these reasons, the desirability of exploiting large I/O will vary.

As a guideline, DSS and DW systems will tend to enjoy large reads, while OLTP systems may not. Using large reads in conjunction with wide disk striping is a common formula for optimizing sequential performance. Tuning read size to exploit filesystem and hardware pre−fetch characteristics is another means of improving realized bandwidth.

− Data Checksumming

Oracle allows round−trip data integrity validation by generating checksum data for each data block written, and validating the checksum each time data is read. While the algorithm used is extremely efficient, it does impose some additional latency on every read and write, and therefore has a slight throttling effect on demand. Also, for each read satisfied from the OS page cache, the data must be re−validated, thus creating some feature bias in favor of a larger SGA and unbuffered filesystem I/O.

While this feature was default OFF in earlier Oracle releases, db_block_checksum is TRUE by default in Oracle 9i.

− Backup Schemes

Time spent in HOT BACKUP impacts aggregate demand from log and log archive writes, because entire DB blocks must be logged for tablespaces that are in BACKUP mode.

Data snapshot technologies, such as Sun StorEdge™ Instant Image, Veritas Volume Replicator (VVR), or Sun StorEdge™ 9900 ShadowImage, may be directed at reducing this impact.

3. I/O Supply Topics

This section is an annotated survey of various I/O supply factors in the I/O stack, from the bottom up. High impact topics spring up at all layers of the stack. Much as a chain is only as strong as its weakest link, I/O supply can only be as good as the worst bottleneck will allow.

Supply Characteristics

Some additional terminology is useful for discussing I/O supply factors.

− Latency vs. Bandwidth

Defined simply, latency is the time required for a single operation to occur, while bandwidth characterizes the realized rate of data transfer. Both latency and bandwidth will vary in part with the size of the requested operations.

In latency terms, there are three main approaches to I/O performance issues:
1. Reduce baseline latency, such as by adding cache, improving cache retention strategies, or by a baseline technology upgrade.
2. Improve work per operation, such as by using a larger I/O size.
3. Increase concurrency, such as by adding channels or disks, spreading demand across available resources, or placing more demands upon existing resources.
Opportunities for these manipulations may be presented at various layers of the I/O stack.

− Variance

In the case of a single simple disk, variance in service times is explained primarily by mechanical factors, such as rotational speed and seek time. In the absence of hardware failure, one could say that with such a simple system, the variance is actually bounded by these well−known mechanical factors. In real systems, sources of variance are diverse, and include:

• Process prioritization (both for LWP threads and simple UNIX processes).
• Geometry−based wait queue sorting.
• Cross−platform competition in SANs.
• Cache utilization.
• Bugs in the I/O stack.

Variance is infrequently measured and characterized, but no advanced math or statistics is needed to characterize the most basic measure of variance − range. It is one thing to observe I/O service times in the range of, say, 5 to 500 milliseconds, but quite another to observe variance on the range of 5 milliseconds to 5 minutes!

Concern over variance might be summarized like this:
1. Sources of variance should be identified and scrutinized.
2. Sources of unbounded variance are bad and beg to be eliminated.
3. Sources of bounded variance beg to be purposefully controlled.

I/O statistics are most often cited as averages and sums, and most folks expect variance to be 'reasonable' about the mean. Certainly, ORA−27062 incidents are categorically examples of when this fails to occur.

− Competition

I/O supply to Oracle is net of any competition. In other words, one can say that the supply of I/O to Oracle is diminished by competing I/O demands involving the same resources. Possible sources of competition are numerous, and some are easily overlooked. Consider these:

• Backup and recovery operations
• Volume reconstruction, mirror re−silvering
• Physical data reorganization
• Transient utility operations
• Other hosts in a SAN
• Data services (eg: volume copies)
• Committal of previously cached write data

Even when these activities occur 'off host', they can compete for head movements and decrease the net I/O supply to Oracle, both in terms of IOPS and latencies.

− Throttles

Anything that extends the code path for an I/O implicitly adds to I/O latency. Compared to other factors, marginal code path costs are relatively minor in practice. Some components in the I/O stack explicitly enforce certain limitations.

Supply Factors − Bottom Up

Each layer in the I/O supply chain can contribute significant complexity. This survey of the I/O stack discusses a limited set of issues at each layer.

− The Disk Spindle

The ultimate limiting factors to sustainable IOPS in a complex system are the number of disks and their operational characteristics. Effective cache utilization at the system and storage subsystem layers may increase the supply of logical IOPS, but when caches and buses are saturated, it is the humble disk spindle that sets the speed limit.

The primary disk characteristics are rotational speed and seek time. Disks vary in many other ways though, including number and type of access paths; quantity of onboard cache and intelligence of onboard cache management; use of tagged commands; and sophistication of error recovery logic. From time to time, significant disk performance issues may be discovered in disk firmware.
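A rough, purely hypothetical calculation illustrates why the spindle sets the speed limit. For a nominal 10,000 RPM disk with a 5 millisecond average seek time:

    average rotational latency  =  (60000 ms / 10000 rev) / 2  =  3 ms
    average random service time ~  5 ms seek + 3 ms rotation   ~  8 ms
    sustainable random IOPS     ~  1000 ms / 8 ms              ~  125 per spindle

Multiplying such a figure by the spindle count gives a crude ceiling on sustainable random IOPS. Caching can raise apparent throughput above that ceiling for a while, but not indefinitely.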
− Storage Subsystems

The range of features available from storage subsystems has mushroomed in recent years. The spectrum of subsystems ranges from the basic JBOD array (JBOD means 'Just a Bunch Of Disks', and has come to be common terminology in the industry) to sophisticated RAID subsystems and Storage Area Networks (SANs). Even humble JBOD arrays have developed sophisticated features such as environmental monitoring and fault isolation capabilities.

Intelligent arrays can take complexity off the host and facilitate new data management paradigms, such as off−host backups, contingency copies, and sophisticated disaster recovery strategies, in addition to providing basic RAID features. For this discussion of I/O supply, only a small subset of these features will be covered, including RAID, caching, and data services.

− Storage Subsystems: RAID

When implemented in an integrated subsystem, RAID features are variously called 'hardware RAID', 'box−based RAID', or 'external RAID'. In reviewing various RAID levels, an essential point to note is that 'striping' across disks is essential to achieving sustainable bandwidth greater than that of a single spindle. RAID 0 is merely striping across disks with no data protection features. RAID 5 and similar data schemes also involve striping across disks.

In the case of RAID 5, since parity must be generated and saved to disk, write activity will have the effect of causing additional mechanical activity that may compete with read requests. Sophisticated integrated storage subsystems, like the Sun StorEdge™ T3 or the Sun StorEdge™ 9900 series arrays, are designed to minimize this impact by intelligent I/O scheduling.

In the event of a complete disk failure in a RAID 5 set, all disks in the set must be read to reconstruct lost data, and that process can decrease the net supply of IOPS quite seriously. Once a replacement disk is installed, the process of rebuilding a failed spindle involves reading all of the surviving disks completely, and that can produce significant competition with user demand. RAID 5 implementations typically feature some means of tuning the rebuild rate to avoid unacceptable competition with live workloads. Some more sophisticated RAID 5 implementations, such as in the Sun StorEdge™ 9900 series arrays, may predict the likely failure of a particular disk, and evacuate its data far more efficiently than by RAID 5 reconstruction.

It is sometimes overlooked that in the absence of writes, RAID 5 read performance should be expected to be comparable with RAID 0 performance, since RAID 5 parity data is usually never referenced unless a disk fails (an exception would be 'parity checking' operations implemented in some subsystems).

− Storage Subsystems: Caching

Common to hardware RAID implementations is the use of persistent cache memory. Cache management in external RAID offers the possibility of enhancing I/O supply to attached systems, variously by caching writes or by facilitating pre−fetching or other intelligent data retention schemes.

When external cache is disabled or saturated, subsystem performance as seen by attached hosts will degrade. External cache will typically cease to be used for caching writes if its battery backup becomes dubious, but read caching is typically not impacted when cache persistence becomes questionable. Control over caching policies, such as pre−fetching or random read retention, varies tremendously between different subsystems, and is often exercised on a per−LUN basis. (A LUN, or Logical Unit Number, is the term most often used to designate a virtual disk managed by an intelligent storage subsystem.)

One simple fact that is often overlooked in evaluating empirical I/O statistics is that writes which appear very low−latency to a host are not truly complete until they are flushed from the external cache. This flushing activity can compete with new I/O requests for head movement and therefore increase latency, especially for new reads. Correlation of current I/O activity with recent writes may be difficult with host−based tools, but one should keep in mind that "what goes up, must come down".

It is widely considered that RAID 5 performs less well on writes than alternatives such as RAID 1. While this may be true for host−based RAID 5, it is not categorically true for hardware−based RAID solutions. So long as write demand remains at levels that do not saturate the cache, all RAID levels look essentially the same to attached hosts so far as writes are concerned. That is, writes to cache will be very low latency until and unless the cache becomes saturated.
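A small worked example (spindle count and per−disk rates are hypothetical) shows why the cache matters so much for RAID 5 writes. Assuming the classic four−operation read−modify−write sequence for small random writes that are not absorbed by cache:

    9 spindles x 125 IOPS each          ~ 1125 back-end IOPS
    pure random reads at the host:      ~ 1125 IOPS
    pure random writes at the host:     ~ 1125 / 4  ~  280 IOPS
        (each small write = read old data + read old parity
                          + write new data + write new parity)

So long as the write cache absorbs bursts, the host never sees this penalty; under sustained write demand, the back−end rates govern.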
− Storage Subsystems: Data Services

Persistent cache in external arrays can also be leveraged in implementing efficient box−based data services, such as Sun StorEdge™ 9900 ShadowImage and Sun StorEdge™ 9900 TrueCopy. As with host−based data services, these can contribute some competition and variance to the I/O stack. Any synchronous data forwarding product can inherit tremendous variance from the Quality of Service (QOS) of the underlying data link.

− Layout Strategies

Whether or not an intelligent storage subsystem is involved, placement of data across available disks can have significant impact on the actual supply of I/O. Regardless of how it comes to be, if two 'hot spots' come to reside at opposite ends of the same disk spindle, performance will suffer. With modern SANs, competition for head movement may even originate from different hosts.

The trend toward higher disk densities and lower IOPS per disk gigabyte has led to substantially different layout strategies from just a few short years ago, when use of many smaller spindles was the norm. Right or wrong, the most popular strategy a few years ago was to segregate I/O by class, whereas the more popular strategy now is biased far more towards interleaving I/O classes.

Oracle has promoted a "Stripe and Mirror Everything" or "SAME" approach, while other writers have presented similar concepts variously as "Wide−Thin Striping" or "Plaid" layout strategies. The common underlying notion is to distribute all (or most) Oracle I/O uniformly across a large spindle population, thereby distributing demand across available spindles and minimizing the likelihood of excessive queuing for any single disk. These strategies are often implemented using a combination of hardware−based RAID and host−based RAID (see the sketch at the end of this section).

Another consideration commonly made in layout strategies is the fact that the outer cylinders of disks contain more data (due to variable formatting known as Zone Bit Recording, or 'ZBR'), and therefore provide slightly better performance than the innermost cylinders. Exploitation of this characteristic is sometimes called 'radial optimization'.

In practice, disk layout is something of an art, though its underpinnings are all scientific. It can be helpful, if not essential, to construct drawings of available disks, channels, and their usage to be able to comprehend any given disk layout. Especially when a great deal of the complexity may be housed in an intelligent storage subsystem, effective visualization of the issues may otherwise be impossible.

− Disk Interconnect

The most conspicuous characteristics of a disk interconnect are its signaling rate and its protocol overhead. 100 MB/sec (1 gigabit per second) Fiber Channel Arbitrated Loop (FC/AL) interconnects are now the commonplace building blocks for modern SAN architectures. Speeds of 2 Gbit/sec and beyond are on the horizon, as are alternate protocols such as iSCSI and Infiniband.

Besides the obvious bandwidth limitations imposed by the native signaling rate, there are two aspects of FC/AL that deserve special mention. First, as a loop−based protocol, some scaling issues can arise if too many targets share the same loop. Second, due to the 'arbitrated loop' aspect of FC/AL, its performance can degrade rapidly with distance. For this reason, SAN switches employing more efficient Inter−Switch Link (ISL) protocols may be required to effectively extend FC/AL over distances.

Alternate interconnect paths can variously serve to add resilience to a system or be exploited for performance. Management of alternate paths is implemented in several software products above the interconnect layer. Multiplexed I/O (MPXIO) provides OS−level awareness of multiple paths as an option for the Solaris OE 8. Veritas Volume Manager provides a Dynamic Multipathing (DMP) capability. Alternate paths to a Sun StorEdge™ 9900 series array can be managed with Sun StorEdge™ 9900 Dynamic Link Manager. Each of these products has its own feature set and capabilities with respect to I/O supply and error handling.
− Interface Drivers

A step above the host bus adaptor (HBA) hardware is a device driver software stack. This stack begins with HBA−specific drivers (eg: esp, isp, fas, glm, pln, soc, socal), which are integrated using the Device Driver Interface (DDI) framework. Higher level functions are provided by target drivers like sd (for SCSI devices) and ssd (for FC/AL devices). Some key quotas and controls for disk I/O are implemented in the sd and ssd drivers.

Despite the close relationship between the sd and ssd drivers, their controls are distinct. Generally speaking, variables discussed here for the sd driver are the same in the ssd driver, but are named with ssd as the initial letters.

Error recovery parameters, such as sd_retry_count and sd_io_time, control the time required for the driver stack to return control to higher layers of the I/O stack in the event of hardware failures. By default, these allow some failures to take up to five minutes to be reported to upper layers. Varying these controls can be tricky business, as they also pertain to the resilience of systems during power up operations. This is a topic that begs for further development.

The maximum number of concurrent requests that can be passed to the HBA driver is controlled by sd_max_throttle. While certain HBA hardware or attached storage may be limited with respect to its capacity for concurrency, this setting has systemwide impact which can have adverse performance implications. By default, this value is 256, which maximally exploits SCSI tagged command queuing capabilities.

Several systemwide SCSI features can be controlled by the bitmask variable scsi_options, including tagged command queuing and control over the maximum transfer rate which can be negotiated for SCSI transfers. Control over specific interfaces and targets may be available at the HBA driver level, such as described in glm(7D). It would appear that the scsi_options parameter is frequently manipulated by third party storage vendors, perhaps often when it should not be. These controls should be manipulated only with a full understanding of the desired objective and the possibility of unintended consequences (an illustrative fragment follows below).

One aspect of the sd layer that is presently not adjustable is logic that continuously sorts the disk wait queue according to the geometry of the underlying disk. While this is quite useful with direct attach JBOD disks, the geometry of external disk subsystems presented to the OS is usually purely fictional. Write caching and I/O scheduling in external RAID subsystems make this sd feature of dubious value, and it could conceivably introduce undesirable variance to response times. The ability to control this may appear in a future release of the Solaris OE.
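Purely as an illustration (the values shown are placeholders, not recommendations; storage vendors often publish required settings for their products), such driver−level controls are typically set in /etc/system and take effect only at the next boot:

    * /etc/system fragment - illustrative values only.
    * Allow larger physical transfers (bytes):
    set maxphys=1048576
    * Per-target limits on requests outstanding to the HBA driver:
    set sd:sd_max_throttle=64
    set ssd:ssd_max_throttle=64

As noted above, edits of this kind deserve careful change control and testing.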
− In−host Write Caching

The Sun StorEdge™ Fast Write Cache (FWC) (see [Sneed, 2001]) provides an in−host redundant and persistent write cache which can greatly reduce the write latency seen by Oracle. While FWC imposes a throttle on bandwidth and is incompatible with cluster environments, its reduction in latency can be quite beneficial to processes which depend on serially dependent writes on the critical path to transaction completion, such as Oracle's log writer.

Another means of in−host write caching is use of cached RAID interface cards, such as the Sun StorEdge™ SRC/P Intelligent SCSI RAID Controller System.

− Host−based Data Services

A wide variety of host−based data services are available, including Sun StorEdge™ Instant Image (II), Sun StorEdge™ Network Data Replicator (SNDR), Veritas Volume Replicator (VVR), and an assortment of filesystem snapshot facilities. These categorically impact aggregate physical I/O demand, and generally introduce some variance, throttling, and temporal dependencies in maintaining data structures such as bitmaps.

As with their externally−based counterparts, synchronous remote replication products can introduce significant or pathological throttling and variance depending on the Quality of Service (QOS) of the network interconnects. Nevertheless, as a class, these products effectively facilitate a wide range of cost effective business solutions. System designs should incorporate appropriate considerations to assure success with these products. For example, cached and/or dedicated storage might be allocated for bitmap data structures.

− Volume Management

Volume managers, such as the Veritas Volume Manager (VxVM) or Solstice DiskSuite™ (SDS), provide myriad techniques for manipulating I/O supply. The code path overhead of volume management is usually considered to be so slight as to be insignificant, but several topics in volume management can have extreme significance. Volume managers provide host−based RAID capabilities, which is where many layout decisions are typically implemented.

Host−based RAID 0, RAID 1, and hybrids of the two actually impose very little overhead relative to the native capabilities of the attached storage. Host−based RAID 5, which is implemented at the volume manager layer, is entirely another matter. Since host memory is not persistent like the caches of hardware RAID subsystems, RAID 5 parity data must be flushed along with each synchronous write in realtime, and this is typically disastrous relative to database write requirements. On the other hand, some very large DSS databases might rationally employ economical host−based RAID 5 for predominantly read−oriented tasks where the write penalty is not too severe with respect to the rate of data updating. Host−based RAID 5 exhibits very good read bandwidth.

Due to the popularity of VxVM on Sun™ systems, VxVM issues tend to have a high rate of impact when they occur. Therefore, a few details specific to VxVM are offered here.

With mirrored volumes implemented in VxVM, recovery time after an abnormal system shutdown can be greatly reduced by implementing Dirty Region Logging (DRLs). It is generally recommended to group all DRL areas for a system on a few disks dedicated to that purpose. When properly implemented, DRL maintenance overhead is often characterized as reducing write supply by about 10%. Without DRLs, recovery after a crash can require complete resilvering of disrupted mirrors.

One VxVM issue to which ORA−27062 incidents were explicitly tied pertained to systems using Dirty Region Logging with mirrored volumes under heavy load (see Bug Report #4333530, fixed at VxVM 3.1 and by current patch to version 3.0.4). An I/O could get 'hung' awaiting a DRL update, thus leading to extreme variance.

Another recent issue was noted in VxVM 3.0.1, where the default read policy for mirrored data was changed from a prior default that tended to favor one side of the mirror for a series of sustained reads (see Bug Report #4255085, fixed at VxVM 3.1). Under the prior default with JBOD arrays, sustained reading would benefit from revisiting the spindle cache of the chosen disk. This issue surfaced in terms of reduced supply of sequential performance.

− Filesystem Cache

Simply put, filesystem caching is bad for synchronous writes, but good for reads that are satisfied from it. That being said, it may not be simple to determine the optimal filesystem caching scheme for a given Oracle workload. Filesystem cache performance and its contribution to Oracle performance can vary greatly depending upon the caching options used and the amount of memory available. With 32−bit Oracle, use of the filesystem cache is a principal means of exploiting large server memories.

In the Solaris OE, the filesystem cache is not a fixed−size area, but rather lives in the OS page cache, so the terms are used interchangeably in this discussion. See [Mauro & McDougall, 2001] for more information on Solaris memory management.

The 'extra copy penalty' often cited against using filesystem cache is not highly significant in the overall scheme of things. Copying data in memory is an operation at which modern Sun™ systems are particularly adept.

− Filesystem Block Size

Since Oracle I/O is typically in db_block_size increments, and 8 KB blocks or larger are becoming the norm, it is usually advised to use a VxFS filesystem block size of 8 KB. Smaller block sizes will decrease supply and add CPU cost when used on Oracle data filesystems that never see I/O requests smaller than 8 KB.

By default, VxFS selects the filesystem block size at the time of filesystem creation based on filesystem size, without regard for its intended use. Therefore, one must explicitly specify a block size option to mkfs (see mkfs_vxfs(1M)) to assure consistent results. On UFS, 8 KB is the only block size permitted on modern Sun™ systems.
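For VxFS, then, the block size should be pinned explicitly at creation time rather than left to the size−based default. A sketch (device path illustrative):

    # Force an 8 KB VxFS filesystem block size; see mkfs_vxfs(1M):
    mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/oradg/oradata01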
− Filesystem Throttles

UFS has tunable thresholds at which it will suspend and resume processes with uncommitted deferred write data, and thus prevent a single process from overrunning the page cache or using an inequitable share. A 'high−water mark' (UFS:UFS_HW) and 'low−water mark' (UFS:UFS_LW) implement this throttle.

A similar throttle is introduced with VxFS Version 3.4. Absent such a throttle, the importance of aggressive Virtual Memory Management with deferred buffered I/O was magnified (see [Sneed, 2000]).

The mere presence of a filesystem code path has some I/O throttling effect relative to RAW disk, but this turns out to be a very minor factor in most cases.

− Filesystem Single−Writer Lock

Perhaps the single most significant factor that limits Oracle write throughput with filesystems is the "single writer lock" constraint required by POSIX standards to assure proper ordering of synchronous writes. Increased write concurrency at the Oracle level is for naught when the filesystem serializes the writes. The impact of the single writer lock is reduced with externally cached storage, which provides low write latency and correspondingly lowered lock dwell times.

Since Oracle handles its own write ordering considerations, the single writer lock common to standards−conforming filesystems is both unnecessary and injurious to Oracle I/O performance. There are several ways to avoid the single writer lock altogether, including:

• Using RAW disk volumes, LUNs, or slices.
• Using VxFS Quick I/O (QIO).
• Using the UFS concurrent forced direct I/O feature (forcedirectio) available in Solaris OE 8, Update 3 or later (see the example below).
• Using the Sun™ QFS filesystem Q−Write feature.

For users on filesystems, implementing VxFS QIO, UFS forcedirectio, or Sun™ QFS Q−Write requires no movement of data. In contrast, conversion from filesystem to RAW would require completely offloading and reloading the data, as would conversion between filesystems (in some cases, tools may exist to allow in−situ conversion of filesystems).

Other points to ponder about these options include:

• QIO and RAW share the advantage of exploiting the Kernel Asynchronous I/O (KAIO) code path, which is capable of supplying more IOPS than the LWP AIO code path.
• QIO offers the ability of per−file control of filesystem caching ('Cached QIO') and the ability to observe read hit rates in the OS page cache, while UFS forcedirectio categorically disables OS buffering.
• QIO introduces some operational complexity relative to UFS forcedirectio.
• QIO is a separately licensed feature, while UFS forcedirectio comes with the Solaris OE.

Avoidance of the single writer lock and use of the KAIO code path are known techniques for reducing the probability of ORA−27062 incidents. Each option for doing this has its own pros and cons.
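As a sketch of the UFS option noted above (device and mount point are illustrative), forced direct I/O is enabled at mount time, or persistently via /etc/vfstab:

    # Mount with forced direct I/O:
    mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /u02/oradata

    # Equivalent /etc/vfstab entry:
    # /dev/dsk/c1t0d0s6 /dev/rdsk/c1t0d0s6 /u02/oradata ufs 2 yes forcedirectio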
− Asynchronous I/O (AIO)

The Solaris OE offers two alternate AIO Application Programming Interfaces (APIs). One conforms to the POSIX Real Time (RT) specifications (eg: aio_read(3RT), aio_suspend(3RT), et al) implemented in librt(3LIB), and the other conforms to SUNW private specifications (eg: aioread(3aio), aiowait(3aio), et al) implemented in libaio(3LIB). Oracle uses the SUNW API.

The SUNW implementation in libaio includes a hard−coded throttle called _max_workers which limits the number of active requests per process. This value is set to 256 (it was 50 prior to Solaris OE 2.6). A process can queue requests past this limit subject to available memory, but queuing far past this limit offers no tangible benefit.

For more information on SUNW AIO implementation and usage, see [Mauro & McDougall, 2001] and [McDougall & Gummuluru, 2001].

Kernel Asynchronous I/O (KAIO), which is used on RAW and VxFS QIO files, is often cited as yielding 10−12% gains on extremely write−intensive benchmark workloads. However, for many workloads its impact may be negligible, especially considering that many database workloads are either extremely read biased or more constrained by bandwidth than by write concurrency.

As mentioned earlier, it is not advisable to use too many DB writer processes when AIO is used.

− Process Scheduling

Last but not least, Oracle processes cannot generate I/O demand unless they are in memory, ready to run, and not deprived of needed CPU and memory resources. Aside from the topics of having the CPU and memory appropriately sized for the workload, this relates to several system management topics, including Virtual Memory Management (VMM) configuration (see [Sneed, 2000] for more information on VMM tuning and Intimate Shared Memory (ISM) usage with Oracle) and manipulation of process priorities.

Process priority manipulation is a rather complex topic in the Solaris OE, and a wide assortment of tools and techniques address this topic. dispadmin(1M) and related topics from the 'SEE ALSO' section of the dispadmin man page are the main tools for manipulating priority schemes. Other facilities include the venerable nice(1) command; processor sets (psrset(1M)); and the separately licensed Solaris Resource Manager™.

It is noteworthy that each AIO request to filesystem files involves a lightweight process (also called an 'LWP' or 'thread'), and that these threads have their priorities shuffled in the same manner as other threads in the TS scheduler class. This can give rise to variance in response times seen by Oracle, and with extreme loads, this variance can be pathological. Applied incorrectly, priority manipulations can lead to seriously adverse outcomes.
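For illustration only (the CPU IDs and process ID are hypothetical, and as cautioned above, such manipulations deserve great care), the facilities just mentioned might be exercised as follows:

    # List the scheduling classes configured on this system:
    dispadmin -l

    # Create a processor set from CPUs 12 and 13 (psrset prints
    # the new set ID; assume it is 1), then bind a process to it:
    psrset -c 12 13
    psrset -b 1 9876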
Conclusion

Hopefully, some of the perspective and details offered here will be found to be useful in applied terms.

Each layer of the I/O stack can be meaningfully discussed in terms of its impact on the supply of I/O to the layers above it. I/O demand results in an equilibrium that depends on numerous supply factors. This perspective can be useful in comprehending the complexity of the I/O stack, and in understanding I/O options and tradeoffs.

References

Oracle publications are generally available online via the Oracle Technology Network (OTN) Web site at http://otn.oracle.com. Membership is free.

[ORA_9i_PER] − "Oracle 9i Database Performance Guide and Reference", Oracle Corporation, Part No. A87503−02, 2001.

[ORA_9i_TUN] − "Oracle 9i Database Performance Methods", Oracle Corporation, Part No. A87504−02, 2001.

[Alomari, 2000] − "Oracle 8i and UNIX Performance Tuning", Ahmed Alomari, Prentice Hall, September 2000, ISBN 0130187062.

[Mauro & McDougall, 2001] − "Solaris Internals", Jim Mauro & Richard McDougall, Prentice Hall, 2001, ISBN 0−13−022496−0.

[YAPP Method, 1999] − "Yet Another Performance Profiling Method (YAPP Method)", a whitepaper by Anjo Kolk, Shari Yamaguchi & Jim Viscusi of Oracle Corporation, 1999.
[McDougall & Gummuluru, 2001] − "Oracle Filesystem Integration and Performance", a whitepaper by Richard McDougall and Sriram Gummuluru of Sun Microsystems, January 2001.

[Sneed, 2001] − "Sun StorEdge™ Fast Write Cache Application Notes", a whitepaper presented at the April, 2001 Sun Users and Performance Group (SUPerG) conference.

[Sneed, 2000] − "Sun/Oracle Best Practices", a Sun Blueprints™ Online article available at http://www.sun.com/blueprints/0101/SunOracle.pdf.

Acknowledgments

Special thanks to Geetha Rao (Veritas), Vance Ray (Veritas), Yuriy Granat (Oracle), and Jamie Vuong (Oracle) from the Veritas/Oracle/Sun Joint Escalation Center (VOS JEC) for all their contributions. Thanks to Jim Viscusi (Oracle), Rey Perez (Sun), and Elizabeth Purcell (Sun) for their review and feedback. Thanks also to Janet, my wife and volunteer editor. This document was created using StarOffice™ software.

© 2001 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Solaris, Solaris Resource Manager, StarOffice, Sun Blueprints, Sun QFS, Sun StorEdge, Sun StorEdge Instant Image, Sun StorEdge Network Data Replicator, Sun StorEdge T3, Sun StorEdge 9900, Sun StorEdge 9900 ShadowImage, Sun StorEdge 9900 TrueCopy, Sun StorEdge Fast Write Cache, Sun StorEdge SRC/P Intelligent SCSI RAID Controller System, and Solstice DiskSuite are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States of America. All SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.