Causes of Downtime and Their Solutions

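Oracle Database is designed to address the causes of both planned and unplanned downtime.

The following is an excerpt from the Oracle manual that summarizes the various causes of system downtime and the corresponding solutions.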

 

1. Causes of Downtime

 

Category: Unplanned

Outage type: Computer failure
Description: A computer failure outage occurs when the system running the database becomes unavailable because it has shut down or is no longer accessible.
Examples:
  • Database system hardware failure
  • Operating system failure
  • Oracle instance failure
  • Network interface failure

Outage type: Storage failure
Description: A storage failure outage occurs when the storage holding some or all of the database contents becomes unavailable because it has shut down or is no longer accessible.
Examples:
  • Disk drive failure
  • Disk controller failure
  • Storage array failure

Outage type: Human error
Description: A human error outage occurs when unintentional or malicious actions are committed that cause data within the database to become logically corrupt or unusable. The service level impact of a human error outage can vary significantly depending on the amount and critical nature of the affected data.
Examples:
  • Dropped database object
  • Inadvertent data changes
  • Malicious data changes

Outage type: Data corruption
Description: A data corruption outage occurs when a hardware or software component causes corrupt data to be read or written to the database. The service level impact of a data corruption outage may vary, from a small portion of the database (down to a single database block) to a large portion of the database (making it essentially unusable).
Examples:
  • Operating system or storage device driver, host bus adapter, disk controller, or volume manager error causing bad disk reads or writes
  • Stray writes by operating system or other application software

Outage type: Site failure
Description: A site failure outage occurs when an event causes all or a significant portion of an application to stop processing or slow to an unusable service level. A site failure may affect all processing at a data center, or a subset of applications supported by a data center.
Examples:
  • Extended site-wide power failure
  • Site-wide network failure
  • Natural disaster making a data center inoperable
  • Terrorist or malicious attack on operations or the site

Category: Planned

Outage type: System changes
Description: Planned system changes occur when performing routine and periodic maintenance operations and new deployments. Planned system changes include any scheduled changes to the operating environment that occur outside the organizational data structure within the database. The service level impact of a planned system change varies significantly depending on the nature and scope of the planned outage, the testing and validation efforts made prior to implementing the change, and the technologies and features in place to minimize the impact.
Examples:
  • Adding/removing processors to/from an SMP server
  • Adding/removing nodes to/from a cluster
  • Adding/removing disk drives or storage arrays
  • Changing configuration parameters
  • Upgrading/patching system hardware and software
  • Upgrading/patching Oracle software
  • Upgrading/patching application software
  • System platform migration
  • Database relocation

Outage type: Data changes
Description: Planned data changes occur when there are changes to the logical structure or physical organization of Oracle database objects. The primary objective of these changes is to improve performance or manageability.
Examples:
  • Table definition changes
  • Adding table partitioning
  • Creating and rebuilding indexes

 

2. Oracle High Availability Solutions for Unplanned Downtime

 

| Outage Type | Oracle Solution | Benefits | Recovery Time |
|---|---|---|---|
| Computer failures | Fast-Start Fault Recovery | Tunable and predictable cache recovery | Minutes to hours (Footnote 1) |
| Computer failures | RAC | Automatic recovery of failed nodes and instances, fast connection failover, and service failover | No downtime (Footnote 2) |
| Computer failures | Data Guard | Fast Start Failover and fast connection failover | < 1 minute |
| Computer failures | Oracle Streams | Online replica database | No downtime (Footnote 2) |
| Storage failures | ASM | Mirroring and online automatic rebalance | No downtime |
| Storage failures | RMAN with flash recovery area | Fully managed database recovery and managed disk-based backups | Minutes to hours |
| Storage failures | Data Guard | Fast Start Failover and fast connection failover | < 1 minute |
| Storage failures | Oracle Streams | Online replica database | No downtime (Footnote 2) |
| Human errors | Oracle security features | Restrict user access as prevention | No downtime |
| Human errors | Oracle Flashback technology | Fine-grained and database-wide rewind capability | < 30 minutes (Footnote 3) |
| Human errors | LogMiner | Log analysis | Minutes to hours |
| Data corruptions | HARD | Corruption prevention within a storage array | No downtime |
| Data corruptions | RMAN with flash recovery area | Online block media recovery and managed disk-based backups | Minutes to hours |
| Data corruptions | Data Guard | Automatic validation of redo blocks before they are applied, and fast failover to an uncorrupted standby database | < 1 minute |
| Data corruptions | Oracle Streams | Online replica database | No downtime (Footnote 2) |
| Site failures | RMAN | Fully managed database recovery and integration with tape management vendors | Hours to days |
| Site failures | Data Guard | Fast Start Failover and fast connection failover | Seconds to 5 minutes (Footnote 4) |
| Site failures | Oracle Streams | Online replica database | Seconds to 5 minutes (Footnote 4) |

 

Footnote 1: Recovery time consists largely of the time it takes to restore the failed system.
Footnote 2: The database is still available, but the portion of the application connected to the failed system is affected.
Footnote 3: Recovery time for human errors depends primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, it typically requires only seconds to flash back the appropriate transactions. Longer detection time usually leads to a longer recovery time to repair the appropriate transactions. An exception is undropping a table, which is effectively instantaneous regardless of detection time.
Footnote 4: The recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.
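
To make the table easier to query, here is a minimal sketch that encodes it as a Python dictionary. The dictionary layout and the function names (`solutions_for`, `no_downtime_solutions`) are assumptions made for illustration; only the solution names and recovery times come from the table. Footnote qualifiers are left out, so "No downtime" entries for RAC and Oracle Streams still carry the caveat in Footnote 2.

```python
# Illustrative encoding of the unplanned-downtime table.
# The structure and function names are assumptions for this sketch;
# the solution names and recovery times are copied from the table.
UNPLANNED_SOLUTIONS = {
    "Computer failures": [
        ("Fast-Start Fault Recovery", "Minutes to hours"),
        ("RAC", "No downtime"),
        ("Data Guard", "< 1 minute"),
        ("Oracle Streams", "No downtime"),
    ],
    "Storage failures": [
        ("ASM", "No downtime"),
        ("RMAN with flash recovery area", "Minutes to hours"),
        ("Data Guard", "< 1 minute"),
        ("Oracle Streams", "No downtime"),
    ],
    "Human errors": [
        ("Oracle security features", "No downtime"),
        ("Oracle Flashback technology", "< 30 minutes"),
        ("LogMiner", "Minutes to hours"),
    ],
    "Data corruptions": [
        ("HARD", "No downtime"),
        ("RMAN with flash recovery area", "Minutes to hours"),
        ("Data Guard", "< 1 minute"),
        ("Oracle Streams", "No downtime"),
    ],
    "Site failures": [
        ("RMAN", "Hours to days"),
        ("Data Guard", "Seconds to 5 minutes"),
        ("Oracle Streams", "Seconds to 5 minutes"),
    ],
}


def solutions_for(outage_type):
    """Return the (solution, recovery time) pairs listed for an outage type."""
    return UNPLANNED_SOLUTIONS.get(outage_type, [])


def no_downtime_solutions(outage_type):
    """Return the solutions whose listed recovery time is 'No downtime'."""
    return [name for name, recovery in solutions_for(outage_type)
            if recovery == "No downtime"]


print(no_downtime_solutions("Storage failures"))  # ['ASM', 'Oracle Streams']
```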

 

3. Oracle High Availability Solutions for Planned Downtime

 

Each entry below gives the maintenance type, the Oracle solution, a description, the recovery time, and considerations.

Maintenance type: System and hardware upgrades
Oracle solution: RAC
Description: To avoid downtime:
  1. Dynamically redirect connections and services to a different instance.
  2. Shut down target instance.
  3. Upgrade target node while other nodes and instances are still available.
  4. Start node and instance. Repeat on another node.
Recovery time: No downtime
Considerations:
  • Need to check for system restrictions.
  • Need to check if the database and clusterware versions are certified with the new system and hardware changes.
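
The four-step rolling procedure can be sketched as a small simulation. This is not an Oracle API: the `Node` class, the `rolling_upgrade` function, and the version labels are hypothetical names used only to show the ordering of the steps and the requirement that another instance is running before the target is shut down.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one cluster node and its instance."""
    name: str
    version: str
    running: bool = True


def rolling_upgrade(nodes, new_version):
    """Upgrade nodes one at a time and return a log of the actions taken."""
    log = []
    for target in nodes:
        others = [n for n in nodes if n is not target and n.running]
        if not others:
            raise RuntimeError("no other running instance to take over services")
        # Step 1: redirect connections and services to a different instance.
        log.append(f"redirect services from {target.name} to {others[0].name}")
        # Step 2: shut down the target instance.
        target.running = False
        log.append(f"shut down {target.name}")
        # Step 3: upgrade the target node while the others stay available.
        target.version = new_version
        log.append(f"upgrade {target.name} to {new_version}")
        # Step 4: start the node and instance, then repeat on the next node.
        target.running = True
        log.append(f"start {target.name}")
    return log


cluster = [Node("node1", "old"), Node("node2", "old")]
for line in rolling_upgrade(cluster, "new"):
    print(line)
```

In this model a single-node cluster raises an error, which mirrors the point of the procedure: the upgrade avoids downtime only because another instance is available to take over services.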

Maintenance type: Operating system upgrade
Oracle solution: RAC
Description: To avoid application downtime:
  1. Dynamically redirect connections and services to a different instance.
  2. Shut down target instance.
  3. Upgrade operating system on target node while other nodes and instances are still available.
  4. Start node and instance. Repeat on another node.
Recovery time: No downtime
Considerations:
  • Need to check if the database and the clusterware versions are certified for both operating system patch releases.

Maintenance type: Oracle one-off patches
Oracle solution: RAC
Description: "One-off" patches (also called interim patches) to database software are usually applied to implement known fixes for software problems, or to apply diagnostic patches to gather information on a problem. Such patch application is often performed during a scheduled maintenance outage.

Oracle provides the capability to do rolling patch upgrades with RAC with little or no database downtime using the opatch command-line utility.

A RAC rolling upgrade enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched needs to be disabled. The other instances can remain available. This means that the application downtime required for scheduled outages is further reduced. Oracle's opatch utility enables the user to apply the patch successively to the different instances in a RAC installation.
Recovery time: No downtime
Considerations: Rolling upgrade is only available for patches that are certified for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:
  • Patches that do not affect the contents of the database, such as the data dictionary
  • Patches not related to RAC inter-node communication
  • Patches related to client-side tools such as SQL*Plus, Oracle utilities, development libraries, and Oracle Net
  • Patches that do not change shared database resources, such as datafile headers, control files, and common header definitions of kernel modules

RAC cannot be used for rolling upgrade of patch sets.

Maintenance type: Storage migration (Footnote 1)
Oracle solution: ASM
Description: ASM enables you to add all disks in one storage array and subsequently drop all disks from another array. ASM will automatically rebalance and migrate data to the new storage while the database remains operational.
Recovery time: No downtime
Considerations: Before removing the source storage array, ensure that the rebalancing is complete.
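
The requirement to confirm that rebalancing has finished before removing the source array can be expressed as a polling loop. The sketch below is an assumption-laden illustration, not an Oracle interface: `wait_for_rebalance` and its `count_active_operations` callable are hypothetical, and the caller would have to supply a real check (one possibility is counting rows in the `V$ASM_OPERATION` view, which lists active long-running ASM operations).

```python
import time


def wait_for_rebalance(count_active_operations, poll_seconds=30.0,
                       timeout_seconds=3600.0, sleep=time.sleep):
    """Poll until no active ASM operation is reported or the timeout elapses.

    count_active_operations is a hypothetical callable supplied by the caller
    that returns the number of operations still in progress.
    Returns True when the rebalance has finished, False on timeout.
    """
    waited = 0.0
    while True:
        if count_active_operations() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake check that reports completion on the third poll.
fake_counts = iter([2, 1, 0])
finished = wait_for_rebalance(lambda: next(fake_counts),
                              poll_seconds=1.0, timeout_seconds=10.0,
                              sleep=lambda seconds: None)
print("safe to remove source array:", finished)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; only when the function returns True should the source array be removed.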

Maintenance type: System and cluster upgrades
Oracle solution: Data Guard
Description: For system upgrades that are not rolling upgradable with RAC due to system restrictions or cluster firmware upgrades that require downtime, leverage Data Guard to switch over to a physical or logical standby database:
  1. Issue Data Guard Switchover (only downtime component: optimally seconds to minutes).
  2. Shut down initial primary database (now standby).
  3. Execute system and cluster upgrade steps.
  4. Restart as standby database and allow recovery to synchronize.
  5. Optionally issue Data Guard Switchover to return to original database.
Recovery time: Seconds to minutes
Considerations: For fastest switchover, the standby database should be using real-time apply and synchronized prior to the switchover operation.
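
The ordering of the switchover steps can be illustrated with a small role-tracking model. Nothing here calls Data Guard: the `Database` class, `switchover`, and `upgrade_via_switchover` are hypothetical names, and the `system_level` field is an invented label for the system and cluster software on each host.

```python
from dataclasses import dataclass


@dataclass
class Database:
    """Hypothetical model of one database in a Data Guard pair."""
    name: str
    role: str            # "primary" or "standby"
    system_level: str    # invented label for the host's system/cluster level
    is_open: bool = True


def switchover(primary, standby):
    """Swap roles; in the procedure this is the only step with downtime."""
    if primary.role != "primary" or standby.role != "standby":
        raise ValueError("switchover needs one primary and one standby")
    primary.role, standby.role = "standby", "primary"


def upgrade_via_switchover(initial_primary, initial_standby, new_level,
                           switch_back=False):
    """Follow the five steps for upgrading the initial primary's system."""
    switchover(initial_primary, initial_standby)      # step 1
    initial_primary.is_open = False                   # step 2: shut down
    initial_primary.system_level = new_level          # step 3: upgrade
    initial_primary.is_open = True                    # step 4: restart as standby
    if switch_back:
        switchover(initial_standby, initial_primary)  # step 5 (optional)


site_a = Database("site_a", "primary", "old")
site_b = Database("site_b", "standby", "old")
upgrade_via_switchover(site_a, site_b, "new")
print(site_a.role, site_a.system_level, "|", site_b.role, site_b.system_level)
```

The model makes one point visible: applications keep running against the new primary during steps 2 through 4, so the outage is limited to the switchover itself.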

Maintenance type: Patchset and database upgrades
Oracle solution: Data Guard using SQL Apply
Description: Leverage Data Guard using SQL Apply to upgrade an Oracle database:
  1. Set up SQL Apply (logical standby database).
  2. Upgrade logical standby database to new release.
  3. Disconnect applications.
  4. Execute Data Guard switchover.
  5. Reconnect applications to the new primary database.
  6. Shut down initial primary database (now logical standby database).
  7. Execute database software upgrade steps.
  8. Restart the standby database and allow recovery to synchronize.
  9. Optionally issue Data Guard Switchover to return to the original database.
Recovery time: Seconds to minutes
Considerations:
  • Only supported for Oracle database versions 10.1.0.3 and higher.
  • SQL Apply has some data type restrictions. For more information, see Oracle Data Guard Concepts and Administration.
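
The version requirement (10.1.0.3 and higher) is easy to get wrong with string comparison, since "10.1.0.10" sorts before "10.1.0.3" as text. A minimal sketch of a numeric check follows; the function names are hypothetical, and the version string is assumed to be dot-separated integers.

```python
# Minimum release stated in the considerations above.
MIN_SQL_APPLY_VERSION = (10, 1, 0, 3)


def parse_version(text):
    """Turn a dotted version string such as '10.1.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_sql_apply_upgrade(version_text):
    """Return True if the version is at least 10.1.0.3 (numeric comparison)."""
    version = parse_version(version_text)
    # Pad short versions with zeros so that '10.2' compares as (10, 2, 0, 0).
    padding = (0,) * (len(MIN_SQL_APPLY_VERSION) - len(version))
    return version + padding >= MIN_SQL_APPLY_VERSION


for candidate in ["9.2.0.8", "10.1.0.2", "10.1.0.3", "10.1.0.10", "10.2"]:
    print(candidate, supports_sql_apply_upgrade(candidate))
```

The check covers only the release number; the SQL Apply data type restrictions still have to be reviewed separately.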

Maintenance type: Database upgrades and platform migration
Oracle solution: Transportable tablespace
Description: Transporting a database only requires copying datafiles and integrating the tablespace structural information. Tablespaces can even be transported between databases from different releases. With Oracle Database 10g, tablespaces can be transported across platforms.

To perform a database upgrade or platform migration:
  1. Create and prepare a separate database using the target release.
  2. Transport tablespace from primary database to target database. Only copy datafiles from the source to target if the databases are not on the same storage device.
  3. Prepare and open the new production database.

If the target database resides on a separate host but on the same platform, create a physical standby database from the initial primary database co-located with the target database. After a Data Guard Switchover, transport the tablespaces from the source to the target without incurring the file transfer time as part of the downtime. (Footnote 2)
Recovery time: Minutes to hours
Considerations: Transportable tablespace has limitations and restrictions in regard to character sets, opaque types, and system tablespace objects. Unlike previous solutions, the steps are not automated.

Transportable tablespaces do provide the following benefits:
  • Provides an easier and more efficient means for content providers to publish structured data and distribute to customers running Oracle on a different platform
  • Simplifies the distribution of data from a data warehousing environment to data marts that are often running on smaller systems with a different platform
  • Enables the sharing of read-only tablespaces across a heterogeneous cluster

Maintenance type: Database upgrades and platform migration
Oracle solution: Oracle Streams
Description: Like Data Guard using SQL Apply, Oracle Streams can capture database changes, propagate them to destinations, and apply the changes at these destinations. Oracle Streams is optimized for replicating data and can capture changes locally in the online redo log as it is written. The captured changes can then be propagated asynchronously to replica databases. This optimization can reduce latency and enable the replicas to lag the primary database by no more than a few seconds.

Unlike Data Guard using SQL Apply, Oracle Streams enables updates on the replica and provides support for heterogeneous platforms with different database releases. Therefore, Oracle Streams may provide the fastest approach for database upgrades and platform migration.
Recovery time: Seconds to minutes to hours
Considerations:
  • Oracle Streams also has data type limitations and restrictions, such as for advanced queue and object types.
  • Oracle Streams implementations will require additional investment for setup and configuration since it is designed to be a more flexible architecture.


Footnote 1: An example is migration from traditional storage to low-cost storage.
Footnote 2: For more information, refer to the best practices white papers available at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm.
