When designing any system for high availability, you typically need to address a number of questions and concerns, such as the following:
- What types of failures should a system be able to sustain?
- How many failures should a system be able to sustain?
- What steps (manual or automatic) need to be executed to ensure availability?
- What systems or processes can we put in place to avoid interruptions in the first place?
These types of questions speak to the concept of dependability. A dependable system is one that is available to service a request and can continue serving requests despite failures of its architectural components (such as a server or network device) or supporting services (such as electricity). Dependability has six core attributes:
- Availability: Measures the system's readiness to accept and respond to new requests for service
- Reliability: Measures how a system can continue to operate after an unexpected event
- Safety: Measures a system's level of risk to users and the environment
- Confidentiality: The ability to control or prevent unauthorized disclosure of information
- Integrity: Measures the presence or absence of an improper system alteration (such as data corruption)
- Maintainability: A qualitative measurement for how easily a system is kept current, repaired, or updated
When designing a system, these ideas or attributes of dependability can be used when building a Fault-Error-Failure chain to help identify potential errors and solve them before they are expressed during operation.
From a practical standpoint, these questions of dependability can be broken up into four main categories:
- Fault forecasting
- Fault avoidance
- Fault removal
- Fault tolerance
Let's examine each of these with regard to designing a highly available SharePoint Server environment.
Fault forecasting
Fault forecasting is the prediction of likely or potential failures. With respect to SharePoint Server architectures, some of the following components come to mind:
- Server hardware, including components such as memory, chassis, power supplies, or mainboards
- Storage hardware, including components such as disk drives or other storage media, storage array software or firmware, or disk controllers
- Networking, including devices (switches, routers, firewalls, proxies, and load balancers), cabling components, and inbound and outbound connectivity to the internet or other sites
- Power, including any power cables, switch boxes, outlets, power strips, uninterruptible power supplies, building or site power, and redundant power generation
- Software, such as application binaries or updates, Secure Sockets Layer (SSL) certificates, operating system binaries or updates, database servers, application services, and components
Each of those component categories represents one or more potential failures for an environment. In the forecasting stage, it's important to determine as many things as possible that can go wrong, as well as the likelihood and service impact of each.
Faults will happen in any environment, so devising strategies to identify potential faults and their impacts will help you design highly available systems.
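One way to keep track of forecasted faults is a simple risk register. The following Python sketch ranks faults by a likelihood-times-impact score. The `Fault` class, the 1-to-5 scales, and the sample entries are all hypothetical and only illustrate the idea; real values would come from your own assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    component: str
    likelihood: int  # 1 (rare) to 5 (frequent); hypothetical scale
    impact: int      # 1 (minor) to 5 (full outage); hypothetical scale

    @property
    def risk(self) -> int:
        # Likelihood multiplied by impact, a common risk-matrix convention.
        return self.likelihood * self.impact


def rank_faults(faults):
    """Return the faults ordered from highest to lowest risk score."""
    return sorted(faults, key=lambda f: f.risk, reverse=True)


# Illustrative entries only, not measurements from a real farm.
register = [
    Fault("Disk drive in storage array", likelihood=4, impact=2),
    Fault("Expired SSL certificate", likelihood=3, impact=4),
    Fault("Site power loss", likelihood=1, impact=5),
]

for fault in rank_faults(register):
    print(f"{fault.risk:>2}  {fault.component}")
```

Sorting by score gives a rough order in which to spend design effort, although a low-likelihood, high-impact fault such as site power loss may still deserve attention regardless of its rank.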
Fault avoidance
Once potential faults in architecture have been identified, you can design around them. The premise of fault avoidance (or fault prevention) is to introduce elements that prevent faults. In the context of SharePoint Server architecture, this can mean several things, such as the following:
- Rigorous change control processes to understand modifications being made to the environment
- Development, test, or other sandbox-style environments where modifications are made and evaluated prior to production deployment
- Automated or scripted procedures to reduce the opportunity for human error
- Planning for redundancy and multiple failure modes
Fault avoidance is critical from both the design and operational perspectives to help ensure a high level of service and availability for a given service or application.
Fault removal
The goal of fault removal is to reduce the number and severity of service faults. Fault removal activities can be broadly divided into two categories:
- During the planning, design, or development of a system
- During the operation of a system
From a SharePoint Server perspective, removing faults during the development or planning of a system is the iterative process of identifying potential faults, such as disk drive or database failure (fault forecasting), designing a system to mitigate or prevent them (fault avoidance), and then performing testing that would trigger a particular failure mode.
For example, if you are planning for disk drive failure in a storage array, you would do the following:
- Implement a storage subsystem with redundant features, such as disk mirroring.
- Deploy an application or service utilizing the storage subsystem.
- Introduce a failure, such as removing a disk drive, that would normally trigger a system failure.
- Verify that the application or service continues to operate.
If the service or application fails to continue operating, you need to review the error logs and conditions, revise the deployment methodology or design, and then repeat the testing. Through this process, you can provide assurance to the business that the system will perform as designed.
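The four test steps above can be rehearsed against a toy model before touching real hardware. The following Python sketch simulates a mirrored volume in memory. The `MirroredVolume` class and its methods are hypothetical and do not correspond to any storage vendor's API.

```python
class MirroredVolume:
    """Toy model of disk mirroring: every write goes to all healthy disks."""

    def __init__(self, disk_count=2):
        self.expected = disk_count
        self.disks = {i: {} for i in range(disk_count)}  # disk id -> blocks

    def write(self, key, value):
        if not self.disks:
            raise OSError("no healthy disks remain")
        for blocks in self.disks.values():
            blocks[key] = value

    def read(self, key):
        # Any surviving mirror can satisfy the read.
        for blocks in self.disks.values():
            if key in blocks:
                return blocks[key]
        raise OSError(f"block {key!r} is unavailable")

    def pull_disk(self, disk_id):
        # Simulates physically removing a drive.
        self.disks.pop(disk_id)

    @property
    def degraded(self):
        return 0 < len(self.disks) < self.expected


volume = MirroredVolume()                   # implement redundant storage
volume.write("doc-1", "contents")           # deploy a workload on it
volume.pull_disk(0)                         # introduce the failure
assert volume.read("doc-1") == "contents"   # verify service continues
print("degraded:", volume.degraded)
```

If the assertion failed, that would correspond to the review, revise, and retest loop described above.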
Addressing the concept of fault removal during operation, using the previous example of disk drive failure, might look something like this:
1. The disk in the storage subsystem fails.
2. The disk subsystem continues operating in a degraded state.
3. The technician replaces the failed disk.
4. The system returns to a normal operational state.
In the preceding example, Step 1 is the failure mode. Step 2 indicates that the system's design has successfully resulted in continuing operations. In Step 3, the technician is performing fault removal by removing a failed device and replacing it with an operational one. In Step 4, the system has recovered and has returned to a normal operating state, free of faults.
In the previous failure scenario, the disk subsystem may have been designed to sustain the failure of a single disk drive. After the disk has failed in Step 1, the system is at risk until the disk has been replaced in Step 3. A system's ability to continue operating is compromised with each further fault, so it's important to minimize the amount of time between the steps.
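To see why the replacement window matters, the following Python sketch estimates the chance that another disk fails before the first one is replaced. It assumes independent failures at a constant rate (an exponential model), which is a simplification. The function name and the mean-time-between-failures figure are hypothetical.

```python
import math


def second_failure_probability(remaining_disks, window_hours, mtbf_hours):
    """Probability that at least one remaining disk fails within the window.

    Assumes independent disks with a constant failure rate of 1 / mtbf_hours.
    """
    combined_rate = remaining_disks / mtbf_hours
    return 1.0 - math.exp(-combined_rate * window_hours)


# Hypothetical figures: one surviving mirror disk, MTBF of 100,000 hours.
for window in (4, 72):
    p = second_failure_probability(1, window, 100_000)
    print(f"{window:>3} h replacement window -> {p:.6%} chance of a second failure")
```

Under these assumptions the risk grows almost linearly with the length of the window, so shortening the time to replace a failed disk shortens the exposure proportionally.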
Fault tolerance
Finally, the design goal of fault tolerance is to address how systems react when faults happen. As we've already stated, faults will happen. Fault-tolerant design plays a crucial role in allowing services to continue while faults are removed.
As a practitioner, you'll often be faced with choices and trade-offs to make on fault-tolerant designs, such as spending resources on redundant database hardware or additional servers in the SharePoint Server farm.
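A rough calculation can help frame such trade-offs. The following Python sketch estimates the availability of a set of redundant servers where any one of them can serve requests, assuming their failures are independent. The function names and the 99% single-server figure are hypothetical.

```python
def parallel_availability(single, count):
    """Availability of `count` redundant nodes when any one can serve requests.

    Assumes independent failures, which shared racks or power can violate.
    """
    if not 0.0 <= single <= 1.0 or count < 1:
        raise ValueError("single must be in [0, 1] and count at least 1")
    return 1.0 - (1.0 - single) ** count


def annual_downtime_hours(availability):
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - availability) * 24 * 365


# Hypothetical single-server availability of 99%.
for servers in (1, 2, 3):
    combined = parallel_availability(0.99, servers)
    print(f"{servers} server(s): {annual_downtime_hours(combined):.4f} h/year")
```

With these figures, a second server reduces expected downtime from roughly 87.6 hours a year to under an hour, while a third server adds far less benefit for the same cost. The independence assumption rarely holds perfectly, which is why the placement of servers also matters.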
When creating a highly available, fault-tolerant design for SharePoint, you'll likely need to account for the following fault domains:
| Fault Domains | Examples |
| --- | --- |
| Rack and power infrastructure | Server racks, power distribution units, power circuits, uninterruptible power supplies, fans, and cooling equipment |
| Physical server infrastructure and components | Servers, server chassis, server backplanes or midplanes, hard disk drives, controllers, network interface cards, and processors |
| Virtual server infrastructure and components | Virtual machine hosts |
| Network infrastructure and components | Rack-based switches, cabling, core switching, load balancers and traffic directors, and firewalls |
| Storage infrastructure and components | Storage networking components, disk arrays, disks, disk controllers, and Redundant Array of Independent Disks (RAID) settings |
| Application services and components | SharePoint application servers, Distributed Cache servers, User Profile Service, and the Search Service application |
| Database services and components | The SQL Server database failover clustering or AlwaysOn availability groups for content, configuration, and service application databases |
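Fault domains are only useful if redundant servers are actually spread across them. The following Python sketch flags any role whose instances all share a single rack or virtual machine host. The server names, roles, and the `rack` and `host` keys are hypothetical inventory data, not output from any SharePoint tool.

```python
from collections import defaultdict


def single_points_of_failure(placements, domain_keys=("rack", "host")):
    """Return (role, domain_key, value) for each role confined to one fault domain."""
    by_role = defaultdict(list)
    for placement in placements:
        by_role[placement["role"]].append(placement)

    findings = []
    for role, instances in sorted(by_role.items()):
        for key in domain_keys:
            values = {instance[key] for instance in instances}
            if len(values) == 1:
                # Every instance of this role depends on the same domain.
                findings.append((role, key, values.pop()))
    return findings


# Hypothetical farm layout.
placements = [
    {"name": "WFE1", "role": "web front end", "rack": "R1", "host": "HV1"},
    {"name": "WFE2", "role": "web front end", "rack": "R2", "host": "HV2"},
    {"name": "APP1", "role": "application", "rack": "R1", "host": "HV1"},
    {"name": "APP2", "role": "application", "rack": "R1", "host": "HV3"},
    {"name": "SQL1", "role": "database", "rack": "R2", "host": "HV2"},
]

for role, key, value in single_points_of_failure(placements):
    print(f"{role}: all instances share {key} {value}")
```

In this layout, the application servers sit on different hosts but in the same rack, and the single database server is exposed in both dimensions.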
In the fault forecasting step, you identified potential failures that could affect the SharePoint Server system and designed methods in the fault avoidance step to help mitigate or reduce the impact of the faults on the environment.
In addition to fault-tolerant designs, you also need to prepare for recovery from catastrophic failures (such as a natural disaster) that span all components in either a single fault domain or multiple fault domains.
In the next section, we'll look at using highly available designs to mitigate the impact of failures of various service databases.
Supported SharePoint high-availability designs
A SharePoint farm has many moving pieces. A successful highly available design requires understanding how the various components can be made resilient. The following table lists the database design considerations:
| Service Database | Supports Database Mirroring for High Availability | Supports Database Mirroring or Log Shipping for Disaster Recovery | Supports SQL AlwaysOn Availability Group for Availability | Supports SQL AlwaysOn Availability Group for Disaster Recovery |
| --- | --- | --- | --- | --- |
| Configuration database | X | X | | |
| Central Administration database | X | X | | |
| Content database(s) | X | X | X | X |
| App Management database | X | X | X | X |
| Business Connectivity Service database | X | X | X | X |
| Managed Metadata Service database | X | X | X | X |
| PerformancePoint Services database | X | X | X | X |
| Power Pivot Service database | X | X | X | X |
| Project Server database | X | X | X | X |
| SharePoint Search Service – administration database | X | X | | |
| SharePoint Search Service – analytics reporting database | X | X | X | |
| SharePoint Search Service – crawl database | X | X | | |
| SharePoint Search Service – link database | X | X | | |
| Secure Store database | X | X | X | X |
| SharePoint Translation Services database | X | X | X | X |
| State Service database | X | | | |
| Subscription Settings database | X | X | X | X |
| Usage and Health Collection database | X | X | X | |
| User Profile Service – profile database | X | X | X | X |
| User Profile Service – synchronization database | X | X | X | X |
| User Profile Service – social tagging database | X | X | X | X |
| Word Automation Services database | X | X | X | X |
One of the common threads you'll see in the databases' availability design is the support for SQL Server AlwaysOn availability groups. Microsoft recommends AlwaysOn availability groups for all databases in a SharePoint Server environment from the perspective of same-farm high availability.
Service applications support high availability behind load balancers. After using the SharePoint Products Configuration Wizard to configure a role for your server, add a configuration object (such as a virtual IP) to your load balancer that includes all of the servers hosting an application or service.
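Conceptually, the virtual IP fronts a pool of servers and sends each request to a member that is currently healthy. The following Python sketch models that behavior with a round-robin pool. The `LoadBalancerPool` class and the server names are hypothetical and do not represent any particular load balancer's configuration interface.

```python
import itertools


class LoadBalancerPool:
    """Round-robin pool that skips servers marked as unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each member at most once per request.
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers behind the virtual IP")


# Hypothetical servers that all host the same service application.
pool = LoadBalancerPool(["APP1", "APP2", "APP3"])
pool.mark_down("APP2")  # for example, after a failed health probe
print([pool.next_server() for _ in range(4)])
```

Requests keep flowing to the remaining members while the failed server is repaired, and the pool only raises an error when every member is down.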
Although a fault-tolerant, resilient architecture is important for day-to-day operations, you also need a business continuity plan in the event of a significant problem. That is where disaster-recovery planning is helpful.