Infrastructure Issues – Troubleshooting and Remediation – SOA-C02 Study Guide

Infrastructure Issues

As a general rule, follow AWS best practices and deploy any unmanaged system across at least two Availability Zones, as discussed in Chapter 1, “Introduction to AWS.” Whenever you deploy, you expect the infrastructure to just work. If a deployment fails, you can often resolve the issue simply by redeploying. We call this the “rinse and repeat” approach. It is useful for the initial deployment as well as for development testing and upgrades.

Troubleshooting infrastructure issues in AWS therefore usually means resolving a misconfiguration or a fault introduced on the configuration or deployment side. For example, suppose you deploy a new version of a security group for your EC2 instances. If the new version is incorrectly configured, the application front end can become unavailable. Errors don’t happen only during manual changes; they can also creep into scripts, templates, and automation configurations.
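One way to catch the kind of security group mistake described above is to validate the rule set before it is deployed. The following is a minimal sketch, not a definitive implementation. It assumes the rules are supplied in the same shape as the `IpPermissions` list that the EC2 API returns for a security group, and it checks IPv4 ranges only. The function name `allows_ingress` and the sample rules are hypothetical.

```python
from ipaddress import ip_network


def allows_ingress(permissions, port, cidr="0.0.0.0/0", protocol="tcp"):
    """Return True if any rule admits `protocol` traffic on `port` from all of `cidr`.

    `permissions` is assumed to follow the IpPermissions shape used by the
    EC2 API. Only IPv4 ranges (the IpRanges key) are inspected.
    """
    wanted = ip_network(cidr)
    for rule in permissions:
        rule_protocol = rule.get("IpProtocol")
        # "-1" means all protocols, in which case no port range is given.
        if rule_protocol not in (protocol, "-1"):
            continue
        if rule_protocol != "-1":
            if not rule.get("FromPort", -1) <= port <= rule.get("ToPort", -1):
                continue
        for ip_range in rule.get("IpRanges", []):
            if wanted.subnet_of(ip_network(ip_range["CidrIp"])):
                return True
    return False


# Hypothetical rule sets: the new version accidentally restricts HTTPS
# to an internal range, which would cut off public access to the front end.
old_rules = [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
              "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]
new_rules = [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
              "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]

if allows_ingress(old_rules, 443) and not allows_ingress(new_rules, 443):
    print("New security group version would block public HTTPS access")
```

A check like this can run in a deployment pipeline so that a faulty rule set is rejected before it reaches the instances.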

You can also create CloudWatch alarms on infrastructure metrics to detect unusual values early and, where possible, remediate before an issue affects users. For example, you can monitor the state of your EC2 instances and perform automatic remediation if any health issues are detected. Many different factors can indicate infrastructure health issues, including but not limited to the following:

 EC2 instance health check failure: EC2 runs automatic status checks on every running instance. The results are available as CloudWatch metrics, and you can create an alarm that informs you of any failures of this type.

 Change in number of EC2 instances: An Availability Zone failure could cause the number of reachable instances in an EC2 environment to drop suddenly. You can track the number of active instances and perform remediation, usually through Auto Scaling. The number could also increase dramatically because of a runaway automation script, so unexpected growth is just as important to monitor and alert on.

 Infrastructure performance: The performance of instances, networks, and disks can also indicate issues. You should always set thresholds for performance alarms that encompass the range of performance expected within your environment and trigger alarms when performance metrics are out of spec.
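As an illustration of the first item in the list above, the following sketch builds the parameters for a CloudWatch alarm on the `StatusCheckFailed` metric of a single EC2 instance. It assumes the third-party `boto3` library is used to submit the request and that an SNS topic already exists to receive notifications. The helper names `status_check_alarm` and `create_alarm`, the alarm name, the instance ID, and the topic ARN are all hypothetical, and the period and evaluation settings are example values, not recommendations.

```python
def status_check_alarm(instance_id, topic_arn):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"status-check-failed-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,              # seconds per evaluation window (example value)
        "EvaluationPeriods": 2,    # consecutive failing windows before alarming
        "Threshold": 1.0,          # the metric reports 1 when a check fails
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarm(params):
    """Submit the alarm definition. Requires boto3, AWS credentials, and a region."""
    import boto3  # imported here so the builder above works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers for illustration only.
params = status_check_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and test without touching an AWS account. The same pattern applies to performance metrics such as `CPUUtilization`, with a threshold chosen from the range expected in your environment.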

ExamAlert

Infrastructure performance is correlated with the performance of the application, and the application security is coupled with the overall data security. An infrastructure performance alarm often can be triggered by an application or security issue. The exam focuses on a holistic view of troubleshooting, so expect to see questions that include all levels of troubleshooting and remediation in one.

Application Issues

After you set up infrastructure monitoring and alarms, you need to deal with the application layer. Track internal application metrics and logs, and create alarms on them, in exactly the same manner as for the infrastructure. The application can often trigger an infrastructure issue; for example, an infinite loop in code can cause CPU utilization to spike to 100 percent. This means that when troubleshooting your application, you should not expect it to “just work.” Instead, compare the application metrics and logs with the infrastructure data to determine whether an issue is infrastructure related or stems from the application itself.

The simplest practice for metric and log collection when running your application on EC2 instances or on-premises servers is to use the CloudWatch agent. The agent can collect a wide range of data from the operating system, such as system-level metrics and log files, and forward that data to CloudWatch as metric or log data. An even better approach is to call the CloudWatch API from within the application code so that the application can self-report metrics regardless of the environment where it runs.
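A self-reporting application could look roughly like the sketch below. It assumes the third-party `boto3` library and its CloudWatch `put_metric_data` call. The helper names `build_metric` and `publish`, the namespace `MyApp`, and the metric and dimension names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone


def build_metric(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of a put_metric_data request."""
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    if dimensions:
        # Dimensions are sent as a list of name/value pairs.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


def publish(namespace, data):
    """Send the metrics to CloudWatch. Requires boto3, AWS credentials, and a region."""
    import boto3  # imported here so build_metric works without boto3 installed

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical application metric: latency of a checkout request.
latency = build_metric(
    "CheckoutLatency", 182, unit="Milliseconds", dimensions={"Service": "checkout"}
)
# publish("MyApp", [latency])  # uncomment when credentials are configured
```

Because the call goes directly to the CloudWatch API, the same code works whether the application runs on EC2, in a container, or on premises, provided it has credentials that permit publishing metrics.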