One of the jobs I had before I became a DevOps Engineer was working as a machine adjuster at a stationery company. I worked on a highly complicated machine with several distinct stages that made coiled notebooks. You can actually see the exact machine I worked on in this YouTube video: https://youtu.be/iAR6q3oIV78?si=8VeWIHKTaRFxi591&t=42.
One of the many lessons I learned working on that machine was that the place where a problem is observed is not always where its source is located. A great example of this in code is a null value in a variable. The error is logged at the point where the variable is used, but the source of that null value could be anywhere, from where the variable was initialized to a point in the code where it stores a returned value.
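To make that concrete, here is a minimal Bash sketch (the file and URL names are invented for illustration) of how an empty value can be created silently in one place and only blow up much later, where it is used:

```bash
# The empty value is created here, silently: if the grep matches nothing,
# the command substitution simply yields an empty string.
artifact_version=$(grep '^version=' build.properties | cut -d'=' -f2)

# ...but the failure only surfaces at the point of use, where the
# malformed URL makes the download fail.
curl -fSs -o app.jar "https://repo.example.com/app-${artifact_version}.jar"
```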
This perspective has helped me quite a bit in my relatively short career. Many of my peers regularly request my help to resolve issues they encounter, and even when my aid doesn’t result in an immediate solution, I routinely receive the praise that my perspective was instrumental in helping them find the root of the problem. With that being said, I would like to recount a couple of situations that highlight how to troubleshoot an issue in DevOps.
Networking
Networking is probably one of the most difficult areas to troubleshoot because there are so many places where an issue can occur, especially when utilizing containers. From the external firewall to the reverse proxy to the VM to the container, all are points of potential failure. And when running in AWS, the load balancer and the Auto Scaling group are two more locations that can be significant points of pain.
On one project it was decided that the applications would be run by a less privileged user. This means that ports lower than 1024 are no longer available to the application. This generally isn’t a problem for web applications, because instead of port 80 we can use port 8080, and instead of port 443 there is port 8443. But that only works if traffic is allowed to pass through the network on those ports. When deploying the changes, the team quickly realized there was a problem: the application failed its health checks repeatedly and a new container was deployed every 3 minutes.
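As a quick illustration of the port restriction, using Python’s built-in web server as a stand-in for the application:

```bash
# Run as an unprivileged user: ports below 1024 are reserved, so the bind fails.
python3 -m http.server 443    # PermissionError: [Errno 13] Permission denied

# Ports 1024 and above can be bound without elevated privileges.
python3 -m http.server 8443   # serves successfully
```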
It was at this point I was brought into the investigation. I suggested rolling back to a previous image that had deployed successfully, at which point it was a matter of comparing the differences between the two images. When we discovered that the successful deployment was using different ports than the one that failed, it was easy enough to look into the Auto Scaling group and see that only port 80 was open. After successfully deploying the application by opening port 8080 on the Auto Scaling group, we then discovered that we were unable to reach the application. So we checked the load balancer and, lo and behold, the same issue.
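A sketch of the kind of comparison that surfaced the mismatch, assuming hypothetical resource IDs, using the AWS CLI to list what each layer actually allows versus the port the application now listens on:

```bash
# Ports allowed by the security group attached to the Auto Scaling group's instances.
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions[].[FromPort,ToPort,IpProtocol]'

# Ports the load balancer is actually listening on.
aws elbv2 describe-listeners --load-balancer-arn "$LB_ARN" \
  --query 'Listeners[].[Port,Protocol]'

# Port the target group forwards to, and the port used for health checks.
aws elbv2 describe-target-groups --load-balancer-arn "$LB_ARN" \
  --query 'TargetGroups[].[Port,HealthCheckPort]'
```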
Those were issues that occurred outside the container, but sometimes the issue lies within. One such case: when deploying a pod to a Kubernetes cluster, the pod was not reachable from the internet. The traceroute pointed to the correct endpoint, but no ping was returned. Another pod and application running on the same cluster could be reached successfully. When analyzing the pod spec, nothing jumped out at us; the configuration appeared correct and without issues. I suggested we compare the Helm charts of both applications, so that we could get some idea of why one application was accessible and the other was not.
When we put the Helm charts up on screen side by side, the root of the issue presented itself clear as day: both applications were listening on the same port. Since the first pod deployed takes possession of the port, the second would remain inaccessible. After updating the Helm chart with a new port and redeploying the application, the issue was rectified.
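A rough sketch of that side-by-side comparison, assuming invented release and namespace names, using Helm to dump each release’s computed values and diffing them:

```bash
# Dump the computed values for both releases.
helm get values app-one -n project-ns --all > /tmp/app-one-values.yaml
helm get values app-two -n project-ns --all > /tmp/app-two-values.yaml

# Compare them side by side; filtering on "port" makes the clash obvious.
diff --side-by-side /tmp/app-one-values.yaml /tmp/app-two-values.yaml | grep -i port
```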
A few months later the application from the second example once again became inaccessible. There had been no change to the Helm chart and no change to the application code, so it was time to check the logs. We checked the container logs and the nginx ingress logs. And… nothing. No errors that would point to the cause of the issue. I requested that we force a redeployment of the pod, which resulted in no change. We then sifted through the logs again. The container logs held no answer, but when we looked into the ingress logs a certificate error popped up. It turned out that the previous week the SSL certificates for all the applications of this particular project had been renewed. However, there was an oversight in updating the certificate for this application, and its certificate was now expired. Once identified, it was easy enough to issue a new cert and get the application back up and running.
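Had we thought to check the certificate directly, something like the following (with a placeholder hostname) would have shown the expiry immediately:

```bash
# Ask the endpoint which certificate it is actually serving and when it expires.
echo | openssl s_client -connect app.example.com:443 -servername app.example.com 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```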
The Build
The primary example given in this article, having the wrong value in a variable, is probably the most frequent issue I have encountered in the build process. But none are more infuriating than the dreaded white space. Many may not understand why unexpected white spaces are a plague in Bash scripting, so I will attempt to explain. The problem is insidious in its nature. White space is generally used by the shell to delineate the inputs of a command, and it can delineate commands as well, depending on the character that precedes it. Ultimately, that is to say, an unexpected white space alters the behaviour of your commands.
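A trivial demonstration of that splitting behaviour:

```bash
# The shell splits the command line on whitespace, so a stray space changes
# how many arguments a command actually receives.
printf '%s\n' one two      # two arguments: prints "one" and "two" on separate lines
printf '%s\n' "one two"    # one argument:  prints "one two" on a single line
```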
This is bad enough, but the issue is compounded by the fact that the white space is nearly impossible to detect by normal means. When trying to establish that all the variables hold the correct values, the go-to method is to echo the value of the variable to the screen. Should the white space be at the beginning of the value, it may be spotted by scrutinizing the indentation. But when it lies at the end of a value, there is no way to tell it’s there via echo.
In this instance the path to the solution is in understanding the environment you are working in. If the output error doesn’t contain the complete converted string of the variables, but echoing the variables shows all the correct values, then the most likely culprit is an unexpected white space. If you receive an error stating “bash: <value>: command not found”, this is also a strong indication that there is an unexpected white space contained in the value of the variable.
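When I suspect a hidden white space, a couple of tricks make it visible where a plain echo does not (the variable name and value here are invented for illustration):

```bash
deploy_env="production "            # trailing space, invisible in normal output

echo "$deploy_env"                  # looks correct on screen
echo "[${deploy_env}]"              # wrapping the value in delimiters exposes it: [production ]
printf '%s' "$deploy_env" | od -c   # od -c prints every character, trailing space included
```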
This issue can also occur when the white space is intentional. Make sure that any value that contains a white space is kept within quotation marks.
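For example (file name invented):

```bash
report_name="weekly report.csv"

cp "$report_name" /tmp/   # quoted: cp receives the single intended file name
cp $report_name /tmp/     # unquoted: cp is handed two arguments, "weekly" and "report.csv", and fails
```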
The final issue I experienced during the build that I would like to review is one you hopefully will never encounter, and my solution to the problem is to never follow this practice. When selecting a base image for a Docker build, a major and minor revision number has to be selected, i.e. 4.5. There is also a patch version number, i.e. 4.5.3. My advice is to never pin to a specific patch version unless there is a bug in the latest version that requires an older base image, and even then it should only be temporary.
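To illustrate with Amazon Corretto (the tags here are purely illustrative): a rolling tag keeps following supported patch releases, while an exact patch tag quietly ages out of support.

```bash
docker pull amazoncorretto:17        # rolling tag: rebuilds keep picking up supported patch releases
docker pull amazoncorretto:17.0.3    # exact patch tag: frozen in time, eventually unsupported
```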
To maintain the security of the application environment, it should be policy to run the update command in the Dockerfile to update the security packages within the image. In some cases the maintainer may keep a version-specific package repository to store security updates, but this will not last forever, and eventually support for that version will be dropped. Amazon Corretto is a popular image used for Java applications. While Amazon maintains a list of the supported images, no one ever checks that list. When support for your image version is dropped, your application will eventually become vulnerable to documented attacks, a situation that obviously should be avoided.
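The update step referred to here is nothing exotic. Assuming a yum-based base image such as Corretto’s Amazon Linux 2 variants, it is the sort of command you would run from a RUN instruction during the build:

```bash
# Apply the latest security patches available in the image's package repository,
# then clean the cache so it doesn't bloat the image layer.
yum update -y --security && yum clean all
```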
Knowing that no one looks at their list of supported image versions, Amazon has come up with a clever way to ensure that you are always using a supported image: by breaking your build. In Corretto specifically, they update the package repository with an incompatible Java runtime. Checking the supported image list is the last thought that would come to my mind when a build fails. So how did I discover this issue? By reading a lot of logs. Essentially, after combing through the logs, I was able to identify the incompatible runtime by comparing the failed build with the last successful build. Fortunately or unfortunately, depending on how you look at it, we had a half dozen builds that started to fail at the same time. After hearing from my team members, I began to compare all the failed builds looking for what they had in common, which was that they all used the same base image. To ensure that this similarity was of importance, I then compared these builds with other projects whose builds were currently successful. And sure enough, my suspicion was verified: none of the currently successful builds pinned the patch version number. Shortly after I made the discovery, an emergency meeting was called to discuss the six or more builds that were failing. I’m glad to say it was a very short meeting.
Googling
Google is a great tool for finding solutions to the issues you encounter. At least that’s what I’ve been told; I personally will never know, as I had switched to DuckDuckGo long before I started my tech career. Jokes aside, while there is a strong likelihood that a solution to an issue can be found using a search engine, it is highly recommended not to just use the first solution you encounter. Most solutions online are a response to a specific scenario, which can be significantly different from what you are experiencing. It is unlikely that the issue can be understood from one source alone; comparing multiple solutions will give you a better grasp of how to resolve your specific issue.
Continuing with the theme of what not to do: copying and pasting is something I feel should be avoided. Unless the code is too long to reasonably be expected to write out, you should write it out. Writing out the code provides an opportunity to review the commands. The goal should always be to understand what it is you are doing. Shortcuts may appear effective in the moment, but will ultimately hinder your growth. The goal should be to fix the issue quickly while simultaneously improving your skills.
There will eventually come a problem for which no solution is available online. The documentation appears inadequate to provide an answer and no one has made a post related to the issue. I encountered just such a situation when the AWS IAM roles of the organization were updated to use tagging to further control access to resources. Several scripts that assumed these roles and were run locally broke after the update. The error message stated that the roles were missing the necessary tags to complete the action.
The documentation stated that the tags had to be present on the role, and reviewing the role in the AWS console showed that all the tags were present, but the error persisted. Searching for the error provided no solutions, not so much as a hint as to what the solution might be. The answer only came to me when I realized that the tags may not be inherited when assuming a role. The AWS CLI has a flag to add tags to the command, and updating the scripts with the relevant tags resolved the issue.
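For reference, the fix looked roughly like this (the ARN, session name, and tag values are placeholders): pass the required tags explicitly as session tags when assuming the role, since, as noted above, they may not be inherited by the assumed-role session.

```bash
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/deploy-role \
  --role-session-name local-script \
  --tags Key=Project,Value=example-project Key=Environment,Value=dev
```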
Conclusion
The time it took to resolve these examples ranged from minutes to a couple (meaning two) of days. They range in complexity and tedium, but they all have a thread in common: knowing what the desired behaviour should be and comparing it to the behaviour you currently observe. For this, robust documentation and logs are needed. Knowing how to read configuration files is a skill that takes time; ensuring you understand all the components therein will allow issues to jump out at you.
When tackling an issue with no examples to compare against, it is essential that you know your tools, or at least know them well enough to make an informed hypothesis based on observed behaviours. The Dev environment is meant to be broken; utilize it to experiment with possible solutions. There will always come a time when you encounter a problem that you are unable to resolve on your own. It is my hope that this article helps reduce the occurrence of those moments.