Red Hat Enterprise Linux Troubleshooting Guide - Sample Chapter
Author: Benjamin Cane

Identify, capture, and resolve common issues faced by Red Hat Enterprise Linux administrators using best practices and advanced troubleshooting techniques. Readers are expected to have system administration knowledge of Red Hat Enterprise Linux; CentOS 7 can be used if Red Hat Enterprise Linux 7 is not available.
It is a best practice to first ask the user reporting an issue for some basic details, and then, after obtaining enough information, attempt to duplicate the issue. Once the issue has been duplicated, the next logical step is to run the necessary commands to troubleshoot and investigate its cause.
It is very common to find yourself returning to previous steps during the troubleshooting process. After you have identified some key errors, you might find that you must ask the original reporter for additional information. When troubleshooting, do not be afraid to take a few steps backwards in order to gain clarity on the issue at hand.

Establishing a hypothesis

With the scientific method, once a problem statement has been formulated, it is time to establish a hypothesis.
With the troubleshooting process, after you have identified the issue and gathered information about it (such as errors and the system's current state), it is likewise time to establish what you believe caused, or is causing, the issue. Some issues, however, might not require much of a hypothesis.
It is common for errors in log files or the system's current state to answer why the issue occurred. In such scenarios, you can simply resolve the issue and move on to the Documentation step. For issues that are not so cut and dried, you will need to put together a hypothesis of the root cause.
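For example, scanning the logs for a repeated message is often enough to explain an issue on its own. A minimal, self-contained sketch (the log contents below are invented for illustration):

```shell
# Build a small sample log file (contents invented for illustration).
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
Jan 10 02:14:01 web1 httpd: (28)No space left on device
Jan 10 02:14:03 web1 postfix/smtpd: write failed: No space left on device
Jan 10 02:15:00 web1 sshd: Accepted publickey for admin
EOF

# The same error across two different services is a strong hint that the
# cause is system-wide (a full filesystem), not any single application.
grep -c "No space left on device" "$logfile"   # prints 2
rm -f "$logfile"
```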
This is necessary because the next step after forming a hypothesis is attempting to resolve the issue. It is difficult to resolve an issue if you do not have at least a theory of the root cause.
Here are a few techniques that can be used to help form a hypothesis.

Putting together patterns

While performing data collection during the previous steps, you might start to see patterns. A pattern can be something as simple as similar log entries across multiple services, the type of failure that occurred (such as multiple services going offline), or even a recurring spike in system resource utilization.
These patterns can be used to formulate a theory about the issue. To drive the point home, let's go through a real-world scenario. You are managing a server that both runs a web application and receives e-mail. You have a monitoring system that detected an error with the web service and created a ticket. While investigating the ticket, you also receive a call from an e-mail user stating that they are getting bounce backs. When you ask the user to read the error to you, they mention "No space left on device".
Let's break down this scenario:

- Our monitoring system has detected an error with the web service.
- We have also received reports from e-mail users with errors indicative of a file system being full.

Could all of this mean that Apache is down because the file system is full? Should we investigate it? Is this something that I've encountered before? The above breakdown leads into the next technique for forming a hypothesis.
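If the hypothesis is a full file system, it can be checked directly. Note that the same error also appears when a filesystem runs out of inodes, so both are worth checking (the mount point here is illustrative):

```shell
# Check block usage for the filesystem holding the application data:
df -h /var

# Check inode usage as well; a full inode table produces the same
# "No space left on device" error even when df -h shows free space.
df -i /var
```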
It might sound simple, but it is often forgotten. How do we know this? Well, simply, we have seen it before. Maybe we have seen that same error with e-mail bounce backs, or maybe we have seen the error from other services. The point is, the error is familiar, and the error generally means one thing. Remembering common errors can be extremely useful for intuitive types such as the Educated Guesser and the Adaptor; it is something they tend to perform naturally. For the Data Collector, a handy trick is to keep a reference table of common errors.
From my experience, most Data Collectors tend to keep a set of notes containing things such as common commands or steps for procedures. Adding common errors, and the meaning behind those errors, to these notes is a great way for systematic thinkers such as Data Collectors to establish a hypothesis faster.
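Such a reference table can be as simple as a greppable text file. The entries below are invented for illustration:

```shell
# A plain-text "common errors" table a Data Collector might maintain.
# Columns: error | usual meaning | first commands to run
errfile=$(mktemp)
cat > "$errfile" <<'EOF'
No space left on device   | filesystem or inodes full  | df -h; df -i
Connection refused        | service down or firewalled | ss -lnt; systemctl status
Name or service not known | DNS resolution failure     | dig; cat /etc/resolv.conf
EOF

# When a familiar error appears, look up its usual meaning:
grep -i "no space left" "$errfile"
rm -f "$errfile"
```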
Overall, establishing a hypothesis is important for all types of troubleshooters. This is the area where intuitive thinkers such as Educated Guessers and Adaptors excel. Generally, those types of troubleshooters will form a hypothesis sooner, even if those hypotheses are not always correct.

Trial and error

In the scientific method, once a hypothesis is formed, the next stage is experimentation.
With troubleshooting, this equates to attempting to resolve the issue. Some issues are simple and can be resolved using a standard procedure or steps from experience. Other issues, however, are not as simple. Sometimes, the hypothesis turns out to be wrong, or the issue ends up being more complicated than initially thought. In such cases, it might take multiple attempts to resolve the issue. I personally like to think of this as being similar to trial and error. In general, you have an idea of what is wrong (the hypothesis) and an idea of how to resolve it.
You attempt to resolve it (trial), and if that doesn't work (error), you move on to the next possible solution.

Start by creating a backup

To those taking up a new role as a Linux systems administrator, if there were only one piece of advice I could give, it would be the one that most have learned the hard way: create a backup before you change anything. Many times, as systems administrators, we find ourselves needing to change a configuration file or delete a few unneeded files in order to solve an issue. Unfortunately, we might think we know what needs to be removed or changed, but we are not always correct. If a backup was taken, the change can simply be restored to its previous state; without a backup, however, reverting changes is not as easy.
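As a minimal sketch, copying a file aside before editing it is often all that is needed (the file path and contents here are illustrative):

```shell
# File about to be changed (illustrative path and contents).
conf=/tmp/example.conf
echo "setting=1" > "$conf"

# Take a timestamped copy before editing, so repeated edits never
# overwrite an earlier backup.
cp -p "$conf" "${conf}.bak.$(date +%Y%m%d-%H%M%S)"

# Verify the backup exists before touching the original.
ls -l "${conf}".bak.*
```

If the change goes wrong, restoring is a single cp back from the newest .bak copy.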
A backup can consist of many things: it can be a full system backup using something like rdiff-backup, a VM snapshot, or something as simple as a copy of a file. For those interested in seeing the extent of this tip in practice, simply look for stray backup copies on any server that has more than four systems administrators and has been around for several years.

Getting help

In many cases, at this point, the issue is resolved; but much like each step in the troubleshooting process, it depends on the issue at hand.
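The author's exact command is not reproduced in this extract, but a command in the same spirit, listing leftover ad-hoc backup copies under /etc, might be:

```shell
# Long-lived, multi-admin servers tend to accumulate ad-hoc backup
# copies with names like httpd.conf.bak or sshd_config.old; list them:
find /etc \( -name '*.bak*' -o -name '*.old' \) 2>/dev/null | sort
```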
While getting help is not exactly a troubleshooting step, it is often the next logical step if you cannot solve the issue on your own. When looking for help, there are generally six resources available.

Books

Books such as this one are good for referencing commands or troubleshooting steps for particular types of issues. Other books, such as those that specialize in a specific technology, are good for referencing how that technology works.
In previous years, it was not uncommon to see a senior admin with a bookshelf full of technical books at his or her disposal. In today's world, as books are more frequently seen in a digital format, they are even easier to use as references. The digital format makes them searchable and allows readers to find specific sections faster than with a traditional printed version.

Team Wikis and Runbooks

Runbooks are collections of the processes and procedures used daily by the operations team to keep production environments operating normally.
Sometimes, these Runbooks contain information for provisioning new servers, and sometimes they are dedicated to troubleshooting. In today's world, Runbooks have mostly been replaced by Team Wikis; these Wikis will often have the same content but are online.
They also tend to be searchable and easier to keep up to date, which means they are frequently more relevant than a traditional printed Runbook. The benefit of Team Wikis and Runbooks is that they can not only address issues that are specific to your environment, but also describe how to resolve those issues.
There are many ways to configure services such as Apache, and there are even more ways that external systems create dependencies on these services. In some environments, you might be able to simply restart Apache whenever there is an issue, but in others, you might actually have to go through several prerequisite steps.
If there is a specific process that needs to be followed before restarting a service, it is a best practice to document that process in either a Team Wiki or a Runbook.

Google

Google is such a common tool for systems administrators that at one point Google offered search portals dedicated to specific topics, including Linux. Google has deprecated these search portals, but that doesn't mean that the number of times systems administrators use Google (or any other search engine) for troubleshooting has decreased.
In fact, in today's world, it is not uncommon to hear the words "I would Google it" in technical interviews. A few tips for those new to using Google for systems administration tasks:

- If you copy and paste a full error message (removing any server-specific text), you will likely find more relevant results. For example, searching for kdumpctl: No memory reserved for crash kernel returns a small set of focused results, whereas searching for the more generic memory reserved for crash kernel returns many times more.
- You can find an online version of almost any man page by searching for man followed by the command name, such as man netstat.
- You can wrap an error in double quotes to refine the search results to those that contain that exact error.
- Asking what you are looking for in the form of a question usually surfaces tutorials.

While Google can be a great resource, the results should always be taken with a grain of salt. Often, while searching for an error on Google, you might find a suggested command that offers little explanation but simply says "run this and it will fix it".
Be very cautious when running such commands; any command you execute on a system should be a command you are familiar with. You should always know what a command does before executing it.

Man pages

When Google is not available, and even sometimes when it is, the best source of information on commands, or on Linux in general, is the man pages.
The man pages are the core Linux manual documents and are accessible via the man command. To look up the documentation for the netstat command, for example, simply run man netstat. This command outputs not only information on what the netstat command is, but also a quick synopsis of its usage, along with detailed descriptions of each flag and what it does (some of which refer you to related pages, such as route(8), for details).
In general, the base manual pages for the core system and libraries are distributed with the man-pages package. The man pages for specific commands such as top, netstat, or ps are distributed as part of that command's installation package. The reason for this is that the documentation of individual commands and components is left to the package maintainers.
This can mean that some commands are not documented to the level of others. In general, however, the man pages are extremely useful sources of information and can answer most day-to-day questions.
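For reference, typical man page lookups look like the following; the netstat and crontab pages are examples and require their packages to be installed:

```shell
# Open the manual page for a command:
#   man netstat
# Request a specific manual section (5 = file formats, 8 = admin commands):
#   man 5 crontab
# Search page names and descriptions when only the topic is known:
#   apropos network
# The pages themselves are stored under /usr/share/man, one directory
# per manual section (man1, man5, man8, and so on):
ls /usr/share/man
```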
Reading a man page

In the previous example, we saw that the man page for netstat includes a few sections of information. In general, man pages have a consistent layout, with some common sections that can be found within most man pages.
The following is a simple list of some of these common sections:

Name

The Name section generally contains the name of the command and a very brief description of it. The following is the Name section from the ps command's man page:

NAME
       ps - report a snapshot of the current processes.

Synopsis

The Synopsis section of a command's man page will generally list the command followed by its possible flags or options.
A very good example of this section can be seen in the netstat command's synopsis.

Description

The Description section will often contain a longer description of the command, as well as a list and explanation of its various options; the cat command's man page contains a good example. The Description section is very useful, since it goes beyond simply looking up options.
This section is often where you will find documentation about the nuances of commands.

Examples

Man pages will often also include examples of using the command. The cat command's man page, for instance, shows how to use cat to read from files and standard input in one command. This section is often where I find new ways of using commands that I have used many times before.

Additional sections

In addition to the previous sections, you might also see sections such as See Also, Files, Author, and History.
These sections can also contain useful information; however, not every man page will have them.

Info documentation

Along with man pages, Linux systems generally also contain info documentation, which is designed to hold additional documentation that goes beyond what is in the man pages. Invoking info documentation is similar to using man pages: simply execute the info command followed by the subject you wish to view.

Referencing more than commands

In addition to looking up commands, man pages and info documentation can also be used to view documentation for other items, such as system calls or configuration files.
As an example, if you were to use man to search for the term signal, you would be shown the man page for the signal() system call, including the warning "Avoid its use: See Portability below." signal is a very important system call, and signals are a core concept of Linux. Knowing that it is possible to use the man and info commands to look up core Linux concepts and behaviors can be very useful during troubleshooting.
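Signals are a good illustration of this: section 7 of the man pages holds the conceptual overview, section 2 holds the system call, and the shell itself can list the signal names:

```shell
# Conceptual overview of signals:      man 7 signal
# The signal() system call man page:   man 2 signal
# List the available signal names and numbers directly from the shell:
kill -l
```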
Installing man pages

Red Hat Enterprise Linux based distributions generally include the man-pages package; if your system does not have the man-pages package installed, you can install it with the yum command.

Red Hat kernel docs

In addition to man pages, the Red Hat distribution also has a package called kernel-doc. This package contains quite a bit of information on how the internals of the system work.
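On a Red Hat Enterprise Linux or CentOS system, both documentation packages can be installed with yum (package names as shipped in Red Hat's repositories):

```shell
yum install -y man-pages    # base manual pages for the core system
yum install -y kernel-doc   # kernel internals documentation

# Once installed, the kernel documentation can be browsed under
# /usr/share/doc/kernel-doc-<version>/, including files that describe
# kernel tunables.
```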
This resource is quite useful for deeper troubleshooting, such as adjusting kernel tunables or understanding how ext4 filesystems utilize the journal. By default, the kernel-doc package is not installed, but it can be easily installed using the yum command.

People

Whether it is a friend or a team leader, there is a certain etiquette when asking others for help.
The following is a list of things that people tend to expect when asked to help solve an issue. When I am asked for help, I would expect you to:

Try to resolve it yourself: When escalating an issue, it is always best to at least try to follow the Understanding the problem statement and Forming a hypothesis steps of the troubleshooting process.
Document what you've tried: Documentation is key to escalating issues or getting help. The better you document the steps tried and errors found, the faster it will be for others to identify and resolve the issue.
Explain what you think the issue is and what was reported: When you escalate the issue, one of the first things to point out is your hypothesis.
Often, this can help expedite resolution by leading the next person to a possible solution without having to perform data collection activities.

Mention whether anything else has happened on this system recently: Issues often come in pairs, so it is important to highlight all the factors at play on the affected system or systems.

The preceding list, while not exhaustive, is important, as each of these key pieces of information can help the next person troubleshoot the issue effectively.
Following up

When escalating issues, it is always best to follow up with the other person to find out what they did and how they did it. This is important, as it shows the person you asked that you are willing to learn more, which will often lead to them taking the time to explain how they identified and resolved the issue.
Interactions like these will give you more knowledge and help build your systems administration skills and experience.

Documentation

Documentation is a critical step in the troubleshooting process. At every step during the process, it is key to take notes and document the actions being performed. Why is it important to document? There are mainly three reasons:

- When escalating the issue, the more information you have written down, the more you can pass on to others.
- If the issue is a recurring one, the documentation can be used to update a Team Wiki or Runbook.
- If the incident warrants a root cause analysis, the documentation can be used during that analysis.
Depending on the environment, the documentation can be anything from simple notes saved in a text file on a local system to notes required for a ticketing system. Each work environment is different, but a general rule is that there is no such thing as too much documentation. For Data Collectors, this step is fairly natural, as most Data Collector personalities will generally keep quite a few notes for their own personal use.
For Educated Guessers, this step might seem unnecessary. However, for any issue that is recurring or needs to be escalated, documentation is critical. What kind of information should be documented?
The following list is a good starting point, but as with most things in troubleshooting, what to include depends on the environment and the issue:

- Commands executed during the information gathering steps (within reason; it is not necessary to include every cd or ls command executed).
- Steps taken during attempts to resolve the issue, including the specific commands executed.

With the preceding items well documented, if the issue reoccurs, it is relatively simple to take the documentation and move it to a Team Wiki. The benefit of this is that the Wiki article can then be used by other team members who need to resolve the same issue when it reoccurs. One of the three reasons listed previously for documenting is to use the documentation during a root cause analysis, which leads to our next topic: establishing a root cause analysis.
Root cause analysis

Root cause analysis (RCA) is a process performed after an incident occurs. The goal of the RCA process is to identify the root cause of the incident and to identify any corrective actions that will prevent the same incident from occurring again. These corrective actions might range from establishing user training to reconfiguring Apache across all web servers.
The RCA process is not unique to technology; it is a widely practiced process in fields such as aviation and occupational safety. In those fields, an incident is often more than a few computers being offline; it may be an incident where a person's life was at risk.

The problem as it was reported

One of the first steps in the troubleshooting process is to identify the problem; this information is also a key piece of an RCA.
Its importance can vary depending on the issue. Sometimes, this information will show whether or not the issue was correctly identified. More often, it can serve as an estimate of the impact of the issue. Understanding the impact of an issue can be very important: for some companies and issues, it could mean lost revenue; for other companies, it could mean damage to their brand; or, depending on the issue, it could mean nothing at all.
The actual root cause of the problem

The importance of this element of a root cause analysis is self-explanatory. However, sometimes it might not be possible to identify a root cause.
In this chapter, and in Chapter 12, Root Cause Analysis of an Unexpected Reboot, I will discuss how to handle issues where a full root cause is unavailable.

A timeline of events and actions taken

If we use an aviation incident as an example, it is easy to see how a timeline of events (when did the plane take off, when were the passengers boarded, when did the maintenance crew finish their evaluation) can be useful. A timeline for technology incidents can also be very useful, as it can be used to identify the length of the impact and when key actions were taken.
A good timeline should consist of the times and the major events of the incident.
Any key data points to validate the root cause

In addition to a timeline of events, the RCA should also include key data points.
To use the aviation example again, key data points would be the weather conditions during the incident, the work hours of those involved, or the condition of the aircraft. Whether the data points stand on their own or appear within a timeline, it is important to ensure that they are well documented in the RCA.

A plan of action to prevent the incident from reoccurring

The entire point of performing a root cause analysis is to establish why an incident occurred and to form a plan of action to prevent it from happening again.
Unfortunately, this is an area that many RCAs neglect. An RCA process can be useful when implemented well; however, when implemented poorly, it can turn into a waste of time and resources. With poor implementations, you will often find that RCAs are required for every incident, big or small.
The problem with this is that it leads to a reduction in the quality of the RCAs. An RCA should only be performed when an incident causes significant impact. For example, hardware failures are not preventable; you can proactively identify impending hardware failure using tools such as smartd for hard drives, but apart from replacing drives, you cannot always prevent them from failing.
When an engineer is required to establish a root cause for something as common as a hardware failure, they start to neglect the root cause process. When engineers neglect the RCA process for one type of incident, that neglect can spread to other types of incidents, causing the quality of all RCAs to suffer.
An RCA should be reserved for incidents with significant impact. Minor or routine incidents should never have an RCA requirement; they should, however, be tracked. By tracking the number of hard drives that have been replaced, along with the make and model of those drives, it is possible to identify hardware quality issues. The same is true for routine incidents such as resetting user passwords. By tracking these types of incidents, it is possible to identify possible areas of improvement.
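Tracking can be as lightweight as a flat file of replacements summarized per model. The file format and data below are invented for illustration:

```shell
# A flat-file log of replaced drives: date, host, make, model.
log=$(mktemp)
cat > "$log" <<'EOF'
2015-01-04 web1 Seagate ST3000DM001
2015-02-11 db2  Seagate ST3000DM001
2015-03-19 web3 WDC WD30EFRX
2015-04-02 db1  Seagate ST3000DM001
EOF

# Count failures per make and model; one model dominating the list is a
# quality signal worth raising with the vendor.
awk '{print $3, $4}' "$log" | sort | uniq -c | sort -rn
rm -f "$log"
```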
Establishing a root cause

To give a better understanding of the RCA process, let's use a hypothetical problem seen in production environments: a web application crashed while writing to a file. After logging in to the system, you were able to find that the application crashed because the file system it was attempting to write to was full.

The root cause is not always the obvious cause

Was the root cause of the issue the fact that the file system was full?
While the file system being full might have caused the application to crash, this is what is called a contributing factor. A contributing factor, such as the file system being full, can be corrected, but correcting it will not prevent the issue from reoccurring. At this point, it is important to identify why the file system was full. On further investigation, you find that it was due to a co-worker disabling a cron job that removes old application files.
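Each link in that chain of events can be verified directly. The paths below are illustrative for this hypothetical system:

```shell
appdir=${APPDIR:-/var/app}   # illustrative application directory

# 1. Is the filesystem actually full?
df -h "$appdir" 2>/dev/null || df -h /

# 2. What is consuming the space? Largest entries first.
du -xsh "$appdir"/* 2>/dev/null | sort -rh | head -5

# 3. Is the cleanup job still scheduled?
crontab -l 2>/dev/null | grep -i clean || echo "no cleanup job found in crontab"
```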
After the cron job was disabled, the available space on the file system slowly kept decreasing. Eventually, the file system was 100 percent utilized. In this case, the root cause of the issue was the disabled cron job.

Sometimes you must sacrifice a root cause analysis

Let's look at another hypothetical situation, where an issue causes an outage.
Since the issue caused significant impact, it will absolutely require an RCA. The problem is that, in order to resolve the issue, you will need to perform an activity that eliminates the possibility of performing an accurate RCA. Such situations require a judgment call: whether to live with the outage a little longer or to resolve the outage and sacrifice any chance of an RCA. Unfortunately, there is no single answer for these situations; the correct answer depends on both the issue and the environment affected.
While working on financial systems, I found myself having to make this decision often. With mission-critical systems, the answer was almost always to restore service rather than perform the root cause analysis.
However, whenever possible, it is always preferable to first capture data, even if that data cannot be reviewed immediately.

Understanding your environment

The final section in this chapter is one of the most important best practices I can suggest.
It covers the importance of understanding your environment.