Branch Out to Prevent Failures in the Field

September 21, 2014

Resolve warranty risk through fault tree analysis.

T. Schorn

(This story appears in the September issue of Modern Casting.)

The cost of managing a warranty claim can be many times higher than a simple return of the cast metal part. The value-added cost is much higher: An installed part usually requires disassembly, replacement and shipping, and it potentially incurs costs for lost time, labor or other peripheral damage or loss.

The expectations for a casting supplier normally are to investigate the root cause of a problem and take prompt corrective action to eliminate it. Depending on the volume of parts supplied and the severity of the occurrence, the end user might require information about the probability of other parts in the field having the same problem. They also will request, at a minimum, short-term corrective action to eliminate the defect from the stream of supply. These requests can overwhelm the casting supplier.

Metal castings often are returned in a warranty environment where it is impossible, in a subsequent forensic evaluation, to determine the exact cause of the defect. The complexity of multiple potential common causes, along with equipment and human factors, requires more than simple corrective action. Fault tree analysis can help casting buyers and their suppliers address and resolve difficult quality problems.

Warranty Investigation Method

A proper warranty investigation in such a situation must cover six aspects, as shown in Figure 1. In the diagram, design refers to both the casting supplied and the equipment or system in which it is assembled. Manufacture refers to the process by which it was built. Communications refers to the information provided to the end user regarding use of the equipment, maintenance and any warnings associated with it.

The casting supplier cannot investigate the assembly of the casting into the unit or its field use and potential abuse leading to the warranty claim. An evaluation of the communication and warnings delivered to the user also is out of the casting supplier’s reach, yet necessary to consider as part of a thorough investigation.

If the part design came from the equipment provider, it is their responsibility to evaluate. The material that composes the casting and its specified properties likely were incorporated into the design delivered to the metalcaster. Again, this falls on the casting purchaser to evaluate.

If a claim comes to the attention of a metalcaster, at least some preliminary evaluation of these aspects of warranty is accomplished. The assumption is that the component’s manufacture was at fault in failing to deliver the specified mechanical properties, dimensional conformity or part soundness required by the contract.

Typically, the metalcaster is provided narrative information about the failure and a sample of the part that failed with which to perform forensic analysis. Good practice would indicate the part has some traceability to the process inspections and records of the casting or subsequent operations, such as heat treatment, that may have been performed at the metalcasting facility. The casting provider will need access, in-house or purchased, to the means to understand the nature of the failure and identify the type and extent of any defect.

Common and Assignable Causes

In the nomenclature of quality, process variability generally is attributed to two types of causes: assignable and common. Assignable causes can be named and singly create the objectionable variation in the process. If, for example, a misrun defect was observed in a casting and it was known that the metal was poured 45 degrees below the lower specification limit established for the part, then the cause would be assignable. Corrective action would then be aimed at controlling pour temperature more reliably. Unfortunately, in a great many industrial processes no such obvious assignable cause is present, yet process variability is such that defects are created in the process output at some small rate.

For example, consider an aluminum part cast in a permanent mold. The initial pour temperature will have some variability; the cycle time (and temperature of the mold) also will be variable within a small range. The thickness, thus the insulating value, of the die coat will be variable over a certain range, dependent on operator skill, the materials involved and their concentration, and so on. The forced cooling system will vary in its ability to remove heat, its timing within the solidification cycle, etc. All of these relatively small variations, and others, can combine in unknown ways and lead to a casting defect. The higher the volume and repetition of these circumstances, the greater the potential for variability and defects. These can be classified as common causes, because they are common to the process and not one of them can be assigned easily as the culprit for any specific casting defect.

Once it is accepted that a process does not yield 100% conforming product, it is up to some form of inspection to sieve out the defective articles. Yet none of the in-process inspections—visual, functional or those using technology such as radiography or leak detection methods—is perfectly reliable; none of them catches every defective part 100% of the time.

This leaves the typical part supplier facing a warranty claim in a very troubling situation. It is tempting to explain the situation as related to an assignable process problem and correct something that may or may not have been the actual cause. This short-term solution may succeed in appeasing a customer but ultimately wastes time. It adds a burden to the process of questionable benefit and will not likely reduce the incidence of the defect itself or its escape to the end user.

A Better Way

There is a quality tool that can enable organizations to improve their customer warranty communications. Fault tree analysis (FTA) is used to estimate the probability of an event by establishing the potential cause-and-effect linkages and assigning a probability to these intermediate causes.

FTA methods require the user to understand basic probability and to estimate the likelihood of human error. The power of FTA lies in both the visual representation of these chains of cause and effect and the quantification of risk. These advantages make it ideal for application to warranty communication.

The two basic logical relationships of AND and OR are combined to describe the potential chains of cause and effect leading to the “top event”—the undesirable occurrence that one is interested in avoiding or explaining. Figure 2 is an illustration of the combination. The diagram communicates a situation where the only way to elevate the temperature too high in the heat treatment solution furnace zone 1 is to have a failure of both equipment and check systems. Equipment failure could arise from a burner failure or a circulating fan failure. Check systems could fail if either the alarm fails to alert staff or it is defeated by being mis-set.

The simple box describing an event is supplemented by an oval. The oval indicates a basic event, one that is not going to be investigated any further for cause in the FTA. The basic events are parallel to “root causes” in common forms of WHY-WHY analyses. Events that arise from the basic events may be referred to as intermediate events. The “top event” is the result of all intermediate and basic events.
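
The gate logic described for Figure 2 can be sketched as a small Boolean model. The Python below is only an illustration, assuming the structure given in the text (an AND of two OR branches); the function and argument names are hypothetical and are not labels from the original figure.

```python
# Boolean sketch of the Figure 2 logic as described in the text.
# Event names are illustrative assumptions, not labels from the figure.

def over_temperature(burner_failure: bool,
                     fan_failure: bool,
                     alarm_fails_to_alert: bool,
                     alarm_mis_set: bool) -> bool:
    """Return True if the top event (zone 1 temperature too high) occurs."""
    # Intermediate event: equipment failure (OR of two basic events)
    equipment_failure = burner_failure or fan_failure
    # Intermediate event: check system failure (OR of two basic events)
    check_failure = alarm_fails_to_alert or alarm_mis_set
    # The top event requires BOTH intermediate events (AND gate)
    return equipment_failure and check_failure


# A burner failure alone is caught by a working alarm.
print(over_temperature(True, False, False, False))
# A burner failure combined with a mis-set alarm leads to the top event.
print(over_temperature(True, False, False, True))
```

Each argument corresponds to a basic event (an oval), and each local variable to an intermediate event (a box).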

Calculating Probability

While the effort required to establish a sequence of causes and effects in logical interconnection is instructive and valuable in itself, the great power of the tool lies in its ability to quantify the probability of the “top event” and to assess the contribution of the basic and intermediate events in numerical terms. If one were to establish the probability of occurrence of the basic events, then by using the mathematics of probability of independent events, a probability can be calculated for the “top event.”

For two independent events that must occur together, the probability of the result is the product of the probabilities of the events necessary for its occurrence. When two events are connected by an “AND,” the “top event” probability is given as P1 x P2 = P. Where three events are so related (and independent), with probabilities P1, P2 and P3, the “top event” likelihood is given as P1 x P2 x P3 = P.

Figure 3 illustrates the calculation of probability when two events are connected by an “OR” relationship. Where either one of two independent events can cause the “top event,” the probabilities P1 and P2 combine as:
P = P1 + P2 – P1P2

The subtracted portion represents that overlapping probability where both events occur at the same time. It is not correct to double count these occurrences. Where three events are related by an “OR” connection:
P = P1 + P2 + P3 – P1P2 – P2P3 – P1P3 + P1P2P3

To summarize the math, the probabilities for “AND” connections are multiplied and the probabilities for “OR” logical connections are added, less the overlap terms, which become negligible when the individual probabilities are small. Figure 4 provides an illustration of a section of an FTA where the calculation rules are applied.
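
The two calculation rules can be sketched in a few lines of Python. This is a minimal sketch assuming independent events; the basic-event probabilities are hypothetical values chosen for illustration, and the function names `and_gate` and `or_gate` are not from the original paper. The OR rule is written as 1 - (1 - P1)(1 - P2)..., which expands to the same expressions given above for two and three events.

```python
from math import prod


def and_gate(probabilities):
    """Probability that all independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Probability that at least one independent input event occurs.

    1 - (1 - P1)(1 - P2) expands to P1 + P2 - P1*P2 for two events.
    """
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities (assumed for illustration only)
p_burner, p_fan = 0.002, 0.001
p_alarm_silent, p_alarm_mis_set = 0.01, 0.003

p_equipment = or_gate([p_burner, p_fan])
p_checks = or_gate([p_alarm_silent, p_alarm_mis_set])
p_top = and_gate([p_equipment, p_checks])

print(f"P(equipment failure)    = {p_equipment:.6f}")
print(f"P(check system failure) = {p_checks:.6f}")
print(f"P(top event)            = {p_top:.3e}")
```

With probabilities this small, the OR results are very close to the simple sums of their inputs, which is why the overlap terms are often neglected in practice.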

Assigning Probability

The basic events in such chains of cause and effect generally will fall into three categories, each with its own source of probability information.

Equipment Failure: Many basic events can be the failure or fault of a piece of equipment or a component within a larger system of equipment, such as a plumbing leak or an electrical failure. Establishing a frequency for these events relies on maintenance records. Even if maintenance does not keep such records directly, the purchasing department may have records for the purchase of spare parts associated with the event. Knowing the spare part ordering frequency over a year or two can give the engineer an idea of the frequency of a given failure.

Process Capability: If the probability of a dimensional defect (say, a certain amount of mismatch on a parting line or a mislocated boss) is needed, a capability study can determine the frequency of a specific defect at a given severity. Capability studies may not be waiting in the file to be used for this purpose, but they can be performed with this calculation in mind. Process characteristics, not just product characteristics, can be evaluated in this manner. If, for example, the frequency of an excessively high coolant concentration in a machining line is needed, this data can be obtained from a capability study. It might take much longer than simply gathering a collection of parts and measuring them dimensionally, but the results may be just as important in the evaluation of the “top event.” Methods for calculating the frequency of occurrence beyond a given limit from a capability study can be found in a text on statistical process control or one focused on capability evaluation.
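
As an illustration of turning capability data into a probability, the sketch below estimates the fraction of output beyond a limit. It assumes the characteristic is normally distributed, which a real capability study would need to verify; the mean, standard deviation and limit are hypothetical, and `fraction_beyond_limits` is a name invented for this example.

```python
from statistics import NormalDist


def fraction_beyond_limits(mean, std_dev, lower=None, upper=None):
    """Estimate the fraction of output outside the given limits,
    assuming the characteristic is normally distributed."""
    dist = NormalDist(mu=mean, sigma=std_dev)
    fraction = 0.0
    if lower is not None:
        fraction += dist.cdf(lower)          # area below the lower limit
    if upper is not None:
        fraction += 1.0 - dist.cdf(upper)    # area above the upper limit
    return fraction


# Hypothetical parting-line mismatch study (values assumed, in mm)
p_mismatch = fraction_beyond_limits(mean=0.30, std_dev=0.10, upper=0.60)
print(f"Estimated P(mismatch > 0.60 mm) = {p_mismatch:.5f}")
```

The resulting fraction can be entered in the fault tree as the probability of that basic event, in whatever units the team has standardized on.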

Human Error: A fair number of basic events likely will be human errors, such as forgetting something, misadjustments or failure to follow procedure (assuming one exists). Human error is distressingly common, and its probability has been studied extensively. It is unnecessary for the engineer to guess wildly at how often an individual on the line, for example, is going to forget a step in a procedure. Such evaluations have already been made and are available in several reference materials. Quality engineers, industrial engineers and those responsible for plant safety ought to become familiar with these figures. The base rate for errors of omission is one chance in one hundred and the base rate for errors of commission is three in one thousand opportunities. However, many factors influence these base rates and in both directions.
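
To show how the base rates quoted above might be used, the sketch below computes the chance of at least one error across repeated opportunities. It assumes the opportunities are independent and uses the unadjusted base rates; the number of opportunities is hypothetical, and the function name is invented for this example.

```python
# Base rates quoted in the text, per opportunity
P_OMISSION = 1 / 100      # errors of omission
P_COMMISSION = 3 / 1000   # errors of commission


def p_at_least_one_error(p_per_opportunity, opportunities):
    """Probability of one or more errors across independent opportunities."""
    return 1.0 - (1.0 - p_per_opportunity) ** opportunities


# Hypothetical: a procedure step performed 20 times per shift
shift_opportunities = 20
p_omit = p_at_least_one_error(P_OMISSION, shift_opportunities)
p_commit = p_at_least_one_error(P_COMMISSION, shift_opportunities)
print(f"P(at least one omission per shift)   = {p_omit:.3f}")
print(f"P(at least one commission per shift) = {p_commit:.3f}")
```

Because many factors move these base rates in both directions, any adjustment applied before the calculation should be documented with the FTA.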

Constructing a Fault Tree

Fault trees are group efforts. They must be facilitated by someone who is knowledgeable about the method, the failure under study and the general language of the process. The facilitator must be sufficiently organized to deal with a potentially great many details and to focus on one area and one “branch” of the fault tree logic at a time.

Practically, the fault tree is sketched out beginning with a well-defined “top event.” It is critical that everyone knows exactly what the defect or unwanted occurrence is. The more specifically the “top event” is described, the easier it will be to include potential causes and define their probability. As intermediate events are added, these are built down to the basic events in each branch, branch by branch.

Once the structure is defined, the determination of the probability of the basic events must be assigned to group members to research. The team should include representation from production, maintenance and quality groups, at least. People familiar with the processes in detail, including what goes wrong and how often, provide a great reality check on the construction and fitness of an FTA.

Group members will require training in the overall strategy of FTA prior to initiating the work. Later, when the probabilities need to be determined, a second training session can cover the math associated with turning process capability numbers and maintenance records into probabilities or estimates. A statement such as “We had three failures over the last 5 years of maintenance records” then becomes a meaningful and standardized measure, such as errors/year, for estimating warranty.
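
The conversion from a maintenance record to a standardized rate is simple arithmetic, sketched below. The three failures in five years come from the example above; the 4,000 production hours per year is an assumed figure for illustration only, and `failure_rate` is a hypothetical helper name.

```python
def failure_rate(failures, years, production_hours_per_year=None):
    """Convert a count of recorded failures into a rate.

    Returns failures per year, or failures per production hour when
    production_hours_per_year is supplied.
    """
    per_year = failures / years
    if production_hours_per_year is None:
        return per_year
    return per_year / production_hours_per_year


# "Three failures over the last 5 years of maintenance records"
rate_per_year = failure_rate(3, 5)
# Assumed 4,000 production hours per year (hypothetical)
rate_per_hour = failure_rate(3, 5, production_hours_per_year=4000)

print(f"{rate_per_year:.2f} failures/year")
print(f"{rate_per_hour:.2e} failures/production hour")
```

Whichever unit is chosen, every basic event in the tree needs to be expressed in the same one before the gate calculations are applied.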

Depending on the size and complexity of an FTA, the final spreadsheet may be the result of multiple revisions and editing. The initial effort is expected to have a very high probability for the “top event”; that is the result of estimations made in numerous places, all of them likely conservatively overestimating the chance of error. As improvement efforts are made, the change from the previous probability to the new one quantifies the change in risk, and this movement is a very real measure of the improvement in the frequency of the “top event.”

Using an FTA With Warranty

The primary way to utilize the FTA to make improvement in an organized and deliberate manner is to perform a contribution analysis. This becomes the check part of a closed PDCA (Plan-Do-Check-Act) loop to drive a reduction in the probability of the “top event,” the unwanted warranty defect.

Essentially, this is a mathematical procedure for calculating the percent contribution of intermediate and basic events to the “top event” (see Figure 5).

Simple arithmetic reveals the relative contributions to the “top event” by the three main causal branches, as given in Table 1. The probabilities of the events A, B and C simply add to the “top event” so these contributions are simply the ratio (e.g., PA/PTOP) of the two probabilities.

The basic events D and E in turn contribute to Event A. Their contribution can be evaluated as shown in Table 2. This contribution analysis is performed exactly as the one in Table 1. The contribution to the “top event” can be calculated using the same technique.

With “AND” relationships, the method of calculation of contribution must be modified somewhat. Table 3 shows the contribution to event B. Here, the ratio of event F to event B does not provide a correct net contribution to event B, since the events F and G probabilities are multiplied to yield event B’s probability. Instead, the relative contribution of event F can be given by this equation:

100[PF/(PF+PG)] = % contribution of F to event B

The net contribution of event F to the “top event” is then:

100[PF/(PF+PG)](PB/PTOP) = % contribution of F to the “top event”

The remainder of Figure 5 can be evaluated for contribution using methods already described and as shown in this figure.
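
The contribution rules described above can be sketched as follows. This is an illustration under stated assumptions: the tree shape (A, B and C adding to the top event, D and E adding to A, F and G multiplying to B) follows the text, but every probability is hypothetical, the values in Figure 5 and the tables are not reproduced, and the helper names are invented for the example.

```python
# Contribution analysis sketch. All probabilities are hypothetical.
p_D, p_E = 2e-4, 1e-4    # basic events joined by OR to give event A
p_F, p_G = 1e-2, 5e-3    # basic events joined by AND to give event B
p_C = 5e-5               # event C, treated here as a single event

p_A = p_D + p_E          # OR: probabilities add (overlap neglected)
p_B = p_F * p_G          # AND: probabilities multiply
p_top = p_A + p_B + p_C  # top-level OR: probabilities add


def ratio_pct(p_child, p_parent):
    """Percent contribution under an OR gate: simple ratio."""
    return 100.0 * p_child / p_parent


def and_share_pct(p_child, p_sibling):
    """Percent contribution under a two-input AND gate: PF/(PF+PG)."""
    return 100.0 * p_child / (p_child + p_sibling)


def net_pct(pct_of_parent, p_parent, p_top_event):
    """Scale a share of the parent by the parent's share of the top event."""
    return pct_of_parent * p_parent / p_top_event


net = {
    "D": net_pct(ratio_pct(p_D, p_A), p_A, p_top),
    "E": net_pct(ratio_pct(p_E, p_A), p_A, p_top),
    "F": net_pct(and_share_pct(p_F, p_G), p_B, p_top),
    "G": net_pct(and_share_pct(p_G, p_F), p_B, p_top),
    "C": ratio_pct(p_C, p_top),
}

for event, value in net.items():
    print(f"Event {event}: {value:.1f}% of the top event")
```

A useful sanity check is that the net contributions of all lowest-level events sum to 100% of the “top event.”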

In Tables 2, 3 and 4 the basic events have been shaded to distinguish them from the intermediate events. By reviewing the contributions of these basic events to the “top event,” a Pareto analysis can be performed. This shows which of the many contributors should be acted on first (and what that action might yield in terms of improvement in the “top event”). Figure 6 illustrates the Pareto of contribution for the FTA illustration in Figure 5.
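
A Pareto ordering of contributions is a sort plus a running total, as the short sketch below shows. The percentages are hypothetical and do not come from Figure 6; `pareto` is a name chosen for this example.

```python
def pareto(contributions):
    """Return (event, percent, cumulative percent) rows, largest first."""
    ordered = sorted(contributions.items(), key=lambda item: item[1],
                     reverse=True)
    rows = []
    running_total = 0.0
    for event, percent in ordered:
        running_total += percent
        rows.append((event, percent, running_total))
    return rows


# Hypothetical percent contributions of basic events to the top event
contributions = {"F": 8.3, "D": 50.0, "G": 4.2, "E": 25.0, "C": 12.5}

for event, percent, cumulative in pareto(contributions):
    print(f"{event}: {percent:5.1f}%  cumulative {cumulative:5.1f}%")
```

The events at the top of the list are the first candidates for improvement work, and the cumulative column indicates how much of the “top event” they account for together.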

The Improvement Process

Using contribution analysis of the FTA provides a prioritized direction for improvement and a useful metric for quantifying effectiveness. This activity is intended to be cyclical and continual in a closed PDCA loop, as shown in Figure 7. Improvements may come from several perspectives, but the following may prove most useful in a warranty evaluation where error probabilities are relatively low.

Error proofing: Human error rates will tend to be the highest probability events on the FTA.
Adding redundancy: If two systems must fail for a fault to occur, the probability of both failing simultaneously is much smaller than the probability of either one failing alone.
Evaluating the factors that lead to basic events to further break down (and expose to correction) underlying causes.
Evaluating and improving gage error through MSA (Measurement System Analysis) methods.
Refining probability evaluations via specific capability studies or maintenance record mining.
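
The effect of adding redundancy can be illustrated with a short calculation. The sketch assumes the checks miss defects independently of one another; the defect rate and miss probabilities are hypothetical, and `escape_probability` is a name invented for the example.

```python
def escape_probability(p_defect, miss_probabilities):
    """Probability that a defective part reaches the customer when it
    must slip past every check, assuming the checks miss independently."""
    p_escape = p_defect
    for p_miss in miss_probabilities:
        p_escape *= p_miss   # AND relationship: all checks must fail
    return p_escape


# Hypothetical: 1% of parts defective, each check misses 5% of defects
single_check = escape_probability(0.01, [0.05])
redundant_checks = escape_probability(0.01, [0.05, 0.05])

print(f"One check:  {single_check:.2e}")
print(f"Two checks: {redundant_checks:.2e}")
```

If the two checks share a common weakness, such as the same operator or the same gage, the independence assumption does not hold and the benefit will be smaller than this calculation suggests.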

Reporting and Communication

Interaction with the warranty engineer or quality engineer and the customer regarding this type of analysis may require some education regarding the meaning of the results and the approach used. The following is recommended:

The cross functional team involved in the analysis should be identified.
The “top event” should be clearly communicated and decisions about its scope should be explained.
The overall structure and logic of the FTA should be described, branch by branch, so it makes sense to the customer.
The meaning of the probability numbers should be explained, including the units chosen.
The translation from probability of occurrence to actual expected parts in the field should be explicit (and constant over the revisions of the FTA).
The provisional and conservative nature of the “top event” probability should be described. The probability is more useful as a measurement of improvement than as an absolute prediction of rate of occurrence of a defect.
A comparison of actual warranty to predicted warranty within the FTA should be made, especially over a period of years.
Where suppliers have impact on the results, they should be involved in reviewing or improving on the FTA logic and estimated probabilities.
Reports of FTA results, especially at the initial use, should be made in person where the appropriate explanations and descriptions can be accomplished efficiently.

A Case Study

In one metalcasting facility, machined aluminum castings were being provided to the automotive market and were required not to leak air. The aluminum wall thickness with this requirement was often at 0.11 in. (3 mm), and process capability for this requirement was not 100%. Various methods were used over the years until the metalcaster settled on a helium leak detection system.

Considering the end user’s high level of dissatisfaction with even one product that leaked, an FTA was constructed with a team of production, maintenance and quality staff. The team initially presented this to the customer after receiving four hours of FTA training over several months. In a series of six team meetings, an FTA was constructed with 64 basic events, and a “top event” probability of 1.25 x 10^-4 events/hour of production was calculated. The customer was impressed with the causal analysis and commitment to further reduce the incidence rate by way of objective contribution analysis.

Revision B of the FTA succeeded in reducing the probability to 6.23 x 10^-5 events/hour, about five months later. A third revision was undertaken after a year of further experience and data collection. Revision C has a projected incidence rate of 2.82 x 10^-5 events/hour, a 77% reduction in roughly two years of work. An added benefit of the communication with the customer through FTA is the understanding that the defect rate will never be zero, and it is unnecessary and unhelpful to initiate standard corrective action documents at each occurrence. The foundry is committed only to annual, serious review of the FTA with a Pareto contribution analysis. Individual claim parts are received as they occur and simply reviewed to ensure no obvious assignable causes acted to create the defect.
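
The reductions reported for the case study can be checked with a few lines of arithmetic. The three rates below are the ones quoted in the text; the label "initial" for the first FTA and the helper name `percent_reduction` are chosen for this example.

```python
def percent_reduction(baseline, new_value):
    """Percent reduction of new_value relative to baseline."""
    return 100.0 * (1.0 - new_value / baseline)


# "Top event" probabilities quoted in the text, in events/hour
rates = {
    "initial": 1.25e-4,
    "Revision B": 6.23e-5,
    "Revision C": 2.82e-5,
}

baseline = rates["initial"]
for name, rate in rates.items():
    change = percent_reduction(baseline, rate)
    print(f"{name}: {rate:.2e} events/hour, "
          f"{change:.0f}% below the initial FTA")
```

The Revision C figure reproduces the 77% reduction stated above, and Revision B corresponds to roughly a halving of the initial rate.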

The metalcaster reports internal benefits from a clear understanding of what matters in the control of this defect and a much better team environment in problem solving between maintenance, quality and production staff. Maintenance also reported a better objective basis for spare parts inventory levels as a result of the close scrutiny the records received through the failure probability determination effort.

The use of this method was developed as a part of an overall improvement cycle for warranty. Casting suppliers and purchasers can consider this method a means of combating not only warranty claims, but ineffective, superficial responses to such claims. The increase in customer understanding and appreciation for the investigation is a welcome side benefit.

This article was adapted from Paper 14-048 presented at the 2014 AFS Metalcasting Congress.