[Title Page] [TOC] [Prev] [Next] [End]

Appendix A. Related Case Studies


The structured testing methodology, including the testing techniques and related complexity metrics described in this document, is above all a practical tool to support effective software engineering. The vast majority of this document describes structured testing by using mathematics, examples, intuition, and theoretical analysis. This appendix provides empirical evidence by presenting several case studies from the software engineering literature.

Lloyd K. Mosemann III, while serving as Deputy Assistant Secretary of the Air Force (Communications, Computers, and Logistics), wrote:

For many years, most of us have been guilty of throwing verbal rocks at those developing software metrics. We have denigrated their work on the basis that their products did not "fully" or "completely" measure such quantities as software size, structure, performance, and productivity. While we have been waiting for the perfect metrics we demand, however, we have been measuring almost nothing... I believe the time has come for us to cease our admiration of the metrics problem and to start implementing some metrics. [MOSEMANN]
Fortunately, complexity measurement has become a standard part of software engineering practice, and the metrics described in this document have gained wide acceptance. Cyclomatic complexity in particular is calculated in some form by nearly every commercial software complexity analysis product. The papers in this section are listed chronologically, from the earliest work in 1977 to the most recent in 1996. Most examine various uses of cyclomatic complexity in software development.

A.1 Myers

Myers [MYERS1] calculated cyclomatic complexity for the programs contained in the classic text by Kernighan and Plauger [KERNIGHAN]. For every case in which an improved program was suggested, this improvement resulted in a lower value of v(G). Myers describes one interesting case in which Kernighan and Plauger suggested two simplified versions of a program that has v(G) equal to 16. Myers found that both improvements resulted in a complexity value of 10.

A.2 Schneidewind and Hoffman

Schneidewind and Hoffman [SCHNEIDEWIND1] performed an experiment that related complexity to various error characteristics of software. They wrote:

The propensity to make programming errors and the rates of error detection and correction are dependent on program complexity. Knowledge of these relationships can be used to avoid error-prone structures in software design and to devise a testing strategy which is based on anticipated difficulty of error detection and correction. An experiment in software error data collection and analysis was conducted in order to study these relationships under conditions where the error data could be carefully defined and collected. Several complexity measures which can be defined in terms of the directed graph representation of a program, such as cyclomatic number, were analyzed with respect to the following error characteristics: errors found, time between error detections, and error correction time. Significant relationships were found between complexity measures and error characteristics. The meaning of directed graph structural properties in terms of the complexity of the programming and testing tasks was examined.
Based on this experiment we conclude that, for similar programming environments and assuming a stable programming personnel situation, structure would have a significant effect on the number of errors made and labor time required to find and correct the errors ... complexity measures serve to partition structures into high or low error occurrence according to whether the complexity measure values are high or low, respectively.

A.3 Walsh

Walsh [WALSH] collected data on the number of software errors detected during the development phase of the AEGIS Naval Weapon System. The system contained a total of 276 modules, approximately half of which had a v(G) less than 10 and half with v(G) of 10 or greater. The average error rate for modules with complexity less than 10 was 4.6 per 100 source statements, while the corresponding error rate for the more complex modules was 5.6. As Walsh pointed out, one would expect a similar pattern for undetected errors as well, so that less complex modules will have a lower error rate after delivery as well.

A.4 Henry, Kafura, and Harris

Henry, Kafura, and Harris [HENRY] reported empirical error data collected on the UNIX operating system. The authors obtained a list of errors from the UNIX Users Group and performed correlations with three metrics. The cyclomatic complexity was the most closely related to errors of the three-the correlation between v(G) and number of errors was .96.

A.5 Sheppard and Kruesi

Sheppard and Kruesi [SHEPPARD] examined the performance of programmers in constructing programs from various specification formats. An automated data collection system recorded the complete sequence of events involved in constructing and debugging the program. An analysis of the error data revealed that the major source of difficulty was related to the control flow and not to such factors as the number of statements or variables. The most difficult program had the most complex decision structure, while a considerably easier program performed extremely complex arithmetic calculations but had a simpler decision structure. Thus, v(G) can be used to measure a difficult aspect of programming.

A.6 Carver

Carver [CARVER] collected data on approximately 14,000 lines of code in order to study the change in complexity metrics from code initiation to code completion. Structured programming practices were followed and code reviews used in the generation of the code. The cyclomatic complexity values ranged from 8 to 78 for this code, with changes to v(G) averaging 35% with a median value of 25%, indicating that "approximately 1/3 of the control flow logic was added after code initiation. ...A percentage of 35% increase suggests that either the original designs were incomplete or the programmers implemented the designs in an unsatisfactory manner." The complexity metric variance investigation benefits include: "First, it is valuable for anticipating complexity increases during design so that proper module sizing can be addressed. Secondly, it provides input for measuring the completeness of the original design specification." The study concludes that complexity increases in the ranges noted indicate that designers must not only plan for growth, but also modularize the code to allow for controlled growth, otherwise modules will be implemented with unacceptable complexity levels.

A.7 Kafura and Reddy

A study was performed by Kafura and Reddy [KAFURA] that related cyclomatic complexity as well as six other software complexity metrics to the experience of various maintenance activities performed on a database management system consisting of 16,000 lines of FORTRAN code. The authors used a subjective evaluation technique, relating the quantitative measures obtained from the metrics to assessments by informed experts who are very familiar with the systems being studied. This was done to determine if the information obtained from the metrics would be consistent with expert experience and could be used as a guide to avoid poor maintenance work.

Changes in system level complexity for each of three versions were analyzed. After an analysis of the results of the metrics, the authors concluded that the "change in complexities over time agrees with what one would expect." Also noted was that the complexity growth curves "seemed to support the view that maintenance activities - either enhancements or repairs - impact many different aspects of the system simultaneously."

An observation made is that changes in procedure complexity can be monitored to question large changes in complexity that are not planned. The authors also note that developers tend to avoid changing very complex procedures during maintenance, even if the alternative changes lead to a degradation in overall design quality, and suggest that managers monitor the implementation of those changes.

A.8 Gibson and Senn

Gibson and Senn [GIBSON] conducted an experiment using COBOL code to investigate the relationship between system structure and maintainability, with the purpose of determining whether a set of six objective complexity metrics might offer potential as project management tools. They studied the effects when an old, badly structured system is "improved" and how the improvements impact maintainability. The objectives were to determine whether improvements in the system structure result in measurable improvements in programmer performance, whether programmers can discern system differences, and whether these system differences can be measured by existing, automated metrics.

While all six metrics appeared to be related to performance, the metrics were grouped into two sets of three metrics each. The cyclomatic complexity metric was in the set which provided "relative rankings consistent with relative time and the frequency of serious ripple effect errors", while the other set was related more to the frequency of primary errors introduced with modifications. To address the first issue of the study, the average maintenance time did decrease when the system structure was improved, measured in terms of the effect of structure on time, accuracy, and confidence.

With respect to their second objective, the study revealed that structural differences were not discernible to programmers since programmers in the study could not separate out the complexity of the system from the complexity of the task being undertaken. The authors stated that this result "offers important evidence on the efficacy of subjectively based complexity metrics... The inspector's perceptions may not conform to those of the maintenance programmers, which may affect the predictive ability of subjective metrics over the life of the system." It should be noted that cyclomatic complexity is not subjective.

The third objective dealt with the effect of system structure and metrics, and when the system structure was improved, cyclomatic complexity did indeed decrease. The authors conclude that "objectively based metrics might provide more reliable information for managers than subjectively based measures."

A.9 Ward

Ward [WARD] studied two large-scale firmware projects at HP's Waltham Division that had a very low postrelease defect density. He found that approximately 60% of post-release defects had been introduced during the implementation phase, indicating that implementation improvements such as limiting complexity had the largest potential for quality improvement. Upon further investigation, he found that cyclomatic complexity had a .8 correlation with error density for the projects, and concluded that limiting complexity was a significant factor in software quality. He also reported successful usage of the baseline method to generate module test cases.

A.10 Caldiera and Basili

Caldiera and Basili [CALDIERA] conducted experiments by using software metrics to identify and "qualify" reusable software components from existing systems in order to reduce the amount of code that experts must analyze. The data consisted of 187,000 lines of ANSI C code, spanning 9 existing systems which represented a variety of applications. They used four metrics in their study, one of which was cyclomatic complexity. The reason for using cyclomatic complexity in their study was that, "The component complexity affects reuse cost and quality, taking into account the characteristics of the component's control flow. As with volume, reuse of a component with very low complexity may not repay the cost, whereas high component complexity may indicate poor quality - low readability, poor testability, and a higher possibility of errors. On the other hand, high complexity with high regularity of implementation suggests high functional usefulness. Therefore, for this measure we need both an upper and a lower bound in the basic model."

By using the metrics, the authors were able to identify candidates for code reuse. They determined that they could obtain satisfactory results using values that they calculated as extremes for the ranges of acceptable values. In this study, the upper bound for v(G) was 15.00 with a lower bound of 5.00. The authors concluded that "these case studies show that reusable components have measurable properties that can be synthesized in a simple quantitative model."

A.11 Gill and Kemerer

Gill and Kemerer [GILL] presented research that studied "the relationship between McCabe's cyclomatic complexity and software maintenance productivity, given that a metric which measures complexity should prove to be a useful predictor of maintenance costs." They conducted this study to specifically validate the cyclomatic complexity metric for use in software testing and maintenance. The project analyzed approximately 150,000 lines of code from 19 software systems, comprising 834 modules of which 771 were written in Pascal and 63 in FORTRAN. The authors wrote:

Knowledge of the impact of cyclomatic complexity on maintenance productivity is potentially more valuable than that of NCSLOC [noncomment source lines of code], because managers typically do not have a great deal of control over the size of a program since it is intimately connected to the size of the application. However, by measuring and adopting complexity standards and/or by using CASE restructuring tools, they can manage unnecessary cyclomatic complexity.
The data from this study supported previous research regarding the high correlation between cyclomatic complexity and NCSLOC. The authors extended the idea of cyclomatic complexity to "cyclomatic density", which is the ratio of the module's cyclomatic complexity to its length in NCSLOC. The intent is to factor out the size component of complexity. It has the effect of normalizing the complexity of a module, and therefore its maintenance difficulty. The density ratio was tested in this research and "shown to be a statistically significant single-value predictor of maintenance productivity."

A.12 Schneidewind

Schneidewind performed a validation study [SCHNEIDEWIND2], the purpose being to determine whether cyclomatic complexity and size metrics could be used to control factor reliability (represented as a factor error count), either singly or in combination. A factor is a quality factor, "an attribute that contributes to its quality, where software quality is defined as the degree to which software possesses a desired combination of attributes." The data used in the study was collected from actual software projects consisting of 112 total procedures, of which 31 were known to have errors, and 81 with no errors. The Pascal language was used for the 1,600 source code statements that were included in the study.

One of the statistical procedures used was a chi-square test, to determine whether cyclomatic complexity could discriminate between those procedures with errors (as in low-quality software) versus those without errors (a high-quality software). The misclassifications that were studied were divided into two groups: Type 1 misclassifications included procedures with errors that were classified as not having errors. The type 2 category was just the opposite: those modules without errors were classified as having errors. The study found that as cyclomatic complexity increases, Type 1 misclassifications increased because an increasing number of high complexity procedures, of which many had errors, were incorrectly classified as having no errors. However, as cyclomatic complexity decreased, Type 2 misclassifications increased, because an increasing number of low complexity procedures, many having no errors, were incorrectly classified as having errors.

The most significant v(G) threshold for error prediction in this study was at v(G) <= 3, meaning that 3 would be used as a critical value of cyclomatic complexity to discriminate between components containing errors and those that do not. (This value was supported by plotting points, as well as by the chi-square test.) Correctly classified were 75 of the 81 procedures containing no errors, and 21 of the 31 procedures known to have errors. A similar analysis was performed on source code size for these procedures, with an optimal value of 15 being found. The author concludes that size and cyclomatic complexity are valid with respect to the "Discriminative Power" criterion, and either could be used to distinguish between acceptable and unacceptable quality for the application studied and similar applications. Although the author's focus was more on the validation technique than the specific validation study, it is interesting that the thresholds for both complexity and size were found to be much less than those in common usage.

The author notes that quality control "is to allow software managers to identify software that has unacceptable quality sufficiently early in the development process to take corrective action." Quality could be controlled throughout a component's life cycle, so that if v(G) increased from 3 to 8 due to a design change, it could indicate a possible degradation in quality.

A.13 Case study at Stratus Computer

Cyclomatic and essential complexity were measured for a tape driver subsystem before and after a reengineering project. The new version was significantly less complex than the original. Twelve months after release, a follow-up study showed that fewer than five serious or critical errors had been discovered in the new software, a dramatic increase in reliability over the original system. Based on this result, complexity analysis was recommended as part of the design process. [GEORGE]

A.14 Case study at Digital Equipment Corporation

The relationship between defect corrections in production software and both cyclomatic complexity and the number of lines of code was studied. Defect corrections were used since they map defects to individual source modules in an objectively measurable way. Defect corrections were more strongly correlated with cyclomatic complexity than with the number of lines of code.

The correlation between complexity and defect corrections was also investigated when modules with complexity greater than 12 and modules with complexity less than 12 were considered separately. In the high-complexity subset, complexity had a definite correlation with defect corrections. However, there was little or no correlation in the low-complexity subset. The suggested explanation is that other factors outweigh complexity as a source of error when complexity is low, but that the impact of complexity on defect occurrence is greater when complexity is high. [HEIMANN]

A.15 Case study at U.S. Army Missile Command

Automated analysis of cyclomatic complexity was used as part of several Independent Validation and Verification projects, and found to be more accurate and comprehensive than the manual methods for identifying overly complex software that had been previously used. Automated complexity analysis also yielded a significant reduction in cost. Based on the success of these projects, the same techniques were applied to other projects in various languages, including Ultra, Ada, C, C++, PL/M, and FORTRAN. [ROBERTS]

A.16 Coleman et al

A recent study was performed by Coleman, Ash, Lowther, and Oman [COLEMAN] to demonstrate how automating the analysis of software maintainability can be used to guide software-related decision making. The authors developed a four-metric polynomial, known as HPMAS (HP Multidimensional Assessment Model), in which cyclomatic complexity was one of the four metrics used. The data used to test and validate their model utilized actual systems provided by Hewlett-Packard and the Department of Defense which encompassed eleven software systems written in C for a Unix platform. They used their model to study three aspects of system maintenance: To study pre/postanalysis of maintenance changes (over 1,200 lines of code); to rank-order module maintainability (236,000 lines of code); and, to compare software systems (one system consisted of 236,275 lines of code and the other 243,273 lines). In each case, the model proved extremely useful.

The authors write that in each case,

the results from our analysis corresponded to the maintenance engineers' "intuition" about the maintainability of the (sub)system components. But, in every case, the automated analysis provided additional data that was useful in supporting or providing credence for the experts' opinions. Our analyses have assisted in buy-versus-build decisions, targeting subcomponents for perfective maintenance, controlling software quality and entropy over several versions of the same software, identifying change-prone subcomponents, and assessing the effects of reengineering efforts.

A.17 Case study at the AT&T Advanced Software Construction Center

Various aspects of the structured testing methodology have been integrated into the overall development process, concentrating on code inspections and unit testing. The techniques helped control complexity of code, select code for inspection, and measure basis path coverage during unit testing. The development team also used complexity analysis to re-design overly complex modules, which reduced the unit testing effort. [BORGER]

A.18 Case study at Sterling Software

The cyclomatic, design, and essential complexity metrics as well as test coverage were applied to a successful large-scale prototyping effort. The project was part of the Air Force Geographical Information Handling System (AFGIHS), a 3 year, $3.5 million effort to produce an advanced prototype for transition to a full scale development to support Air Force requirements for geospatial display and analysis. Complexity analysis was used to monitor software quality, reduce risk, and optimize testing resources. [BOYD]

A.19 Case Study at GTE Government Systems

Cyclomatic and design complexity were included in a large-scale software metrics collection effort spanning many projects over an eight-year time period. Projects with similar cyclomatic and design complexity profiles were found to follow similar schedules, with respect to the amount of time spent on each development phase. Much of the development work involves the reengineering of legacy systems. Comparison of the cumulative complexity distribution of a legacy program with known programs in the metrics database is used to anticipate the development phases required to reengineer the legacy program. [SCHULTZ-JONES]

A.20 Case study at Elsag Bailey Process Automation

The cyclomatic and essential complexity metrics were used on a successful project that involved the development of application and communications interface software using a client/server architecture to provide bi-directional communications to digital systems. Complexity analysis helped identify potentially defective modules by pinpointing highly complex and unstructured code, based on pre-defined company and industry standards. Acquisition of this information early in the product release cycle helped minimize post-release maintenance costs. These techniques allowed the product verification efforts to be quantified and monitored. [VENTRESS]

A.21 Koshgoftaar et al

Koshgoftaar, Allen, Kalaichelvan, and Goel [KOSHGOFTAAR] performed a case study using a large telecommunications system that consisted of 1.3 million lines of code that was written in a language similar to Pascal. The authors applied discriminant analysis to identify fault-prone areas in a system prior to testing. They developed a model that incorporated cyclomatic complexity as one of the design product metrics. To evaluate the model's ability to predict fault-prone modules, they compared the quality of the predictions based on test data to the actual quality. The results were successful. The authors found that "design product metrics based on call graphs and control-flow graphs can be useful indicators of fault-prone modules... The study provided empirical evidence that in a software-maintenance context, a large system's quality can be predicted from measurements of early development products and reuse indicators."



[Title Page] [TOC] [Prev] [Next] [End]