Prof. Lech Madeyski
Towards More Credible Empirical Research

Abstract: Researchers continuously struggle to provide sufficient evidence for the credibility of their findings, while practitioners find it difficult to trust results of limited credibility. Probably the most striking summary of the research crisis in multiple disciplines is given by Ioannidis, who in his seminal paper [3] (with more than 4400 citations) claims that "Most Research Findings Are False for Most Research Designs and for Most Fields". The stakes are high: according to Gartner, the size of the worldwide software industry in 2013 was US$407.3 billion [1]. Hence, invalid recommendations or missing research findings, in software engineering alone, can cost a lot of money.

An excellent example of problems with the credibility of research findings is given by Shepperd et al. [8], who meta-analysed 600 experimental results drawn from primary studies that compared methods for software defect prediction. They found that the explanatory factor accounting for the largest share of the differences among studies (30%) was the research group. In contrast, the prediction method, which was the main topic of the research, accounted for only 1.3% of the variation among studies. Hence, they commented that there seems to be little point in conducting further primary studies until the problem that "it matters more who does the work than what is done" can be satisfactorily addressed. The analysed primary papers overlapped in terms of both the data sets and the defect prediction modelling methods used. The fact that their results are inconsistent with respect to the impact of the defect prediction methods suggests reproducibility failures. At least some of the problems with the credibility of research findings can be mitigated by adopting reproducible research along with appropriate statistical methods.

Reproducible research refers to the idea that the ultimate product of research is the paper together with its computational environment. That is, a reproducible research document incorporates the textual body of the paper, the data used by the study, and the analysis steps (e.g., statistical analyses, algorithms) used to process the data. The reason for including the whole computational environment is that other researchers can then repeat the studies and reproduce the results, which in turn would increase the credibility of research findings (the fact that something is reproducible does not imply that it is correct, but it greatly increases the chance of spotting problems, if any). Unfortunately, it is often impossible to reproduce data analyses, due to missing raw data, insufficient summary statistics, or undefined analysis procedures. Thus, wider adoption of reproducible research would be beneficial not only for Empirical Software Engineering [7], but also for other research areas. Furthermore, true research findings may be missed due to inadequate statistical methods that do not reflect the state of the art in statistics, even though modern approaches, which avoid the pitfalls of the classic ones, are available [2, 4-6, 9].
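As a small illustration of why robust estimators matter, the following Python sketch compares the ordinary mean with a 20% trimmed mean and the median on a sample containing one extreme value. It is not taken from the cited works: the function name trimmed_mean, the trimming proportion, and the sample values are assumptions made purely for illustration.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the lowest and highest `proportion` of the values.

    Hypothetical helper written for this illustration only.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(proportion * len(ordered))  # values trimmed from each tail
    kept = ordered[cut:len(ordered) - cut]
    return mean(kept)


# Made-up measurements with a single extreme value (not real study data).
sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

plain = mean(sample)                 # pulled upwards by the value 95
robust = trimmed_mean(sample, 0.2)   # ignores the two smallest and two largest values
middle = median(sample)              # also insensitive to the extreme value

print(f"mean={plain:.2f} trimmed_mean={robust:.2f} median={middle:.2f}")
```

In this made-up sample a single extreme observation shifts the mean well away from the bulk of the data, whereas the trimmed mean and the median stay close to it.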

Fortunately, reproducible research requires every statistical analysis to be fully specified, so adopting it would allow other researchers to spot serious data analysis problems easily.
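What "fully specified" might look like in practice can be sketched in a few lines of Python. The example below is only an assumption-laden illustration, not a tool proposed in the talk: it bundles a fingerprint of the input data, the random seed, and the analysis parameters with the result, so that anyone holding the same data can rerun the analysis and compare outputs. All names (data_fingerprint, bootstrap_median_ci, run_analysis), the seed, and the data values are hypothetical.

```python
import hashlib
import json
import random
from statistics import median


def data_fingerprint(values):
    """SHA-256 of the serialised data, so readers can check they hold the same input."""
    payload = json.dumps(values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def bootstrap_median_ci(values, seed, resamples=2000, level=0.95):
    """Percentile bootstrap confidence interval for the median, with a fixed seed."""
    rng = random.Random(seed)  # local generator: no hidden global state
    n = len(values)
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(resamples))
    alpha = 1 - level
    lower_idx = round((alpha / 2) * resamples)
    upper_idx = round((1 - alpha / 2) * resamples) - 1
    return medians[lower_idx], medians[upper_idx]


def run_analysis(values, seed=2017, resamples=2000, level=0.95):
    """Return the result together with everything needed to reproduce it."""
    low, high = bootstrap_median_ci(values, seed, resamples, level)
    return {
        "data_sha256": data_fingerprint(values),
        "seed": seed,
        "method": "percentile bootstrap CI for the median",
        "resamples": resamples,
        "level": level,
        "ci": [low, high],
    }


# Made-up measurements (not real study data).
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
report = run_analysis(data)
print(json.dumps(report, indent=2))
```

Because the seed and parameters are recorded alongside the data fingerprint, a second run on the same data yields an identical report, and any discrepancy points to a difference in the data or in the analysis steps.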

In conclusion, the aim of the talk is to discuss some problems with the credibility of empirical research in software engineering (and beyond) and how to mitigate them, to review the pros and cons of reproducible research, and to suggest supporting tools. The focus of the talk is closely related to the special issue on "Enhancing Credibility of Empirical Software Engineering" in the Information and Software Technology journal (Elsevier), where I serve as a guest co-editor [10].


[1] Gartner says worldwide software market grew 4.8 percent in 2013. Available at

[2] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. CRC Press, 2013.

[3] John P. A. Ioannidis. Why Most Published Research Findings Are False. PLoS Medicine, 2(8):696–701, 2005.

[4] Barbara Kitchenham. Robust Statistical Methods: Why, What and How: Keynote. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering (EASE 2015), pages 1:1–1:6, 2015. DOI:10.1145/2745802.2747956.

[5] Barbara Kitchenham, Lech Madeyski, David Budgen, Jacky Keung, Pearl Brereton, Stuart Charters, Shirley Gibbs, and Amnart Pohthong. Robust Statistical Methods for Empirical Software Engineering. Empirical Software Engineering, (Online First), 2017. DOI:10.1007/s10664-016-9437-5.

[6] Barbara Ann Kitchenham and Lech Madeyski. Meta-analysis. In Barbara Ann Kitchenham, David Budgen, and Pearl Brereton, editors, Evidence-Based Software Engineering and Systematic Reviews, chapter 11, pages 133–154. CRC Press, 2016. Available at Evidence-Based-Software-Engineering-and-Systematic-Reviews/Kitchenham-Budgen-Brereton/p/book/9781482228656.

[7] Lech Madeyski and Barbara Kitchenham. Would wider adoption of reproducible research be beneficial for empirical software engineering research? Journal of Intelligent & Fuzzy Systems, 32:1509–1521, 2017. DOI:10.3233/JIFS-169146.

[8] Martin Shepperd, David Bowes, and Tracy Hall. Researcher Bias: The Use of Machine Learning in Software Defect Prediction. IEEE Transactions on Software Engineering, 40(6):603–616, 2014. DOI:10.1109/TSE.2014.2322358.

[9] Rand R. Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Elsevier, 3rd edition, 2012.

[10] Lech Madeyski, Barbara Kitchenham, Krzysztof Wnuk. Call for Papers to Special Issue on Enhancing Credibility of Empirical Software Engineering. 2017. URL: or

Biographical note: Lech Madeyski is Associate Professor and Founding Acting Head of the Department of Software Engineering at Wroclaw University of Science and Technology, Poland. He is a member of the Committee on Informatics of the Polish Academy of Sciences and chair of the Software Engineering Section of the Committee. In 2016 he was a Visiting Professor at Blekinge Institute of Technology, Sweden (BTH is among the world's most outstanding higher education institutions within systems and software engineering and ranked first within the EU, according to the Journal of Systems and Software). In 2014 he was a Visiting Researcher at Keele University, UK (invited by Prof. Barbara Kitchenham). His research focuses on software engineering, empirical software engineering, data science in software engineering, and reproducible research. He is Founding Co-Editor-in-Chief of the e-Informatica Software Engineering Journal, Guest Co-Editor of the special issue on "Enhancing Credibility of Empirical Software Engineering" in the Information and Software Technology journal (Elsevier), and was one of the founders of the International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE). He has been involved in many international conferences and workshops, either as a PC member (e.g., EASE, PROFES, PROMISE, XP, MUTATION, ENASE) or as chair and steering committee member (KKIO). He has published papers in prestigious journals, including IEEE Transactions on Software Engineering, Empirical Software Engineering, Information and Software Technology, Software Quality Journal, IET Software, Journal of Intelligent and Fuzzy Systems, Cybernetics and Systems, and Software Process: Improvement and Practice, as well as a book ("Test-Driven Development – An Empirical Evaluation of Agile Practice", focused on statistical analyses and meta-analysis of experiments, published by Springer).