Defect prediction models help software quality assurance teams to effectively allocate their limited resources to the most defect-prone software modules. Model validation techniques, such as k-fold cross-validation, use historical data to estimate how well a model will perform in the future. However, little is known about how accurate the performance estimates of these model validation techniques tend to be. In this paper, we set out to investigate the bias and variance of model validation techniques in the domain of defect prediction. A preliminary analysis of 101 publicly available defect prediction datasets suggests that 77% of them are highly susceptible to producing unstable results. Hence, selecting an appropriate model validation technique is a critical experimental design choice. Based on an analysis of 256 studies in the defect prediction literature, we select the 12 most commonly adopted model validation techniques for evaluation. Through a case study of data from 18 systems that span both open-source and proprietary domains, we derive the following practical guidelines for future defect prediction studies: (1) single holdout validation techniques should be avoided; and (2) researchers should use the out-of-sample bootstrap validation technique instead of holdout or the commonly used cross-validation techniques.
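
To make the second guideline concrete, the sketch below illustrates what out-of-sample bootstrap validation looks like in practice. It assumes scikit-learn and NumPy; the logistic regression learner, the AUC metric, and the 100 bootstrap iterations are illustrative assumptions rather than the paper's experimental setup.

```python
# Illustrative sketch of out-of-sample bootstrap validation.
# Assumptions (not from the paper): LogisticRegression, AUC, 100 iterations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def out_of_sample_bootstrap(X, y, n_iterations=100, seed=0):
    rng = np.random.RandomState(seed)
    scores = []
    n = len(y)
    for _ in range(n_iterations):
        # Draw a bootstrap sample (with replacement) to train the model on.
        train_idx = rng.choice(n, size=n, replace=True)
        # Rows that were never drawn form the out-of-sample test set.
        test_idx = np.setdiff1d(np.arange(n), train_idx)
        if (len(test_idx) == 0
                or len(np.unique(y[train_idx])) < 2
                or len(np.unique(y[test_idx])) < 2):
            continue  # skip degenerate resamples
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores.append(roc_auc_score(
            y[test_idx], model.predict_proba(X[test_idx])[:, 1]))
    # The mean of the per-iteration scores is the performance estimate;
    # their spread gives a sense of the estimate's variance.
    return np.mean(scores), np.std(scores)
```

Unlike a single holdout split, each iteration trains on a bootstrap resample and tests on the rows left out of that resample, so the final estimate is averaged over many train/test partitions of the same dataset.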