Stochastic Systems Group
Home Research Group Members Programs  
Demos Calendar Publications Mission Statement Alumni

SSG Seminar Abstract

Validation of Image Segmentation with Simultaneous Truth and Performance Level Estimation (STAPLE)

Simon Warfield
Brigham and Women's Hospital, Harvard Medical School

Characterizing the performance of image segmentation approaches has been a persistent challenge. Performance analysis is important since segmentation algorithms often have limited accuracy and precision. Interactive drawing of the desired segmentation by human raters has often been the only acceptable approach, and yet suffers from intra-rater and inter-rater variability. Automated algorithms have been sought in order to remove the variability introduced by raters, but such algorithms must be assessed to ensure they are suitable for the task.

The performance of raters (human or algorithmic) generating segmentations of medical images has been difficult to quantify because of the difficulty of obtaining or estimating a known true segmentation for clinical data. Although physical and digital phantoms can be constructed for which ground truth is known or readily estimated, such phantoms do not fully reflect clinical images due to the difficulty of constructing phantoms which reproduce the full range of imaging characteristics and normal and pathological anatomical variability observed in clinical data.

Comparison to a collection of segmentations by raters is an attractive alternative since it can be carried out directly on the relevant clinical imaging data. However, the most appropriate measure or set of measures with which to compare such segmentations has not been clarified and several measures are used in practice.

I will describe an Expectation-Maximization algorithm for Simultaneous Truth and Performance Level Estimation (STAPLE). The algorithm considers a collection of segmentations and computes a probabilistic estimate of the true segmentation and a measure of the performance level represented by each segmentation. The source of each segmentation in the collection may be an appropriately trained human rater or raters, or may be an automated segmentation algorithm. I will present generalizations of the method suitable for segmentations with unordered labels, and illustrate its application. I will also present a recent extension of the concept suitable for characterizing rater performance when the segmentations consist of ordered labels, such as is common with level set representations.

Problems with this site should be emailed to