Sub-challenge 1

Disease Grading


The purpose of this challenge is to directly compare the methods developed for automatic image grading of Diabetic Retinopathy (DR).


DR leads to gradual changes in vasculature structure and resulting abnormalities such as microaneurysms, hemorrhages, hard exudates, and cotton wool spots. Venous beading and retinal neovascularization may also be present, and these findings can be used to classify DR into one of two phases, non-proliferative diabetic retinopathy (NPDR) and proliferative diabetic retinopathy (PDR), as shown in the Figure. DR severity is determined according to the criteria given in the Table, and it is used to decide the need for treatment and to make follow-up recommendations.

Figure. Stages of DR: (a) NPDR and (b) PDR

Table. International Clinical Diabetic Retinopathy (DR) Severity Scale

Disease Severity Level: Findings

Grade 0, No apparent retinopathy: No visible sign of abnormalities

Grade 1, Mild NPDR: Presence of microaneurysms only

Grade 2, Moderate NPDR: More than just microaneurysms but less than severe NPDR

Grade 3, Severe NPDR: Moderate NPDR with no signs of PDR and any of the following:

• > 20 intraretinal hemorrhages

• Venous beading

• Intraretinal microvascular abnormalities

Grade 4, PDR: Severe NPDR and one or both of the following:

• Neovascularization

• Vitreous/preretinal hemorrhage


The task is the automatic grading of images for DR based on the international standards of clinical relevance given in the Table. For this sub-challenge, participants must submit DR grading results.

Availability of the data

Initially, 60% of the database with ground truth (i.e. the training set with labels) will be released on December 1, 2019 (four months before the date of the symposium). The validation set (20%) will be provided on January 15, 2020 (two and a half months before the day of the on-site competition), and the remaining 20% (the test set without labels) will be released on the day of the challenge. The released data will be divided proportionately per grade. The test set results will be used for comparison and analysis of submitted results.

NOTE: Because the dataset includes dual-view images for each eye (one centered on the optic disc and one centered on the fovea), participants are expected to use this additional diagnostic information to develop a robust and accurate model. The two views of the same eye may have different grading labels. For example, the optic-disc-centered image may be labeled DR grade 1 while the fovea-centered image is labeled DR grade 3. The final diagnosis is the highest grade across the two views; in this example, the DR grade is 3 rather than 1. This situation occurs in the real clinical screening process. We expect participants to develop a model that takes both views into account to produce a robust DR grading system.
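The highest-grade rule described in the note can be sketched as follows. This is a minimal illustration, not the organizers' code: the function name `eye_level_grades` and the `eye_id` identifiers are hypothetical, and it assumes each view's grade is already available as an integer from 0 to 4.

```python
from collections import defaultdict


def eye_level_grades(view_grades):
    """Collapse per-view DR grades to one grade per eye by taking the maximum.

    view_grades: iterable of (eye_id, grade) pairs, one pair per image view.
    eye_id is a hypothetical identifier shared by both views of the same eye.
    """
    per_eye = defaultdict(int)  # default 0 is the lowest valid grade
    for eye_id, grade in view_grades:
        per_eye[eye_id] = max(per_eye[eye_id], grade)
    return dict(per_eye)


example = [
    ("eye_001", 1),  # optic-disc-centered view
    ("eye_001", 3),  # fovea-centered view
    ("eye_002", 0),
    ("eye_002", 0),
]
print(eye_level_grades(example))  # {'eye_001': 3, 'eye_002': 0}
```

The same reduction applies to model outputs: a system that predicts a grade per view could report the maximum of the two predictions as the eye-level grade, although a model that fuses both views before predicting is equally compatible with the note.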

Evaluation metrics

For the disease grading tasks, i.e. sub-challenge 1 and sub-challenge 3, we will use weighted kappa as the evaluation metric to determine the performance of the participating algorithms. Submissions are scored using the quadratic weighted kappa, which measures the agreement between two ratings (here, the ground truth and the submitted results). This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters). If there is less agreement between the raters than expected by chance, the metric may fall below 0. The quadratic weighted kappa is calculated between the known (ground-truth) scores and the predicted scores.

Results have 5 possible ratings: 0, 1, 2, 3, 4. The quadratic weighted kappa is calculated as follows. First, an N-by-N histogram matrix O is constructed, such that O_{i,j} corresponds to the number of images that have an actual rating of i and a predicted rating of j. An N-by-N matrix of weights, w, is calculated based on the difference between the actual and predicted rating scores.
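The weight formula itself is not reproduced on this page. In the standard definition of the quadratic weighted kappa, which is assumed here, the weight grows with the squared distance between the actual rating i and the predicted rating j:

```latex
w_{i,j} = \frac{(i - j)^2}{(N - 1)^2}
```

With N = 5 grades, an exact match has weight 0, an error of one grade has weight 1/16, and the largest possible error (grade 0 versus grade 4) has weight 1.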

An N-by-N histogram matrix of expected ratings, E, is calculated, assuming that there is no correlation between rating scores. This is calculated as the outer product of the histogram vector of actual ratings and the histogram vector of predicted ratings, normalized such that E and O have the same sum.

From these three matrices, the quadratic weighted kappa is calculated.
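The steps above can be sketched in plain Python. This is an illustrative sketch, not the official scoring script: it assumes the standard definition, in which the weights are (i - j)^2 / (N - 1)^2 and kappa is 1 minus the ratio of the weighted sum of O to the weighted sum of E. The function name `quadratic_weighted_kappa` and its parameters are chosen here for illustration.

```python
def quadratic_weighted_kappa(actual, predicted, n_ratings=5):
    """Quadratic weighted kappa between two equal-length lists of integer grades.

    Assumes grades are integers in 0 .. n_ratings - 1 and the standard
    quadratic weights w[i][j] = (i - j)^2 / (n_ratings - 1)^2.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = n_ratings

    # O[i][j]: number of images with actual grade i and predicted grade j.
    observed = [[0.0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        observed[a][p] += 1.0

    total = float(len(actual))
    hist_actual = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # E[i][j]: outer product of the two histograms, scaled so that
            # E and O have the same sum.
            expected = hist_actual[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Every image has the same actual and predicted grade: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

Calling `quadratic_weighted_kappa(y_true, y_pred)` with two lists of grades returns 1 for perfect agreement and a negative value when the agreement is worse than chance.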

An example showing how to calculate the quadratic weighted kappa metric yourself can be found here:

Result Submission

Results must be submitted as a .csv file with the column headers Test Image No. and DR Grade.
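A results file of this shape could be written as sketched below. The exact header spelling, column order, and image-identifier format are assumptions taken from the sentence above rather than from the official submission instructions, and `write_submission` is a hypothetical helper name.

```python
import csv


def write_submission(path, results):
    """Write (image_no, grade) pairs to a CSV file with a header row.

    The header names below are assumed from the challenge text; check them
    against the official submission instructions before submitting.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Test Image No.", "DR Grade"])
        for image_no, grade in results:
            if grade not in (0, 1, 2, 3, 4):
                raise ValueError(f"invalid DR grade for image {image_no}: {grade}")
            writer.writerow([image_no, grade])
```

For example, `write_submission("results.csv", [("001", 2), ("002", 0)])` produces a header line followed by one line per test image.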

Also, participants need to submit a Docker container including the training and testing code.

Link to submission instructions:

Sample instruction:

Performance Evaluation

This challenge evaluates the image classification performance of the algorithms against the available grades from the experts.