Challenge
Sub-challenge 3

Generalizability of a Diabetic Retinopathy Grading System

Aim

The purpose of this challenge is to explore the generalizability of a Diabetic Retinopathy (DR) grading system. We expect participants to build a model that transfers the DR-diagnosis capability learned from a large number of regular fundus images to ultra-widefield retinal images.

Background

Regular fundus images are used for initial screening; widefield scanning serves as a further screening method because it provides more complete information about the eye. For instance, regular fundus cameras capture only a small region around the optic disc, whereas widefield retinal imaging can observe about 82% of the retina (see Figure).

Figure: Comparison of the regular fundus camera view and the widefield retinal imaging view.

Task

Automatic grading of images for DR based on the international standards of clinical relevance given in the table of Sub-challenge 1. For this sub-challenge, participants will have to submit DR grading results. The difference from Task 1 is that we expect participants to build a model that can transfer the DR-diagnosis capability learned from a large number of regular fundus images to ultra-widefield retinal images. The model's generalizability will be evaluated on ultra-widefield retinal images.

Figure: Sample ultra-widefield retinal images.
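The sketch below illustrates one possible way to approach this transfer task (it is not an official baseline or a required method): train a grading network on the regular fundus training set and apply the same network to ultra-widefield images at evaluation time. The backbone choice, input size, and preprocessing are illustrative assumptions, not part of the challenge specification.

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_GRADES = 5  # DR grades 0-4

# Backbone for DR grading; train it on the regular fundus training set.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_GRADES)
# ... supervised training on regular fundus images goes here ...

# The same preprocessing is applied to ultra-widefield (UWF) images so the
# learned features transfer as directly as possible. The 512x512 input size
# and ImageNet normalization are assumptions for illustration.
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def grade_uwf_image(path: str) -> int:
    """Predict a DR grade (0-4) for a single ultra-widefield image."""
    model.eval()
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return int(model(x).argmax(dim=1).item())

Whether and how the released ultra-widefield images may be used for additional adaptation or fine-tuning is governed by the challenge rules on data usage.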

Availability of the data

Initially, 60% of the database with ground truth (i.e., the training set with labels) will be released on December 1, 2019 (four months before the date of the symposium). The validation set (20%) will be provided on January 15, 2020 (two and a half months before the day of the on-site competition), and the remaining 20% (the test set, without labels) will be released on the day of the challenge. The released data will be divided proportionately per grade. The test set results will be used for comparison and analysis of the submitted results.

Evaluation metrics

For the disease grading tasks, i.e., Sub-challenge 1 and Sub-challenge 3, we will use the weighted kappa as the evaluation metric to determine the performance of the participating algorithms. Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two ratings (i.e., the ground truths and the submitted results). This metric typically ranges from 0 (random agreement between raters) to 1 (complete agreement between raters). If there is less agreement between the raters than expected by chance, the metric may go below 0. The quadratic weighted kappa is calculated between the expected/known scores and the predicted scores.

Results have 5 possible ratings: 0, 1, 2, 3, 4. The quadratic weighted kappa is calculated as follows. First, an N x N histogram matrix O is constructed, such that O_{i,j} corresponds to the number of images that have an actual rating of i and a predicted rating of j. An N x N matrix of weights, w, is calculated based on the difference between the actual and predicted rating scores, w_{i,j} = (i - j)^2 / (N - 1)^2.

An N x N matrix of expected ratings, E, is calculated, assuming that there is no correlation between the rating scores. It is computed as the outer product of the actual ratings' histogram vector and the predicted ratings' histogram vector, normalized such that E and O have the same sum.

From these three matrices, the quadratic weighted kappa is calculated as kappa = 1 - (sum over i,j of w_{i,j} O_{i,j}) / (sum over i,j of w_{i,j} E_{i,j}).
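The following short sketch implements the quadratic weighted kappa exactly as defined above (histogram matrix O, weight matrix w, expected matrix E); it is written here for illustration, and the linked Kaggle notebook remains the reference walkthrough.

import numpy as np

def quadratic_weighted_kappa(actual, predicted, num_ratings=5):
    """Quadratic weighted kappa between integer ratings in {0, ..., num_ratings - 1}."""
    actual = np.asarray(actual, dtype=int)
    predicted = np.asarray(predicted, dtype=int)
    N = num_ratings

    # O[i, j]: number of items with actual rating i and predicted rating j.
    O = np.zeros((N, N), dtype=float)
    for a, p in zip(actual, predicted):
        O[a, p] += 1

    # w[i, j]: quadratic penalty for disagreement between ratings i and j.
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    w = ((i - j) ** 2) / ((N - 1) ** 2)

    # E[i, j]: expected counts under independence of the two rating histograms,
    # normalized so that E and O have the same sum.
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    E = E / E.sum() * O.sum()

    return 1.0 - (w * O).sum() / (w * E).sum()

# Example: perfect agreement yields 1.0.
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))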

An example of how to calculate the quadratic weighted kappa metric can be found here: https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps

Result Submission

Submissions of results will be accepted as a .csv file with the headers Test Image No. and DR Grade.
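A minimal sketch of writing such a file is shown below. The header names follow the wording in this description, and the image IDs, grades, and output file name are hypothetical; confirm the exact format against the official sample submission.

import pandas as pd

# Hypothetical predictions: image ID -> DR grade (0-4).
predictions = {"001": 2, "002": 0, "003": 4}

submission = pd.DataFrame(
    {"Test Image No.": list(predictions.keys()),
     "DR Grade": list(predictions.values())}
)
submission.to_csv("submission.csv", index=False)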

Also, participants need to submit a Docker container including the training and testing code.

Link to submission instructions: https://www.docker.com/

Sample instruction: https://drive.google.com/file/d/18B7GVE_KE9COcS8KnwSZzfoaxoWjmRHB/view

Performance Evaluation

This challenge evaluates the image classification performance of the submitted algorithms against the grades provided by the experts.

Remarks

For all sub-challenges, participants may also use other datasets to train or develop their models, as long as this is clearly stated in the submitted paper (see the rules for data usage).