🩺 Curated collection of datasets for breast cancer segmentation

pablogiaccaglia/Breast-Cancer-Segmentation-Datasets

Breast-Cancer-Segmentation-Datasets

📙 Motivation

This repository collects several anonymized mammography datasets from various sources. What sets this collection apart is that, for each dataset, a curated subset has been selected for precise automatic segmentation applications, such as training deep learning architectures.

The rationale behind this choice is the following. While looking for publicly available data to train an enhanced U-Net, I found that most of the datasets available online have to be handled ad hoc, in terms of both file organization and file formats. I also noticed that the majority of the provided masks are either very imprecise (due to the use of automatic tools, as in the case of CBIS-DDSM) or approximate (the masks are simple ovals covering the cancer mass, as in the case of INbreast).

Such samples do not allow proper training of a neural network for automatic segmentation, so careful data cleaning is mandatory. This repository contains 5 datasets, whose details are described below.

Each dataset's folder contains the 'original' folder (as downloaded from the source website), the 'SELECTED' sample folders, and one or more Python scripts for reorganizing the original files into a more usable structure and removing duplicated/unpaired samples. Note that these scripts have already been applied and the resulting folders are contained within the 'original' folder.
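As an illustration of what detecting unpaired samples can look like, here is a minimal sketch. It assumes that an image and its mask share the same file stem and live in two separate folders; the function name `split_paired` is hypothetical and is not taken from the repository's scripts.

```python
from pathlib import Path


def split_paired(image_dir, mask_dir):
    """Return (paired stems, images without a mask, masks without an image).

    Assumption: an image and its mask share the same file stem,
    e.g. 'case_01.png' in both folders.
    """
    images = {p.stem for p in Path(image_dir).iterdir() if p.is_file()}
    masks = {p.stem for p in Path(mask_dir).iterdir() if p.is_file()}
    paired = sorted(images & masks)          # present in both folders
    images_only = sorted(images - masks)     # image has no mask
    masks_only = sorted(masks - images)      # mask has no image
    return paired, images_only, masks_only
```

The two "only" lists are the candidates for removal; reviewing them before deleting anything is the safer choice.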

Preprocessing steps have been applied to each dataset, apart from CDD-CESM, to produce the samples contained in the 'preprocessed' folders. The related code is in preprocessing.py. Finally, suitable samples have been selected manually.

🗄️ Datasets

BCDR

First Iberian wide-ranging annotated BREAST CANCER DIGITAL REPOSITORY (BCDR). The BCDR is a compilation of Breast Cancer anonymized patients' cases annotated by expert radiologists containing clinical data (detected anomalies, breast density, BIRADS classification, etc.), lesions outlines, and image-based features computed from Craniocaudal and Mediolateral oblique mammography image views.

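| Size | Original Image Format | Original Mask Format | Selected Size | Selected Image Format | Selected Mask Format |
|------|-----------------------|----------------------|---------------|-----------------------|----------------------|
| 485  | TIF                   | TIF                  | 72            | PNG                   | PNG                  |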

Folder structure

BCDR
│
├── original
│     │
│     ├ BCDR-Images-Original -> original patient folder with all the screenings
│     │
│     ├ BCDR-Masks-Original -> patient folder with all the masks, created from csv files through the refactorBCDR.py script
│     │
│     ├ BCDR-Original-Preprocessed-IMG -> preprocessed patient folder with all the png files of the screenings
│     ├ BCDR-Original-Preprocessed-MSK -> preprocessed patient folder with all the png files of the masks
│     │
│     └ csv -> contains csv files with clinical data (detected anomalies, breast density, BIRADS, etc.), lesion outlines, and image-based features
│           │
│           ├ bcdr_d01_outlines.csv
│           │
│           └ bcdr_d02_outlines.csv
│
├── BCDR-SELECTED-IMGS -> selected screenings
│
├── BCDR-SELECTED-MASKS -> selected masks
│
└── refactorBCDR.py -> script to reorganize files inside 'original' folder
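Since the BCDR masks are created from outlines stored in csv files, the following sketch shows one way a lesion outline could be rasterized into a binary mask. It assumes the outline is available as two whitespace-separated strings of x and y pixel coordinates; the actual column layout of the outline csv files and the logic of refactorBCDR.py are not described here, so `parse_points` and `polygon_to_mask` are hypothetical helpers written in pure Python for clarity.

```python
def parse_points(xs_field, ys_field):
    """Turn two whitespace-separated coordinate strings into (x, y) pairs."""
    xs = [float(v) for v in xs_field.split()]
    ys = [float(v) for v in ys_field.split()]
    if len(xs) != len(ys):
        raise ValueError("x and y coordinate lists differ in length")
    return list(zip(xs, ys))


def polygon_to_mask(points, width, height):
    """Rasterize a closed polygon into a height x width grid of 0/1.

    Uses an even-odd scanline fill, sampling each pixel at its center.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for row in range(height):
        y = row + 0.5
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
            # Half-open test so a vertex is never counted twice.
            if (y0 <= y < y1) or (y1 <= y < y0):
                crossings.append(x0 + (y - y0) * (x1 - x0) / (y1 - y0))
        crossings.sort()
        # Pixels between each pair of crossings lie inside the polygon.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(width):
                if left <= col + 0.5 < right:
                    mask[row][col] = 1
    return mask
```

For full-resolution mammograms, a library rasterizer would be much faster than this loop; the sketch is only meant to make the outline-to-mask step concrete.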

INbreast

The INbreast database is a mammographic database with images acquired at the Breast Centre of the Hospital de São João, Porto, Portugal. INbreast has a total of 115 cases (410 images), of which 90 cases are from women with both breasts (4 images per case) and 25 cases are from mastectomy patients (2 images per case). Several types of lesions (masses, calcifications, asymmetries, and distortions) are included. Accurate contours made by specialists are also provided in XML format.

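| Size | Original Image Format | Original Mask Format | Selected Size | Selected Image Format | Selected Mask Format |
|------|-----------------------|----------------------|---------------|-----------------------|----------------------|
| 410  | DICOM                 | XML                  | 64            | PNG                   | PNG                  |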

Folder structure

INbreast
│
├── original
│     │
│     ├ AllDICOMs -> original patient folder with all the screenings
│     │
│     ├ AllROI -> original patient folder with all the ROIs of the detected anomalies
│     │
│     ├ AllXML -> original patient folder with all the xml files of the masks
│     │
│     ├ MedicalReports -> folder containing the associated medical reports
│     │
│     ├ PectoralMuscle -> folder containing the manual annotation of the pectoral muscle boundary
│     │
│     ├ INBREAST-Original-Preprocessed-IMG -> preprocessed patient folder with all the png files of the screenings
│     │
│     ├ INbreast.xls -> contains a summary of the database, including the BIRADS classification
│     │
│     ├ INbreast.csv -> subset of the INbreast.xls file
│     │
│     ├ INbreast-Original-Preprocessed-IMG -> preprocessed patient folder with all the png files of the screenings
│     ├ INbreast-Original-Preprocessed-MSK -> preprocessed patient folder with all the png files of the masks
│     │
│     └ README.txt -> contains some info about the dataset
│
├── INBREAST-SELECTED-IMGS -> selected screenings
│
├── INBREAST-SELECTED-MASKS -> selected masks
│
└── refactorINBbreast.py -> contains methods for the preprocessing of INbreast data, which takes place in 'preprocessing.py'
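To make the XML contour format more concrete, here is a hedged sketch of reading contour points with the standard library. It assumes the annotation files are property-list style XML with an `Images` list, each holding `ROIs` whose `Point_px` entries are strings of the form "(x, y)". That layout, the inline `SAMPLE`, and the `read_contours` helper are assumptions for illustration and are not taken from the repository's code.

```python
import plistlib

# Hypothetical, heavily reduced annotation used only for this example.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Images</key>
  <array>
    <dict>
      <key>ROIs</key>
      <array>
        <dict>
          <key>Name</key><string>Mass</string>
          <key>Point_px</key>
          <array>
            <string>(10.0, 20.0)</string>
            <string>(30.5, 20.0)</string>
            <string>(30.5, 40.25)</string>
          </array>
        </dict>
      </array>
    </dict>
  </array>
</dict>
</plist>
"""


def read_contours(xml_bytes):
    """Return a list of (roi_name, [(x, y), ...]) tuples."""
    data = plistlib.loads(xml_bytes)
    contours = []
    for image in data.get("Images", []):
        for roi in image.get("ROIs", []):
            points = []
            for text in roi.get("Point_px", []):
                # Each point is stored as a string such as "(10.0, 20.0)".
                x, y = text.strip("()").split(",")
                points.append((float(x), float(y)))
            contours.append((roi.get("Name", ""), points))
    return contours
```

The resulting point lists can then be rasterized into PNG masks with any polygon-filling routine.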

CSAW-S

The CSAW-S dataset is a companion subset of CSAW, a large cohort of mammography data gathered from the entire population of Stockholm invited for screening between 2008 and 2015, which is available for research (Dembrower et al., 2019).
The CSAW-S subset contains mammography screenings from 172 different patients with annotations for semantic segmentation. The patients are split into a test set of 26 images from 23 patients and a training/validation set of 312 images from 150 patients.

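| Size | Original Image Format | Original Mask Format | Selected Size | Selected Image Format | Selected Mask Format |
|------|-----------------------|----------------------|---------------|-----------------------|----------------------|
| 338  | PNG                   | PNG                  | 152           | PNG                   | PNG                  |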

Folder structure

CSAW
│
├── original
│     │
│     ├ anonymized_dataset -> patient folders with all the screenings and annotations (tumors - expert 1)
│     │
│     ├ test_data
│     │     │
│     │     ├ anonymized_dataset -> patient folders with all the screenings
│     │     │
│     │     ├ annotator_1 -> annotations from expert 1
│     │     │
│     │     ├ annotator_2 -> annotations from expert 2
│     │     │
│     │     └ annotator_3 -> annotations from expert 3
│     │
│     ├ CSAW-Original-IMG -> result folder of refactorCSAW.py script
│     ├ CSAW-Original-MSK -> result folder of refactorCSAW.py script
│     ├ CSAW-Original-Mammary-Gland -> result folder of refactorCSAW.py script
│     │
│     ├ CSAW-Original-Preprocessed-IMG -> preprocessed patient folder with all the png files of the screenings
│     └ CSAW-Original-Preprocessed-MSK -> preprocessed patient folder with all the png files of the masks
│
├── CSAW-SELECTED-IMGS -> selected screenings
│
├── CSAW-SELECTED-MASKS -> selected masks
│
└── refactorCSAW.py -> script to reorganize files inside 'original' folder

CBIS-DDSM

This CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database along with ground truth validation makes the DDSM a useful tool in the development and testing of decision support systems. The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The images have been decompressed and converted to DICOM format.

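| Size | Original Image Format | Original Mask Format | Selected Size | Selected Image Format | Selected Mask Format |
|------|-----------------------|----------------------|---------------|-----------------------|----------------------|
| 2620 | DICOM                 | DICOM                | 521           | PNG                   | PNG                  |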

The dataset can be downloaded directly from the project website, but its directory structure is messy. Following the ideas explained here, the script in refactorCBIS.py reorganizes the DICOM files into a tidy layout. The resulting folders are available for download here.
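As a rough idea of what such a reorganization can involve, the sketch below flattens a nested download into a single folder with predictable file names. It assumes one top-level folder per case with the `.dcm` files buried in nested subfolders; both that layout and the `flatten_dicoms` helper are assumptions for illustration, not the logic of refactorCBIS.py.

```python
import shutil
from pathlib import Path


def flatten_dicoms(src_root, dst_root):
    """Copy every .dcm under src_root into dst_root as <case>_<index>.dcm.

    Assumption: each direct subfolder of src_root is one case, and
    dst_root is located outside src_root.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for case_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        # rglob walks the nested subfolders below the case folder.
        for index, dcm in enumerate(sorted(case_dir.rglob("*.dcm"))):
            target = dst_root / f"{case_dir.name}_{index}.dcm"
            shutil.copy2(dcm, target)
            copied.append(target.name)
    return copied
```

Copying rather than moving keeps the original download intact, which makes it easy to rerun the step if the naming scheme changes.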

Folder structure

CBIS
│
├── CBIS-CALC-SELECTED-IMGS -> selected screenings
│
├── CBIS-CALC-SELECTED-MASKS -> selected masks
│
├── CBIS-MASS-SELECTED-IMGS -> selected screenings
│
├── CBIS-MASS-SELECTED-MASKS -> selected masks
│
├── CBIS-Original-Calc-Preprocessed-Complete-IMG -> all calcification acquisitions preprocessed
│
├── CBIS-Original-Calc-Preprocessed-Complete-MSK -> all calcification acquisitions masks preprocessed
│
├── CBIS-Original-Mass-Preprocessed-Complete-IMG -> all mass acquisitions preprocessed
│
├── CBIS-Original-Mass-Preprocessed-Complete-MSK -> all mass acquisitions masks preprocessed
│
├── csv -> contains csv files with clinical data (detected anomalies, breast density, BIRADS, etc.), lesion outlines, and image-based features
│
├── handle_multi_tumor.py -> script to handle screenings with multiple masks
│
└── refactorCBIS.py -> script to reorganize files in the original folder
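When one screening comes with several lesion masks, a common option is to merge them into a single mask. The sketch below does this with a pixel-wise union; whether handle_multi_tumor.py uses this strategy is an assumption, and `merge_masks` is a hypothetical helper operating on plain nested lists so that it stays dependency-free.

```python
def merge_masks(masks):
    """Pixel-wise union of equally sized binary masks (nested lists of 0/1)."""
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    for mask in masks:
        if len(mask) != height or any(len(row) != width for row in mask):
            raise ValueError("all masks must have the same shape")
    # A pixel is foreground if it is foreground in any input mask.
    return [
        [int(any(mask[r][c] for mask in masks)) for c in range(width)]
        for r in range(height)
    ]
```

With array libraries the same union is a single logical-or over the stacked masks.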

CDD-CESM

This dataset is a collection of 2,006 high-resolution contrast-enhanced spectral mammography (CESM) images with annotations and medical reports. CESM is done using the standard digital mammography equipment, with additional software that performs dual-energy image acquisition. The images were converted from DICOM to JPEG and have an average size of 2355 x 1315 pixels. Manual segmentation annotations are provided for the abnormal findings in each image (CSV file).
Each image with its corresponding manual annotation (breast composition, mass shape, mass margin, mass density, architectural distortion, asymmetries, calcification type, calcification distribution, mass enhancement pattern, non-mass enhancement pattern, non-mass enhancement distribution, and overall BIRADS assessment) is compiled into 1 Excel file.

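| Size | Original Image Format | Original Mask Format | Selected Size | Selected Image Format | Selected Mask Format |
|------|-----------------------|----------------------|---------------|-----------------------|----------------------|
| 1003 | JPG                   | JPG                  | -             | PNG                   | PNG                  |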

Folder structure

CDD
│
├── original
│     │
│     ├ CDD-CESM
│     │     │
│     │     ├ Low energy images of CDD-CES
│     │     │
│     │     └ Subtracted images of CDD-CESM
│     │
│     └ Radiology_hand_drawn_segmentations_v2.csv -> manual segmentation annotations are provided for the abnormal findings in each image
│
├── CDD-SELECTED-IMGS -> selected screenings
│
├── CDD-SELECTED-MASKS -> selected masks
│
└── draw_segs.py -> script to draw segmentations from csv
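To illustrate the first step of drawing segmentations from a csv, here is a minimal sketch that groups annotated regions by image. It assumes the csv follows the VGG Image Annotator export convention, with a `#filename` column and a `region_shape_attributes` column holding a JSON shape description. The column names, the inline sample, and the `read_regions` helper are assumptions for illustration and do not reproduce draw_segs.py.

```python
import csv
import io
import json

# Hypothetical two-row sample in the assumed export format.
SAMPLE_CSV = '''#filename,region_shape_attributes
example_image.jpg,"{""name"":""polygon"",""all_points_x"":[10,40,40],""all_points_y"":[10,10,30]}"
example_image.jpg,"{""name"":""circle"",""cx"":100,""cy"":80,""r"":12}"
'''


def read_regions(csv_text):
    """Map each file name to the list of shape dictionaries annotated on it."""
    regions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        shape = json.loads(row["region_shape_attributes"])
        regions.setdefault(row["#filename"], []).append(shape)
    return regions
```

Each shape dictionary can then be drawn onto a blank canvas of the image's size to obtain a mask, with one drawing routine per shape type.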
