Making Custom Datasets (TFRecords)

To get the best performance from TPUs, the package accepts only TFRecords datasets. You may already have a ready-made TFRecords dataset, or you may want to build one from your own image data. This section explains how to create TFRecords from your own image dataset.

Note -> To use the TFRecords dataset you create, make sure the dataset is set to public when you upload it to Kaggle.
Note -> You do not need the TPU turned on to make TFRecords files. Making TFRecords is a CPU computation.
Note -> It is better to make the TFRecords dataset on Google Colab (> 75 GB of disk space), as Kaggle Kernels have limited disk space (< 5 GB). Download the datasets when you are done, upload them to Kaggle as public datasets, and add them as input data in your Kaggle Notebooks.
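Since disk space is the main constraint mentioned in the notes above, it may help to check how much room you have before you start. The snippet below is a minimal sketch using only the Python standard library. The helper names free_gib and has_room and the REQUIRED_GIB value are made up for illustration and are not part of quick_ml.

```python
import shutil


def free_gib(path="."):
    """Return the free disk space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_room(path, required_gib):
    """Return True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # REQUIRED_GIB is a placeholder; estimate it from the size of your own dataset.
    REQUIRED_GIB = 5
    print(f"Free space: {free_gib('.'):.1f} GiB")
    print("Enough room:", has_room(".", REQUIRED_GIB))
```

Run it in the working directory where the TFRecords files will be written.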

Let's get started with the tfrecords_maker module of the quick_ml package.

Labeled Data

For labeled data, make sure that the dataset has the following structure ->

/ Data
         | - Class1
         | - Class2
         | - Class3
         | - ClassN

where Class1, Class2, .. , ClassN are the folders of images and also the classes of those images. The folder names serve as the labels for classification.

This is usually used for training and validation data.
However, it can also be used to create labeled Test Data.
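Before converting anything, you may want to confirm that your folder really follows the structure above. The sketch below counts the image files in each class folder using only the standard library. The function name count_images_per_class and the list of extensions in IMAGE_EXTS are assumptions for illustration, not part of quick_ml.

```python
import os

# Assumed set of image extensions; adjust it to match your own data.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(data_dir):
    """Map each class folder directly under `data_dir` to its number of image files."""
    counts = {}
    for name in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, name)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files at the top level
        counts[name] = sum(
            1
            for fname in os.listdir(class_dir)
            if os.path.splitext(fname)[1].lower() in IMAGE_EXTS
        )
    return counts
```

A class that shows up with a count of 0, or a class that is missing from the result, usually points to a misplaced folder.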


To make labeled data, there are 2 options.

  • Convert entire image folder to tfrecords file.

  • Split the Image Dataset folder in a specified ratio & make tfrecords files.

A) Convert entire image folder to tfrecords file

Start with the imports :-

     from quick_ml.tfrecords_maker import create_tfrecord_labeled
     from quick_ml.tfrecords_maker import get_addrs_labels

To create a tfrecords dataset file from the image folder, this is the function call :-

     create_tfrecord_labeled(addrs, labels, output_filename, IMAGE_SIZE = (192,192), num_parts = 1)

However, you first need the image addresses (addrs) and their labels (labels), shuffled together. This has been implemented for you in get_addrs_labels :-

     addrs, labels = get_addrs_labels(DATA_DIR)

where DATA_DIR is the path to the dataset folder with the structure described at the beginning of the Labeled Data section.

With addrs and labels in hand, call create_tfrecord_labeled as shown above, passing the output_filename you would like your output file to have.

Make sure that you save the Labeled TFRecord Format somewhere, as you will need it to read the data at a later stage. A good way to do this is to paste it into a Markdown cell below the code cell above. After uploading to Kaggle and making the dataset public, also add the Labeled TFRecord Format to the dataset description.
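To make the "addresses and labels, shuffled together" step in option A more concrete, here is a rough standard-library sketch of the idea. It is not the quick_ml implementation. The name collect_addrs_labels, the seed argument, and the choice to encode labels as integer indices into the sorted class names are all assumptions for illustration.

```python
import os
import random


def collect_addrs_labels(data_dir, seed=None):
    """Hypothetical stand-in: return shuffled (addrs, labels, class_names)."""
    class_names = sorted(
        name
        for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )
    pairs = []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(data_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            pairs.append((os.path.join(class_dir, fname), label))
    # Shuffle the pairs, not the two lists separately,
    # so that every address keeps its own label.
    random.Random(seed).shuffle(pairs)
    addrs = [addr for addr, _ in pairs]
    labels = [label for _, label in pairs]
    return addrs, labels, class_names
```

The point of the sketch is the paired shuffle: shuffling addrs and labels independently would silently mislabel the images.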


B) Split the Image Dataset Folder in a specified ratio & make tfrecords files

Start with the import :-

     from quick_ml.tfrecords_maker import create_split_tfrecords_data

To create two tfrecords datasets from the Image Dataset Folder, use the following line of code :-

     create_split_tfrecords_data(data_dir, outfile1name, output1folder, outfile2name, output2folder, split_size_ratio, num_parts1 = 1, num_parts2 = 1, IMAGE_SIZE = (192,192), zip = False)

     data_dir -> The path to the Dataset Folder, following the structure mentioned above.
     outfile1name + outfile2name -> The names of the two output files produced by the split.
     output1folder + output2folder -> The names of the output folders corresponding to the output files.
     split_size_ratio -> The ratio in which you would like to divide your dataset.
     num_parts1 + num_parts2 -> The number of parts to write outfile1name & outfile2name in. Default, 1
     IMAGE_SIZE -> The size to which all the images of your dataset are set in the tfrecords files. Default, (192,192)
     zip -> Whether to zip the output files. Default, False

Doesn't return anything. Stores the TFRecords file(s) on your disk, so ensure there is sufficient disk space.
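To get a feel for what a split ratio does to the number of examples in each output file, here is a small sketch. It assumes the ratio is the fraction of examples that goes into the first file. This page does not state how quick_ml interprets split_size_ratio, so treat that as an assumption. split_by_ratio is a made-up helper name.

```python
def split_by_ratio(items, ratio):
    """Split `items` into two lists; `ratio` is the fraction kept in the first one.

    Assumption: 0 < ratio < 1, and the ratio refers to the first part.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    cut = round(len(items) * ratio)
    return items[:cut], items[cut:]


if __name__ == "__main__":
    first, second = split_by_ratio(list(range(1000)), 0.8)
    print(len(first), len(second))
```

Shuffle the examples before splitting, otherwise the second part may contain only the last classes.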

Creation of Labeled TFRecords Dataset with CSV File

This feature lets you create a labeled TFRecords dataset from an unstructured dataset folder plus a csv file that lists the filenames and their corresponding labels.

CSV File Structure ->

                         Image       |       Id
                         ------------|----------------
                         File1.jpg   |       Class A
                         File2.jpg   |       Class B
                         File3.jpg   |       Class C
                         File4.jpg   |       Class A
                         ...         |       ...
                         FileN.jpg   |       Class N

Data Folder Structure ->

               / Data
                         | -> Image1.jpg
                         | -> Image2.jpg
                         | -> Image3.jpg
                         | -> .....
                         | -> ImageN.jpg
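Because the csv file and the data folder are maintained separately, it is easy for the two to drift apart. The sketch below reads a csv with the Image and Id columns shown above and reports any filename that is missing from the data folder. It uses only the standard library. find_missing_images is a hypothetical helper and is not part of quick_ml.

```python
import csv
import os


def find_missing_images(data_dir, csv_path, image_col="Image", label_col="Id"):
    """Return (rows, missing).

    rows    -> list of (filename, label) tuples read from the csv file
    missing -> filenames listed in the csv that do not exist in `data_dir`
    """
    rows, missing = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            fname, label = row[image_col], row[label_col]
            rows.append((fname, label))
            if not os.path.isfile(os.path.join(data_dir, fname)):
                missing.append(fname)
    return rows, missing
```

An empty missing list means every row of the csv file points to an existing image.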

To make Labeled TFRecords Dataset through this method, there are two options. 

a) create_tfrecords_from_csv

b) create_split_tfrecords_from_csv

Method a) : create_tfrecords_from_csv

First, do the necessary import.

     from quick_ml.tfrecords_maker import create_tfrecords_from_csv

After successful execution of the above statement, make the following function call.

     create_tfrecords_from_csv(data_dir, csv_path, outputfilename, IMAGE_SIZE = (192,192))

Arguments Description =>

    data_dir - The path of the directory containing the training data images.

    csv_path - The path of the csv file containing the image names and their corresponding labels.

    outputfilename - The name you would like the tfrecords output file to have.

    IMAGE_SIZE - The dimensions of the images to be written in the TFRecords Dataset. Default, (192,192)

Returns =>

    Stores the TFRecords file upon successful execution.

Method b) : create_split_tfrecords_from_csv

Begin with the necessary import.

     from quick_ml.tfrecords_maker import create_split_tfrecords_from_csv

After that, call the following function with the appropriate parameters.

     create_split_tfrecords_from_csv(DATA_DIR, csv_path, outfile1name, outfile2name, split_size_ratio, IMAGE_SIZE = (192,192))

Arguments Description =>

    DATA_DIR - The path of the directory containing the training image files.

    csv_path - The path of the csv file containing the image names and their labels.

    outfile1name - The name you would like the first output tfrecords file to have.

    outfile2name - The name you would like the second output tfrecords file to have.

    split_size_ratio - The ratio in which to divide the dataset.

    IMAGE_SIZE - The dimensions of the images to be written in the TFRecords Dataset. Default, (192,192)

Returns =>

    Stores 2 TFRecords files upon successful execution.

The imports and function calls used in the Labeled Data sections above, collected for reference :-

Convert an entire image folder to a tfrecords file

     from quick_ml.tfrecords_maker import create_tfrecord_labeled
     from quick_ml.tfrecords_maker import get_addrs_labels
     addrs, labels = get_addrs_labels(DATA_DIR)
     create_tfrecord_labeled(addrs, labels, output_filename, IMAGE_SIZE = (192,192), num_parts = 1)

Split an image folder into two tfrecords files

     from quick_ml.tfrecords_maker import create_split_tfrecords_data
     create_split_tfrecords_data(data_dir, outfile1name, output1folder, outfile2name, output2folder, split_size_ratio, num_parts1 = 1, num_parts2 = 1, IMAGE_SIZE = (192,192), zip = False)

Create tfrecords files from a csv file

     from quick_ml.tfrecords_maker import create_tfrecords_from_csv
     from quick_ml.tfrecords_maker import create_split_tfrecords_from_csv
     create_tfrecords_from_csv(data_dir, csv_path, outputfilename, IMAGE_SIZE = (192,192))
     create_split_tfrecords_from_csv(DATA_DIR, csv_path, outfile1name, outfile2name, split_size_ratio, IMAGE_SIZE = (192,192))

Unlabeled Data

For unlabeled data, make sure to follow the following structure.

/ Data

      | -> file1
      | -> file2
      | -> file3
      | -> file4
      | -> fileN

where file1, file2, file3, .. , fileN denote the unlabeled, uncategorized image files. The filenames serve as the Ids, which are paired with the images for identification.
This is usually used to create test data (unknown, unclassified).
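As an illustration of pairing each image with an Id taken from its filename, here is a small standard-library sketch. It is not the quick_ml implementation. The name collect_addrs_ids is made up, and using the full filename (extension included) as the Id is an assumption.

```python
import os


def collect_addrs_ids(data_dir):
    """Hypothetical stand-in: return (addrs, ids) for a flat folder of images.

    addrs -> full path of every file directly inside `data_dir`
    ids   -> the matching filename, used as the image's Id (assumption)
    """
    addrs, ids = [], []
    for fname in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, fname)
        if os.path.isfile(path):  # skip sub-folders
            addrs.append(path)
            ids.append(fname)
    return addrs, ids
```

Keeping the two lists in the same order is what lets you match a prediction back to its file later on.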

To make an unlabeled TFRecords dataset, you need create_tfrecord_unlabeled & get_addrs_ids from the tfrecords_maker module.

     from quick_ml.tfrecords_maker import create_tfrecord_unlabeled
     from quick_ml.tfrecords_maker import get_addrs_ids

First, obtain the image addresses (addrs) and image ids (ids) using get_addrs_ids.

     addrs, ids = get_addrs_ids(Unlabeled_Data_Dir)

Unlabeled_Data_Dir refers to the Dataset Folder which follows the structure of the unlabeled dataset.

After getting the addrs & ids, pass them to create_tfrecord_unlabeled, which makes the TFRecords Dataset for you.

     unlabeled_dataset = create_tfrecord_unlabeled(out_filename, addrs, ids, IMAGE_SIZE = (192,192))

Arguments Description =>

    out_filename - The name of the tfrecords output file.
    addrs - The addresses of the images in the data folder. (can be obtained using get_addrs_ids())
    ids - The ids of the images in the data folder. (can be obtained using get_addrs_ids())
    IMAGE_SIZE - The size you want each image to have in the TFRecords dataset. Default, (192,192).

Returns =>

    A TFRecords dataset whose examples have 'image' as the first field & 'idnum' as the second field.

The imports and function calls used in the Unlabeled Data section above, collected for reference :-

     from quick_ml.tfrecords_maker import create_tfrecord_unlabeled
     from quick_ml.tfrecords_maker import get_addrs_ids
     addrs, ids = get_addrs_ids(Unlabeled_Data_Dir)
     unlabeled_dataset = create_tfrecord_unlabeled(out_filename, addrs, ids, IMAGE_SIZE = (192,192))
