
Lightning DataModule

Datasets in PyTorch, Lightning, and deep learning research in general have 4 main parts:

  1. A train split + dataloader

  2. A val split + dataloader

  3. A test split + dataloader

  4. A step to download, split, etc…

Step 4 also needs special care to make sure it's only done on one GPU in a multi-GPU set-up. In addition, there are other challenges, such as models that are built using information from the dataset, like the image dimensions or the number of classes.

A datamodule simplifies all of these parts and integrates seamlessly into Lightning.

import pytorch_lightning as pl
from torch import nn

class LitModel(pl.LightningModule):

    def __init__(self, datamodule):
        super().__init__()
        # build the model using information exposed by the datamodule
        c, w, h = datamodule.size()
        self.l1 = nn.Linear(128, datamodule.num_classes)
        self.datamodule = datamodule

    def prepare_data(self):
        self.datamodule.prepare_data()

    def train_dataloader(self):
        return self.datamodule.train_dataloader()

    def val_dataloader(self):
        return self.datamodule.val_dataloader()

    def test_dataloader(self):
        return self.datamodule.test_dataloader()

DataModules can also be used with plain PyTorch:

from pl_bolts.datamodules import MNISTDataModule, CIFAR10DataModule

datamodule = CIFAR10DataModule(PATH)
train_loader = datamodule.train_dataloader()
val_loader = datamodule.val_dataloader()
test_loader = datamodule.test_dataloader()
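These loaders behave like any standard PyTorch DataLoader, so they drop straight into a hand-written training loop. A minimal sketch (the one-layer net, optimizer, and loss below are illustrative placeholders, not provided by the datamodule):

import torch
from torch import nn

# placeholder model/optimizer/loss sized for CIFAR-10 inputs (3 x 32 x 32)
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    optimizer.step()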

An advantage of this pattern is that you can parameterize the data of your LightningModule:

model = LitModel(datamodule=CIFAR10DataModule(PATH))
model = LitModel(datamodule=ImagenetDataModule(PATH))

Or even bridge to scikit-learn or NumPy datasets:

from sklearn.datasets import load_boston
from pl_bolts.datamodules import SklearnDataModule

X, y = load_boston(return_X_y=True)
datamodule = SklearnDataModule(X, y)

model = LitModel(datamodule)
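From here, training works like any other LightningModule. A minimal sketch (the Trainer flags are illustrative):

import pytorch_lightning as pl

# the model pulls its dataloaders from the datamodule,
# so fit() needs no extra data arguments
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model)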

DataModule Advantages

Datamodules have two advantages:

  1. You can guarantee that the exact same train, val and test splits can be used across models.

  2. You can parameterize your model to be dataset agnostic.

Example:

from pl_bolts.datamodules import STL10DataModule, CIFAR10DataModule

# use the same dataset on different models (with exactly the same splits)
stl10_model = LitModel(STL10DataModule(PATH))
stl10_cool_model = CoolModel(STL10DataModule(PATH))

# or make your model dataset agnostic
cifar10_model = LitModel(CIFAR10DataModule(PATH))

Build a DataModule

Use this to build your own consistent train, validation, test splits.

Example:

from torch.utils.data import DataLoader

from pl_bolts.datamodules import LightningDataModule

class MyDataModule(LightningDataModule):

    def __init__(self, data_dir):
        super().__init__()
        self.data_dir = data_dir

    def prepare_data(self):
        # download and do something to your data
        ...

    def train_dataloader(self, batch_size=32):
        return DataLoader(..., batch_size=batch_size)

    def val_dataloader(self, batch_size=32):
        return DataLoader(..., batch_size=batch_size)

    def test_dataloader(self, batch_size=32):
        return DataLoader(..., batch_size=batch_size)
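Once the placeholder datasets are filled in, a quick sanity check of the datamodule might look like this (PATH is an illustrative stand-in for your data directory):

dm = MyDataModule(PATH)
dm.prepare_data()

# pull one batch to verify the plumbing
batch = next(iter(dm.train_dataloader()))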

Then use this in any model you want.

Example:

class LitModel(pl.LightningModule):

    def __init__(self, data_module=MyDataModule(PATH)):
        super().__init__()
        self.dm = data_module

    def prepare_data(self):
        self.dm.prepare_data()

    def train_dataloader(self):
        return self.dm.train_dataloader()

    def val_dataloader(self):
        return self.dm.val_dataloader()

    def test_dataloader(self):
        return self.dm.test_dataloader()

DataModule class

class pl_bolts.datamodules.lightning_datamodule.LightningDataModule(train_transforms=None, val_transforms=None, test_transforms=None)[source]

Bases: object

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits and transforms across models.

Example:

class MyDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()

    def prepare_data(self):
        # download, split, etc...
        ...

    def train_dataloader(self):
        train_split = Dataset(...)
        return DataLoader(train_split)

    def val_dataloader(self):
        val_split = Dataset(...)
        return DataLoader(val_split)

    def test_dataloader(self):
        test_split = Dataset(...)
        return DataLoader(test_split)

A DataModule implements 4 key methods:

  1. prepare_data (things to do on 1 GPU, not on every GPU, in distributed mode).

  2. train_dataloader, the training dataloader.

  3. val_dataloader, the val dataloader.

  4. test_dataloader, the test dataloader.

This allows you to share a full dataset without having to explain the splits, transforms or download process.

classmethod add_argparse_args(parent_parser)[source]

Extends an existing argparse parser with the default LightningDataModule arguments.

Return type

ArgumentParser

classmethod from_argparse_args(args, **kwargs)[source]

Create an instance from CLI arguments.

Parameters
  • args (Union[Namespace, ArgumentParser]) – The parser or namespace to take arguments from. Only known arguments will be parsed and passed to the LightningDataModule.

  • **kwargs – Additional keyword arguments that may override ones in the parser or namespace. These must be valid DataModule arguments.

Example:

parser = ArgumentParser(add_help=False)
parser = LightningDataModule.add_argparse_args(parser)
args = parser.parse_args()
module = LightningDataModule.from_argparse_args(args)

classmethod get_init_arguments_and_types()[source]

Scans the LightningDataModule signature and returns argument names, types and default values.

Returns

(argument name, set with argument types, argument default value).

Return type

List with tuples of 3 values
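For example, you could enumerate a datamodule's constructor arguments like this (the loop is an illustrative sketch, not part of the API):

# each entry is (argument name, set of accepted types, default value)
for name, types, default in LightningDataModule.get_init_arguments_and_types():
    print(name, types, default)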

abstract prepare_data(*args, **kwargs)[source]

Use this to download and prepare data. In distributed settings (multi-GPU, TPU), this will only be called once. It is called before requesting the dataloaders.

Warning

Do not assign anything to the model in this step since this will only be called on 1 GPU.

Pseudocode:

model.prepare_data()
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()

Example:

def prepare_data(self):
    download_imagenet()
    clean_imagenet()
    cache_imagenet()
size(dim=None)[source]

Return the dimensions of each input, either as a tuple or a list of tuples.

Return type

Union[Tuple, int]
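A sketch of typical usage, assuming the datamodule reports image inputs as a (channels, height, width) tuple:

dims = datamodule.size()       # the full input shape as a tuple
channels = datamodule.size(0)  # a single dimension as an int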

abstract test_dataloader(*args, **kwargs)[source]

Implement a PyTorch DataLoader for testing.

Return type

Union[DataLoader, List[DataLoader]]

Returns

Single PyTorch DataLoader.

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

You can also return a list of DataLoaders.

Example:

def test_dataloader(self):
    dataset = MNIST(root=PATH, train=False, transform=transforms.ToTensor(), download=False)
    loader = torch.utils.data.DataLoader(dataset=dataset, shuffle=False)
    return loader
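The multi-loader variant mentioned in the Note above might look like this (a sketch; the second loader's batch size is arbitrary):

def test_dataloader(self):
    dataset = MNIST(root=PATH, train=False, transform=transforms.ToTensor(), download=False)
    loader_a = torch.utils.data.DataLoader(dataset=dataset, shuffle=False)
    loader_b = torch.utils.data.DataLoader(dataset=dataset, batch_size=64, shuffle=False)
    return [loader_a, loader_b]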
abstract train_dataloader(*args, **kwargs)[source]

Implement a PyTorch DataLoader for training.

Return type

DataLoader

Returns

Single PyTorch DataLoader.

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

def train_dataloader(self):
    dataset = MNIST(root=PATH, train=True, transform=transforms.ToTensor(), download=False)
    loader = torch.utils.data.DataLoader(dataset=dataset)
    return loader
abstract val_dataloader(*args, **kwargs)[source]

Implement a PyTorch DataLoader for validation.

Return type

Union[DataLoader, List[DataLoader]]

Returns

Single PyTorch DataLoader.

Note

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

You can also return a list of DataLoaders.

Example:

def val_dataloader(self):
    dataset = MNIST(root=PATH, train=False, transform=transforms.ToTensor(), download=False)
    loader = torch.utils.data.DataLoader(dataset=dataset, shuffle=False)
    return loader