CLIP is trained on a huge amount of paired (image, caption) data, so we can assume that it has captured “general” knowledge about the relationship between natural images and text.
Consequently, CLIP is one of the most popular models for zero-shot image and text tasks. However, it does not work well out of the box on specific, “subtle” tasks. For instance, classifying the type of wallpaper defect shown in an image is hard for CLIP, because that task is not “general”.
We therefore need to fine-tune CLIP before applying it to such tasks. This can be done by following **Finetune like you pretrain: Improved finetuning of zero-shot vision models.**
Here are the steps.
conda env create -f environments.yaml
Create your dataset for classification.
The dataset must follow the structure below, with one directory per class.
train_dataset_dir
├── class0
├── class1
├── class2
...
├── classN-3
├── classN-2
└── classN-1
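A minimal sketch of how the class list could be derived from this layout, assuming each class is a direct subdirectory of train_dataset_dir. The helper names `list_classes` and `build_label_map` are hypothetical and not part of this repository; the actual loader may work differently.

```python
from pathlib import Path


def list_classes(train_dataset_dir):
    """Return the sorted names of the class subdirectories (hypothetical helper)."""
    root = Path(train_dataset_dir)
    # Only direct subdirectories count as classes; loose files are ignored.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def build_label_map(class_names):
    """Map each class name to an integer label, following the given order."""
    return {name: idx for idx, name in enumerate(class_names)}
```

Under this assumption, `len(list_classes(train_dataset_dir))` would give the number of classes, and sorting keeps the label assignment stable across runs.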
Open cfg.py and set train_dataset_dir to the path of your dataset
(e.g. /home/path/to/dset)
Open info/class_descriptions.yaml and write a description for every class, as follows:
"가구수정": separation of wallpaper that is attached next to furniture with compartments, a defect that occurs in places like built-in wardrobes or drawers
"걸레받이수정": gap that has occurred between the mopboard and the wallpaper, mopboard is a material used to connect the side of the wallpaper and the floor, and this describes a defect that has occurred around that area
"곰팡이": blue mold that occurs between the wallpaper surface or moldings and the wallpaper. The defect arises over time in damp conditions or due to water leakage
...
classK: description for classK
...
"틈새과다": excessive gaps that have occurred between the wallpaper surface and moldings
"피스": tearing that occurred in the wallpaper surface due to improperly installed screws
"훼손": damage to the wallpaper surface itself and the damage occurring between the wallpaper and moldings
Train!
python train.py
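Since every class needs a description, it can help to verify the YAML file against the class directories before launching the command above. The sketch below is only an illustration: `load_descriptions` and `find_missing_descriptions` are hypothetical helpers, not functions from this repository, and the loader assumes PyYAML is installed and that the file is a flat mapping from class name to description.

```python
def load_descriptions(yaml_path):
    """Load the class-name -> description mapping (assumes PyYAML is installed)."""
    import yaml  # imported lazily so the check below also works without PyYAML

    with open(yaml_path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def find_missing_descriptions(class_names, descriptions):
    """Return the class names that have no non-empty description."""
    missing = []
    for name in class_names:
        text = descriptions.get(name)
        # A missing key, a non-string value, or blank text all count as missing.
        if not isinstance(text, str) or not text.strip():
            missing.append(name)
    return missing
```

An empty result means every class directory name has a matching, non-blank entry in the YAML file.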
from torchvision import transforms


class CFG():
    # model
    batch_size = 1  # for now, the batch size is fixed to 1
    img_transform_size_W = img_transform_size_H = 512
    num_classes = -1  # computed automatically from train_dataset_dir
    label_smoothing = 0.1  # label smoothing factor
    sim_weight = 0.5  # weight applied to the similarity loss proposed in the CLIP paper
    fc_weight = 0.5  # weight applied to the classification loss

    # optimizer settings
    lr = 1e-6
    optim_betas = (0.9, 0.98)
    optim_eps = 1e-8
    optim_weight_decay = 0.05
    temperature = 1.072508  # exp(t) with t = 0.07, the value from the CLIP paper

    # dataset
    test_size = 0.2  # fraction of the data held out for validation
    train_transforms = transforms.Compose([
        # Resize takes (height, width); both are 512 here
        transforms.Resize((img_transform_size_H, img_transform_size_W)),
        transforms.TrivialAugmentWide(),
        transforms.ToTensor(),
    ])
    val_transforms = transforms.Compose([
        transforms.Resize((img_transform_size_H, img_transform_size_W)),
        transforms.ToTensor(),
    ])
    train_dataset_dir = "/path/to/training/dataset/of/yours"
    class_description_yaml_file = "/path/to/class_description/yaml/written/in/step4"
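One way to read sim_weight, fc_weight, and label_smoothing is as a weighted sum of two loss terms. The sketch below uses plain Python for illustration only. The function names are hypothetical, and both the weighted-sum form and the use of label smoothing on the classification term are assumptions rather than confirmed details of train.py.

```python
import math


def smoothed_cross_entropy(logits, target, label_smoothing=0.1):
    """Cross-entropy of softmax(logits) against a label-smoothed one-hot target."""
    n = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Smoothed target: eps / n on every class, plus (1 - eps) on the true class.
        q = label_smoothing / n + (1.0 - label_smoothing) * (i == target)
        loss -= q * lp
    return loss


def total_loss(sim_loss, fc_loss, sim_weight=0.5, fc_weight=0.5):
    """Assumed combination: weighted sum of similarity and classification losses."""
    return sim_weight * sim_loss + fc_weight * fc_loss
```

With the default weights of 0.5 each, the two terms contribute equally, so `total_loss(sim_loss, fc_loss)` is simply their average.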