Motivation & Contribution

Conventional CV models have been trained only on a predefined set of categories. This approach scales poorly: every new category requires collecting new labeled data, which limits generality.

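CLIP is trained on image captions, using a dataset of 400 million image-text pairs collected from the web. This yields a multimodal model that can be steered with natural language, which makes zero-shot transfer possible.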

Approach

Natural Language Supervision

To make supervised learning possible, natural language itself is used as the label. This removes the need for additional labeling, and it also lets the model learn image and text representations jointly.

Creating a Sufficiently Large Dataset

MS-COCO, Visual Genome : High quality, small
YFCC100M : 100M photos, varying quality
Dataset collected for CLIP : 400M image-text pairs

Selecting an Efficient Pre-Training Method

Within a batch, training maximizes the similarity of each matching image-text pair and minimizes the similarity of every other pairing.
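The batch-level objective above can be sketched as a symmetric cross-entropy over the image-text similarity matrix. This is a minimal NumPy sketch, not the authors' code: the function names, the fixed temperature of 0.07, and the random toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # temperature is an assumed fixed value for this sketch.
    # L2-normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity between image i and text j.
    logits = image_emb @ text_emb.T / temperature

    # Matching pairs sit on the diagonal; every other entry is a negative.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Toy embeddings standing in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
aligned = contrastive_loss(img, img.copy())         # text i matches image i
shuffled = contrastive_loss(img, img[::-1].copy())  # pairs are mismatched
```

In this toy setup the loss is lower when each text embedding lines up with its own image than when the pairs are shuffled, which is the behavior the objective rewards.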

Choosing and Scaling a Model

An encoder must be chosen for both images and text.

For the image encoder, two architectures were used: (1) a ResNet-D variant similar to ResNet50 and (2) ViT with almost no modification.

The text encoder is a Transformer.

Experiments

Using CLIP for zero-shot transfer

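Each class name is inserted into a prompt such as “A photo of a {class}”, and the class whose prompt is most similar to the image is chosen directly.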

Adding a qualifying phrase can improve performance (e.g., “A photo of a {class}, a type of pet”).
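The prompt-based selection described above can be sketched as follows. This is only an illustration under stated assumptions: `build_prompts` and `zero_shot_classify` are hypothetical helper names, and the hand-written vectors stand in for the outputs of the text and image encoders, which are not run here.

```python
import numpy as np

def build_prompts(class_names, template="A photo of a {}"):
    # Fill each class name into the prompt template.
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_emb, text_embs, class_names):
    # Normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # one similarity score per class prompt
    return class_names[int(np.argmax(sims))], sims

classes = ["cat", "dog", "hamster"]
prompts = build_prompts(classes, "A photo of a {}, a type of pet")

# Hypothetical embeddings: in practice the prompts would go through the
# text encoder and the image through the image encoder.
text_embs = np.eye(3)
image_emb = np.array([0.1, 0.9, 0.2])

pred, sims = zero_shot_classify(image_emb, text_embs, classes)
print(prompts[1], "->", pred)
```

No classifier is trained for the new label set; changing `classes` or the template is enough to define a new task.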

Analysis of zero-shot CLIP performance

Comparison of CLIP and ResNet50

Examples

Reference

https://arxiv.org/pdf/2103.00020.pdf

https://inforience.net/2021/02/09/clip_visual-model_pre_training/

https://simonezz.tistory.com/88

https://distill.pub/2021/multimodal-neurons/

https://greeksharifa.github.io/computer vision/2021/12/19/CLIP/

License: Attribution, NonCommercial, NoDerivatives
