# StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery (ICCV 2021 Oral)

[Run this model on Replicate](https://replicate.ai/orpatashnik/styleclip)

Optimization: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/optimization_playground.ipynb)
Mapper: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/mapper_playground.ipynb)
Global directions Torch: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global_torch.ipynb)
Global directions TF1: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb)

<p align="center">
<a href="https://www.youtube.com/watch?v=5icI0NgALnQ"><img src='https://github.com/orpatashnik/StyleCLIP/blob/main/img/StyleCLIP_gif.gif' width=600 ></a>

Full Demo Video: <a href="https://www.youtube.com/watch?v=5icI0NgALnQ"><img src="https://img.shields.io/badge/-YouTube-red?&style=for-the-badge&logo=youtube&logoColor=white" height=20></a> ICCV Video <a href="https://www.youtube.com/watch?v=PhR1gpXDu0w"><img src="https://img.shields.io/badge/-YouTube-red?&style=for-the-badge&logo=youtube&logoColor=white" height=20></a>
</p>

![](img/teaser.png)

> **StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery**<br>
> Or Patashnik*, Zongze Wu*, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski <br>
> *Equal contribution, ordered alphabetically <br>
> https://arxiv.org/abs/2103.17249 <br>
>
>**Abstract:** Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping text prompts to input-agnostic directions in StyleGAN’s style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.

## Description
Official Implementation of StyleCLIP, a method to manipulate images using a driving text.
Our method uses the generative power of a pretrained StyleGAN generator, and the visual-language power of CLIP.
In the paper we present three methods:
- Latent vector optimization.
- Latent mapper, trained to manipulate latent vectors according to a specific text description.
- Global directions in the StyleSpace.

## Updates
**31/10/2022** Add support for global directions with a torch implementation

**15/8/2021** Add support for StyleSpace in the optimization and latent mapper methods

**6/4/2021** Add mapper training and inference (including a Jupyter notebook) code

**6/4/2021** Add support for custom StyleGAN2 and StyleGAN2-ada models, and also custom images

**2/4/2021** Add the global directions code (a local GUI and a Colab notebook)

**31/3/2021** Upload paper to arXiv, and video to YouTube

**14/2/2021** Initial version

## Setup (for all three methods)
For all the methods described in the paper, it is required to have:
- Anaconda
- [CLIP](https://github.com/openai/CLIP)

Specific requirements for each method are described in its section.
To install CLIP please run the following commands:
```shell script
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=<CUDA_VERSION>
pip install ftfy regex tqdm gdown
pip install git+https://github.com/openai/CLIP.git
```
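
As a quick sanity check (not part of the official scripts), the following one-liner should print the list of available CLIP models if the installation succeeded:
```shell script
# Sanity check: lists the CLIP model names that clip.load() accepts,
# confirming that the clip package and PyTorch were installed correctly.
python -c "import clip; print(clip.available_models())"
```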

## Editing via Latent Vector Optimization

### Setup

Here, the code relies on the [Rosinality](https://github.com/rosinality/stylegan2-pytorch/) pytorch implementation of StyleGAN2.
Some parts of the StyleGAN implementation were modified, so that the whole implementation is native pytorch.

In addition to the requirements mentioned before, a pretrained StyleGAN2 generator is needed. The code will attempt to download it automatically, or you can download it manually from [here](https://drive.google.com/file/d/1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT/view?usp=sharing).
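
If the automatic download fails, `gdown` (installed during setup) can fetch the file by its Drive ID. The target folder and filename below are assumptions; point the optimization script to wherever you place the weights:
```shell script
# Manual download of the pretrained StyleGAN2 generator (file ID taken from the link above).
# The output folder and filename are assumptions; adjust them to match the script's expected path.
mkdir -p pretrained_models
gdown "https://drive.google.com/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT" -O pretrained_models/stylegan2-ffhq-config-f.pt
```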

### Usage

Given a textual description, one can either edit a given image or generate a random image that best fits the description.
Both operations can be done through the `main.py` script, or the `optimization_playground.ipynb` notebook ([![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/optimization_playground.ipynb)).

#### Editing
To edit an image, set `--mode=edit`. Editing can be done either on a provided latent vector or on a random latent vector from StyleGAN's latent space.
It is recommended to adjust `--l2_lambda` according to the desired edit.
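
A hedged example invocation (`--mode` and `--l2_lambda` are documented above; `--description` and `--latent_path` are assumed flag names, so check `main.py` for the exact argument list):
```shell script
# Sketch of an editing run: --mode and --l2_lambda are documented above,
# while --description and --latent_path are assumed flag names (see main.py).
python main.py --mode=edit --l2_lambda 0.008 \
    --description "a person with a beard" \
    --latent_path path/to/inverted_latent.pt
```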

#### Generating Free-style Images
To generate a free-style image, set `--mode=free_generation`.
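
For example (again, `--description` is an assumed flag name; check `main.py` for the exact arguments):
```shell script
# Sketch of free-style generation from a random latent code;
# --description is an assumed flag name, check main.py for the exact arguments.
python main.py --mode=free_generation --description "a person with blue eyes"
```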

## Editing via Latent Mapper
Here, we provide the code for the latent mapper. The mapper is trained to learn *residuals* from a given latent vector, according to the driving text.
The code for the mapper is in `mapper/`.

### Setup
As in the optimization, the code relies on the [Rosinality](https://github.com/rosinality/stylegan2-pytorch/) pytorch implementation of StyleGAN2.
In addition to the StyleGAN weights, it is necessary to have weights for the facial recognition network used in the ID loss.
The weights can be downloaded from [here](https://drive.google.com/file/d/1KW7bjndL3QG3sxBbZxreGHigcCCpsDgn/view?usp=sharing).

The mapper is trained on latent vectors. It is recommended to train on *inverted real images*.
To this end, we provide the CelebA-HQ dataset, inverted by e4e:
[train set](https://drive.google.com/file/d/1gof8kYc_gDLUT4wQlmUdAtPnQIlCO26q/view?usp=sharing), [test set](https://drive.google.com/file/d/1j7RIfmrCoisxx3t-r-KC02Qc8barBecr/view?usp=sharing).
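
These files can also be fetched from the command line with `gdown` (installed during setup). The output filenames and folders below are assumptions; match them to the paths expected by the training options:
```shell script
# Download the ID-loss (facial recognition) weights and the e4e-inverted CelebA-HQ latents.
# File IDs are taken from the links above; the output folders and filenames are assumptions.
mkdir -p pretrained_models data
gdown "https://drive.google.com/uc?id=1KW7bjndL3QG3sxBbZxreGHigcCCpsDgn" -O pretrained_models/model_ir_se50.pth
gdown "https://drive.google.com/uc?id=1gof8kYc_gDLUT4wQlmUdAtPnQIlCO26q" -O data/train_faces.pt
gdown "https://drive.google.com/uc?id=1j7RIfmrCoisxx3t-r-KC02Qc8barBecr" -O data/test_faces.pt
```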

### Usage

#### Training
- The main training script is placed in `mapper/scripts/train.py`.
- Training arguments can be found at `mapper/options/train_options.py`.
- Intermediate training results are saved to `opts.exp_dir`. This includes checkpoints, train outputs, and test outputs.
  Additionally, if you have tensorboard installed, you can visualize tensorboard logs in `opts.exp_dir/logs`.

Note that:
- To resume a training, please provide `--checkpoint_path`.
- `--description` is where you provide the driving text.
- If you perform an edit that is not supposed to change "colors" in the image, it is recommended to use the flag `--no_fine_mapper`.

Example for training a mapper for the mohawk hairstyle:
```bash
cd mapper
python scripts/train.py --exp_dir ../results/mohawk_hairstyle --no_fine_mapper --description "mohawk hairstyle"
```
All configurations for the examples shown in the paper are provided there.

#### Inference
- The main inference script is placed in `mapper/scripts/inference.py`.
- Inference arguments can be found at `mapper/options/test_options.py`.
- Adding the flag `--couple_outputs` will save an image containing the input and output images side-by-side.

Pretrained models for various edits are provided. Please refer to `utils.py` for the complete list of links.
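
A hedged example run (apart from `--checkpoint_path` and `--couple_outputs`, the flag names and paths below are assumptions; see `mapper/options/test_options.py` for the exact arguments):
```shell script
# Sketch of mapper inference on the provided test latents with a downloaded checkpoint.
# --exp_dir and --latents_test_path are assumed flag names; the checkpoint path is hypothetical.
cd mapper
python scripts/inference.py \
    --exp_dir ../results/mohawk_inference \
    --checkpoint_path ../pretrained_models/mohawk.pt \
    --latents_test_path ../data/test_faces.pt \
    --couple_outputs
```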

We also provide a notebook for performing inference with the mapper: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/mapper_playground.ipynb)

## Editing via Global Direction

Here we provide a GUI for editing images with the global directions.
We provide both a Jupyter notebook [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb),
and the GUI used in the [video](https://www.youtube.com/watch?v=5icI0NgALnQ).
For both, the linear directions are computed in **real time**.
The code is located at `global_directions/`.

### Setup
Here, we rely on the [official](https://github.com/NVlabs/stylegan2) TensorFlow implementation of StyleGAN2.

It is required to have TensorFlow, version 1.14 or 1.15 (`conda install -c anaconda tensorflow-gpu==1.14`).

### Usage

#### Local GUI

To start the local GUI please run the following commands:

```shell script
cd global_directions

# input dataset name
dataset_name='ffhq'

# a pretrained StyleGAN2 model from the standard NVlabs implementation (https://github.com/NVlabs/stylegan2) will be downloaded automatically.
# a pretrained StyleGAN2-ada model can be downloaded from https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/ .
# for a custom StyleGAN2 or StyleGAN2-ada model, please place the model under the ./StyleCLIP/global_directions/model/ folder.

# prepare input data
python GetCode.py --dataset_name $dataset_name --code_type 'w'
python GetCode.py --dataset_name $dataset_name --code_type 's'
python GetCode.py --dataset_name $dataset_name --code_type 's_mean_std'

# preprocess (this may take a few hours).
# we precompute the results for StyleGAN2 on ffhq, and StyleGAN2-ada on afhqdog and afhqcat. For these models, the preprocessing step can be skipped.
python SingleChannel.py --dataset_name $dataset_name

# generate images to be manipulated
# this operation will generate and replace the w_plus.npy and .jpg images in the './data/dataset_name/' folder.
# if you want to keep the original data, please rename the original folder.
# to use custom images, please use the e4e encoder to generate latents.pt, place it in the './data/dataset_name/' folder, and add the --real flag while running this script.
# you may skip this step if you want to manipulate the real human faces we prepared in the ./data/ffhq/ folder.
python GetGUIData.py --dataset_name $dataset_name

# interactive manipulation
python PlayInteractively.py --dataset_name $dataset_name
```

As shown in the video, to edit an image it is required to write a _neutral text_ and a _target text_.
To operate the GUI, please do the following:
- Maximize the window size
- Double click on the left square to choose an image. The images are taken from `global_directions/data/ffhq`, and the corresponding latent vectors are in `global_directions/data/ffhq/w_plus.npy`.
- Type a neutral text, then press enter
- Modify the target text so that it contains the target edit, then press enter.

You can now play with:
- *Manipulation strength* - positive values correspond to moving along the target direction.
- *Disentanglement threshold* - a large value means a more disentangled edit: only a few channels are manipulated, so only the target attribute changes (for example, grey hair). A small value means a less disentangled edit: a large number of channels are manipulated, so related attributes change as well (such as wrinkles, skin color, glasses).

##### Examples:

| Edit | Neutral Text | Target Text |
| --- | --- | --- |
| Smile | face | smiling face |
| Gender | female face | male face |
| Blonde hair | face with hair | face with blonde hair |
| Hi-top fade | face with hair | face with Hi-top fade hair |
| Blue eyes | face with eyes | face with blue eyes |

More examples can be found in the [video](https://www.youtube.com/watch?v=5icI0NgALnQ) and in the paper.

##### Practice Tips:
In the terminal, for every manipulation, the number of channels being manipulated is printed (the number is controlled by the attribute pair (neutral, target) and the disentanglement threshold).

1. For a color transformation, 10-20 channels are usually enough. For a large structural change (for example, Hi-top fade), 100-200 channels are usually required.
2. For an attribute pair (neutral, target), if only a few channels (<20) are manipulated even at a low disentanglement threshold, it is usually not enough to perform the desired edit.

#### Notebook
Open the notebook in Colab and run all the cells. In the last cell you can play with the image.

`beta` corresponds to the *disentanglement threshold*, and `alpha` to the *manipulation strength*.

After you set the desired parameters, please run the last cell again to generate the image.

## Editing Examples

In the following, we show some results obtained with our methods.
All images are real, and were inverted into StyleGAN's latent space using [e4e](https://github.com/omertov/encoder4editing).
The driving text that was used for each edit appears below or above each image.

#### Latent Optimization

![](img/me.png)
![](img/ariana.png)
![](img/federer.png)
![](img/styles.png)

#### Latent Mapper

![](img/mapper_hairstyle.png)

#### Global Directions

![](img/global_example_1.png)
![](img/global_example_2.png)
![](img/global_example_3.png)
![](img/global_example_4.png)

## Related Works

The global directions we find for editing are directions in the _S space_, which was introduced and analyzed in [StyleSpace](https://arxiv.org/abs/2011.12799) (Wu et al.).

To edit real images, we inverted them into StyleGAN's latent space using [e4e](https://arxiv.org/abs/2102.02766) (Tov et al.).

The code structure of the mapper is heavily based on [pSp](https://github.com/eladrich/pixel2style2pixel).

## Citation

If you use this code for your research, please cite our paper:

```
@InProceedings{Patashnik_2021_ICCV,
    author    = {Patashnik, Or and Wu, Zongze and Shechtman, Eli and Cohen-Or, Daniel and Lischinski, Dani},
    title     = {StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2085-2094}
}
```