I trained a Stable Diffusion model

The innovation in generative AI is fascinating and scary in equal measure.

It’s fascinating because AI increases access and makes it faster to produce content. It empowers people to do things they couldn’t do before: for example, I’m not good at creating visual content, but I can now do so by prompting with text. It also reduces the cost of producing content.

Generative AI is scary because of hype cycles. When there’s this much buzz around a piece of technology, it’s likely to be a bubble. That said, I’ve come to appreciate bubbles as a kind of natural evolution. People see this fancy new piece of technology. They try a bunch of things. Some work and some don’t. The ones that work stick around and become businesses.

image

When I encounter a dichotomy like this one, I like to try it for myself. There’s no substitute for using the technology to solve a specific problem. Reading does not cut it because of bias. You are either too optimistic, or too dismissive, or both.

So today, we’re going to try something different. I’m going to try and solve a specific use case using generative AI.

Identifying a problem

The ideal problem to tackle with AI is one that is:

  1. Repetitive
  2. Takes a lot of time or costs a lot of money
  3. Does not require a high level of fidelity

We want repetitive tasks that cost a lot of money or take a lot of time because AI can provide a 10x improvement. We want problems that can be solved with low fidelity because content produced by AI is not always perfect (yet).

The problem

The first problem that came to mind was copywriting. Blogs for SaaS companies, product descriptions for e-commerce and social media posts all require written copy. This kind of content, if augmented with human input, is an ideal use case for AI (as demonstrated by Jasper AI’s revenue and recent funding round).

I decided not to choose this problem. There are far too many people attempting this already: Jasper AI, CopyAI and Lex are notable examples.

The second idea that came to mind was product photography. Shop online and you’ll see that every product has 4-5 different images. On top of that, e-commerce owners need lots of other content featuring their products.

image

I don’t think you can replace product photography. But I do think AI can help reduce the cost of it. I asked a friend how much it costs to shoot product photography:

On average, it costs £10 per product image.
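
To put that figure in context, here is a rough back-of-the-envelope calculation in Python. The £10 per image and the 4-5 images per product come from above. The catalogue size of 100 products is a made-up number, used purely for illustration.

```python
# Rough cost of shooting a product catalogue.
COST_PER_IMAGE_GBP = 10       # my friend's estimate
IMAGES_PER_PRODUCT = (4, 5)   # typical range seen on online shops


def photography_cost(num_products: int) -> tuple[int, int]:
    """Return the (low, high) cost in GBP for a catalogue of num_products."""
    low, high = IMAGES_PER_PRODUCT
    return (
        num_products * low * COST_PER_IMAGE_GBP,
        num_products * high * COST_PER_IMAGE_GBP,
    )


# Assumption: a hypothetical shop with 100 products.
low_cost, high_cost = photography_cost(100)
print(f"£{low_cost:,} to £{high_cost:,}")
```

For that hypothetical 100-product shop, the bill comes to somewhere between £4,000 and £5,000.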

Problem & solution

The problem statement:

Can AI help reduce the cost of producing product photographs?

The solution I have in mind:

  • A merchant or seller uploads a set of photographs
  • They describe what they want for their computer-generated images
  • AI generates the images

To set expectations, the intent here is to see how everything works and figure out if there is potential. I do not expect to have something that is usable.

Generative AI

To get our AI model to spit out images on demand, we need to fine-tune it for this use case.

Generative AI models like Stable Diffusion are trained on billions of images. At a high level, they are neural networks, which are loosely modelled on the human brain. Given a set of inputs, they produce an output. The innovation here is that the output is net new.

The process works roughly like this: a computer takes in billions of labelled images, processes them as “data” and learns to associate the labels with that “data”. When you feed the algorithm an input (also known as a prompt), it uses what it has learnt to produce a new image. This is an oversimplification of the process, but it works for now.

Off-the-shelf generative AI models work well for content that is not specific. They’re not great when you want content for a specific product.

For example, I generated the image below using DreamStudio (based on Stable Diffusion) with the prompt: “A male model listening to music with Bose headphones and running on the beach.”

image

Not quite what I was expecting. The product doesn’t actually appear in the image at all.
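
If you would rather run the same experiment in code than in DreamStudio, a minimal sketch using the open-source diffusers library could look like the following. This is not what I used above: the library choice, the function name, the output file name and the requirement for a CUDA GPU are all assumptions, and you need the diffusers and torch packages installed before calling it.

```python
PROMPT = (
    "A male model listening to music with Bose headphones "
    "and running on the beach."
)


def generate_image(prompt: str, model_id: str, out_path: str = "output.png") -> str:
    """Generate one image from a text prompt and save it to out_path."""
    # Imported here so this file loads even without the heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    # Assumption: a CUDA GPU is available; half precision keeps memory use down.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    image = pipe(prompt).images[0]  # the pipeline returns PIL images
    image.save(out_path)
    return out_path


# Example call (downloads several GB of weights on first run):
# generate_image(PROMPT, model_id="CompVis/stable-diffusion-v1-4")
```

The model ID in the example call is just one publicly hosted Stable Diffusion checkpoint. Any compatible one should work in its place.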

Training the model

To overcome this, you need to fine-tune an existing model like Stable Diffusion.

First, I needed a product to test this on. Initially, I considered generating product images for a Coke can, but I decided against it. It seemed too easy!

I decided to go with Bose Headphones. I trained the model using 13 images of Bose headphones.

Here’s the result of different prompts:

Prompt: A female model wearing Bose Headphones, listening to music and walking in the park.

image

Prompt: A male model listening to music with Bose Headphones and running on the beach.

image

Prompt: A dog wearing Bose Headphones in the park.

image

This feels pretty revolutionary. Whilst I don’t think the images are usable, these are new images that I was able to produce in a couple of hours.

The path to optimising these images is clear:

  1. It is recommended that you have 30-40 images for your use case. I was solving for speed and went with roughly half of that.
  2. If many of your images share similar features, you can expect to see these features in the output.
  3. Prompts are important. The more detail you provide, the better the image gets.
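
The first and third points can be sketched as a couple of small helpers. This is only an illustration: the function names, the list of file extensions and the "wearing" phrasing of the prompt are my own assumptions, not part of any fine-tuning tool.

```python
from pathlib import Path

RECOMMENDED_MIN_IMAGES = 30  # lower end of the recommended 30-40 range
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed file types


def count_training_images(folder: str) -> int:
    """Count image files directly inside folder (sub-folders are ignored)."""
    return sum(
        1
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def has_enough_images(count: int) -> bool:
    """True if the training set meets the recommended minimum."""
    return count >= RECOMMENDED_MIN_IMAGES


def build_prompt(subject: str, product: str, *details: str) -> str:
    """Join a subject, the product and any extra details into one prompt."""
    parts = [f"{subject} wearing {product}", *details]
    return ", ".join(parts) + "."


print(build_prompt("A female model", "Bose Headphones",
                   "listening to music", "walking in the park"))
```

Each extra detail passed to build_prompt is appended to the prompt, which makes it easy to experiment with more or less descriptive versions of the same request.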

With more effort, it feels like there is potential for AI to reduce the costs of content production.

The other thing that quickly became apparent was the size of the ecosystem forming around AI applications. Hugging Face lets you find and host ML models. Anyone can access them via an API. Replicate does the same. You might need a GPU (a processor well suited to ML workloads, usually rented in the cloud) to run your model. If you do, Google Colab lets you host your code and gives you access to free or paid GPUs.

Here’s a useful market map if you want to do more digging:

image

To close

Generative AI does seem to have immense potential. It helps people produce content they would otherwise be unable to produce. It reduces the cost of production.

It also changes how we build things. In software, we are used to things being deterministic (i.e. 1 + 1 = 2). This is not the case with generative AI. The output won’t always be the same. This requires us to think differently about user interfaces, engineering and product.

I’m excited for what this technology is going to bring.