By: [email protected]

Image classification is one of the most accurate and useful applications of ML/AI. This post is about what I think is one of the most effective and easiest ways to implement it.

Published 4/9/2024, 10:51:54 PM


HuggingFace - Zero Shot Image Classification

Introduction

It's been a while since my last post so I thought I'd write about something I've been working on recently.

Note as well: this post is the first to use an AI-generated header photo, which felt fitting. Judge accordingly.

I've been (trying to) work a lot with HuggingFace and their transformers library. Lots of people want to integrate machine learning into their projects, and in my opinion, HuggingFace is the best place to start. It's a great platform - see my previous post on them for more information as well.

One task that's pretty easy to integrate into a lot of projects, and that current models are pretty good at, is Image Classification.

It's a pretty powerful tool, however, it can be a bit daunting to get started with.

One of the scariest parts about it is the data and the training - although that may just be the software engineer in me. I'm not a data scientist, so while I can do the whole 'clean, preprocess, train, test' cycle, it's not super fast for me.

This is where Zero Shot Image Classification comes in.

What is 'Zero Shot'

Zero Shot Image Classification is simply a way to classify images without having to train a model on a specific dataset.

You just give the model an image and a list of labels and it will tell you which label it thinks is most appropriate.

You can steer the model a bit through how you word the labels, and it will do its absolute best to classify the image against whatever labels you give it. Keep in mind, though, that the model doesn't actually learn from the labels you pass it - each call is independent - so it will never match a model you've explicitly trained on your own data.

There's a lot more to it than that, but that's the applicable part for most people just trying to work with it.

For a more detailed look at Zero Shot, check out the HuggingFace docs and/or the 'Resources' section below.

Playground

So for this post and my personal testing, I've created a 'playground' of sorts for Zero Shot Image Classification. Feel free to view it here

Worth noting as well the HuggingFace docs have a pretty great tutorial I followed to get started with this.

Setup

To get started you'll generally want to:

  1. Install packages:
pipenv install pillow transformers torch
    • PyTorch is what I personally use, but you can use TensorFlow as well.
      • The question of PyTorch vs. TensorFlow is a fairly big one and I'm not going to get into it here - but it's definitely worth reading into a bit!
      • See "Resources" below for more information on PyTorch vs. TensorFlow.
    • You may need to install PyTorch separately for your platform - see "Resources" below for more information on that.
  2. Write a test script - see basic.py in the repo for an example, but a minimal version could be as little as:
from transformers import pipeline
from PIL import Image
import requests

# Load a CLIP model as a zero-shot image classification pipeline
model_name = "openai/clip-vit-large-patch14-336"
classifier = pipeline("zero-shot-image-classification", model=model_name)

# Download an example image (an owl) from Unsplash
url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640"
image_to_classify = Image.open(requests.get(url, stream=True).raw)

# Candidate labels - the model picks the best match from these
labels_for_classification = ["owl", "bird", "cat", "dog", "car"]
scores = classifier(image_to_classify, candidate_labels=labels_for_classification)

# Results come back sorted by score, highest first
for obj in scores:
    print(f"{obj['label']}: {obj['score']}")

If you run this it should give you something like:

owl: 0.9953024387359619
bird: 0.0046501727774739265
car: 2.145067810488399e-05
cat: 1.9642637198558077e-05
dog: 6.136761385278078e-06

Incredible! You've just done Zero Shot Image Classification!
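Worth noting: the pipeline normalizes the scores across your candidate labels (a softmax), so they always sum to roughly 1 and come back sorted best-first. You can sanity-check that against the example output above:

```python
# Scores copied from the example run above - normalized across the
# candidate labels, so they should sum to (very nearly) 1.0
scores = [
    ("owl", 0.9953024387359619),
    ("bird", 0.0046501727774739265),
    ("car", 2.145067810488399e-05),
    ("cat", 1.9642637198558077e-05),
    ("dog", 6.136761385278078e-06),
]

total = sum(s for _, s in scores)
print(f"total: {total:.6f}")

# The first entry is the model's best guess
best_label, best_score = scores[0]
print(best_label)
```

This matters in practice: a score is relative confidence among the labels you supplied, not an absolute "this is an owl" probability.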

Labeling

Pretty cool right? It raises some questions though:

  • How do you know what labels to give it?
  • How many labels should you give it?
  • What if you give it the wrong labels?

The list goes on. Labels are a big part of your success here.

Labeling Ideas

Before we get into the nitty gritty of labeling performance and strategies, I think it's useful to briefly talk about where you can get labels from.

This is a pretty big question depending on how you're using this tool, but here are a few ideas:

  • Hardcoded list
    • Just a list of labels you think are appropriate for your images.
    • This works best if you're using the model for a specific purpose and have a pretty good idea what you're classifying.
    • This can also help solve issues with label 'moderation' or 'curation' - you can control what labels are used and how they're used.
  • User generated
    • Depending on your project, you can generate labels from users either directly or indirectly.
      • This can look very different depending on your project. For a food app I worked on, for example, we were able to use the menu items input by restaurants to generate labels.
    • This can be a great way to get a lot of labels quickly and more importantly to easily figure out which labels are relevant to your problem set.
    • This does have some issues with moderation and curation - you'll need to figure out how to handle that - but all the standard issues with user generated content apply here.
  • Multiple label sets
    • Depending on your project and classification use, it may make sense to have multiple sets of labels.
      • e.g., 'food' or 'people' or 'animals' or whatever is appropriate for your project.
    • This can be a great way to get a general idea of what the image is and then refine your labels based on that.
    • This will make more sense in the next section.
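As a quick sketch of the 'multiple label sets' idea, you could keep one plain-text file per category and load them into a dict keyed by category name. The filenames here (animals.txt, food.txt) are just hypothetical examples:

```python
from pathlib import Path

def load_label_sets(directory):
    """Load one label set per .txt file in a directory.

    Each file holds one label per line; the filename (minus extension)
    becomes the category name, e.g. animals.txt -> "animals".
    """
    label_sets = {}
    for path in sorted(Path(directory).glob("*.txt")):
        labels = [line.strip() for line in path.read_text().splitlines() if line.strip()]
        label_sets[path.stem] = labels
    return label_sets
```

The dict keys then double as your category-level labels, which is exactly what the two-pass strategy in the next section needs.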

Labeling Strategies

I played around a lot with different labeling strategies and found it to be really useful for getting a better intuitive understanding of how the model works.

These aren't formal strategies or methodologies but just a few ways I thought about labeling images.

  1. Make a list of all the labels you can think of
    • That's it. Put it in a file - labels.txt - you can even ask ChatGPT for some appropriate labels to help fill it in.
  2. "Categories" - e.g., "animals", "cars", "food", etc.
    • The idea is you have a few files - animals.txt, cars.txt, food.txt, etc. - and you put labels in each file that are relevant to that category.
    • If you have a general idea what the image is, you can use the appropriate category file to label it more accurately than just a general list of labels.
  3. Combine all the lists into a massive list
    • Pretty self explanatory, but you can combine all the lists into one massive list and use that to label images.
  4. Classify the image twice
    • The first time you classify it, use the category names as labels
    • The second time you classify it, use the specific lists based on the highest scoring category from the first classification

View main.py in the repo for the actual code showing how I implemented these strategies in a little Python text program. Note that it isn't really split into methods like this - I've tried to draw some analogies to how I thought about it, but the script was more a progression of ideas and methods than a formal methodology.
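The 'classify twice' strategy (4) can be sketched as a small helper. The classifier is passed in as a callable with the same shape of output as the HuggingFace zero-shot pipeline, so any model works here; the category names and label sets are hypothetical:

```python
def classify_twice(classifier, image, label_sets):
    """Two-pass zero-shot classification.

    label_sets maps a category name to its specific labels,
    e.g. {"animals": ["owl", "cat"], "vehicles": ["car", "bus"]}.
    Pass 1 classifies against the category names (the dict keys);
    pass 2 classifies against only the winning category's labels.
    """
    # Pass 1: pick the best-scoring category
    category_scores = classifier(image, candidate_labels=list(label_sets))
    best_category = category_scores[0]["label"]

    # Pass 2: refine within that category's specific labels
    label_scores = classifier(image, candidate_labels=label_sets[best_category])
    return best_category, label_scores
```

The trade-off is two model calls per image, but each call sees a short label list, which (in my testing) is where the model does best.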

Test Test Test

So that all sounds great, but how do you know if you're doing it right?

How do these strategies compare? How do you know if you're getting better at labeling images?

Well I wrote a test script to help with that - also in the repo, test.py - I'll let you play with it yourself.

It gives general statistics on how these different labeling strategies compare across a set of hardcoded images. It's not perfect, but it's a good start and will give you an idea of how you can play with this.

Some takeaways I learned from this testing were:

  1. Specificity is hard
    • The more specific your labels are, the worse the model will perform.
    • If you can afford to keep your label count low and labels general, you'll probably have better results.
  2. Categories are useful
    • Using categories to classify images can be a great way to get started and then refine your labels.
    • It can also be a great way to get a general idea of what the image is and is maybe the more appropriate usage here.
  3. Combining lists is useful
    • Combining lists can be a great way to classify against a bunch of labels even despite the performance hit.
    • This can hurt performance and specificity though. Exactly how/when this occurs will depend on the images you're classifying, labels you're using, model you're using, etc.
  4. Classifying twice is useful
    • Classifying twice can be a great way to really get a good answer. Especially if you're concerned with accuracy/high confidence.
    • This can definitely be slow but can be great if you need really specific labels and answers.

Again the above is going to depend a lot on your images, labels, model, etc. but it's a good starting point for how to think about labeling images for Zero Shot Image Classification.
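If you want to roll a quick comparison yourself, the core of a test script like this is just top-1 accuracy per strategy over a small set of images with known answers. A minimal sketch, with the strategy left as a pluggable callable (not the actual code from test.py):

```python
def top1_accuracy(strategy, test_cases):
    """Fraction of test cases where the strategy's top label is correct.

    strategy: callable taking an image and returning a list of
              {"label", "score"} dicts sorted best-first.
    test_cases: list of (image, expected_label) pairs.
    """
    if not test_cases:
        return 0.0
    correct = 0
    for image, expected in test_cases:
        scores = strategy(image)
        if scores and scores[0]["label"] == expected:
            correct += 1
    return correct / len(test_cases)
```

Run it once per labeling strategy over the same test cases and you get a crude but honest head-to-head comparison.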

Is Training For Chumps?

Well, no - training is still pretty useful, and you'll probably find the limitations of Zero Shot pretty quickly. However, it's a great way to get started and see if image classification is generally worth it for you and your project.

I think Zero Shot is great for MVPs and for some projects entirely. Over time, if you're building a business off of your ML model or how it performs, you'll probably want to train it yourself.

However, for a lot of projects, Zero Shot is a great way to at least get started.

Worth noting as well, zero shot generally performs worse the more labels you give it and definitely the more specific said labels get. So, if you're trying to classify between 'cat' and 'dog' you're probably fine. If you're trying to classify between 'german shepherd', 'golden retriever', 'labrador', etc. you might have a bad time.

Conclusion

That's it!

Zero Shot Image Classification is a great way to get started with image classification and machine learning in general.

It's a great tool to have in your toolbox and can be a great way to get started with a project. It's also a bit of a gateway drug to wanting to train your own models and get more into the nitty gritty of machine learning.

Thanks for reading, please leave any comments or questions below!

Resources

Zero Shot Playground: A playground for experimenting with zero-shot classification.

HuggingFace Tutorial: A tutorial on zero-shot image classification using the Transformers library.

Install PyTorch:

PyTorch vs. TensorFlow:




Tags

HuggingFace

ML/AI

Python