7 Best practices for using vision with Claude

Vision allows for a new mode of interaction with Claude. We’ve compiled a few tips for getting the best performance on your images. Before we get to that, let’s first setup the code we need to run the notebook.

import base64
from anthropic import Anthropic
from IPython.display import Image
from pyhere import here


client = Anthropic()
MODEL_NAME = "claude-3-5-sonnet-20240620"

def get_base64_encoded_image(image_path):
    with open(image_path, "rb") as image_file:
        binary_data = image_file.read()
        base_64_encoded_data = base64.b64encode(binary_data)
        base64_string = base_64_encoded_data.decode('utf-8')
        return base64_string

7.1 Applying traditional techniques to multimodal

You can fix hallucination issues with traditional prompt engineering techniques like role assignment. Let’s see an example of this:

Suppose I want Claude to count the number of dogs in this image:

Image(filename=here('img/misc/nine_dogs.jpg'))

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": get_base64_encoded_image(here('img/misc/nine_dogs.jpg'))}},
            {"type": "text", "text": "How many dogs are in this picture?"}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

There are 9 dogs in this picture. The image shows a group of diverse dogs sitting together in a grassy field with flowers in the background. The dogs vary in size, color, and breed, including what appear to be Border Collies, a Belgian Malinois or similar shepherd breed, and some smaller terrier-type dogs. They are all facing the camera, creating a charming group portrait of canines in a natural setting.

There’s only 9 dogs but Claude thinks there is 10! Let’s apply a little prompt engineering and and try again.

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": get_base64_encoded_image(here('img/misc/nine_dogs.jpg'))}},
            {"type": "text", "text": "You have perfect vision and pay great attention to detail which makes you an expert at counting objects in images. How many dogs are in this picture? Before providing the answer in <answer> tags, think step by step in <thinking> tags and analyze every part of the image."}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

<thinking>
Let's carefully count the dogs in this image from left to right:

1. Far left: A black dog with a white chest
2. Next: A grey/brown dog
3. Third: A black and white dog, possibly a Border Collie mix
4. Fourth: A brown and white dog, looks like a Border Collie
5. Fifth: A dark grey dog with pointy ears, possibly a Belgian Malinois or similar breed
6. Sixth: A black and white dog, clearly a Border Collie
7. Seventh: Another black and white dog, also appears to be a Border Collie
8. Eighth: A black dog
9. Far right: A light brown/tan dog

I've carefully scanned the image multiple times to ensure I haven't missed any dogs in the background or foreground. All dogs are sitting in a line, making them easy to count.
</thinking>

<answer>
There are 9 dogs in this picture.
</answer>

Great! After applying some prompt engineering to the prompt, we see that Claude now counts correctly that there is 9 dogs.

7.2 Visual prompting

Images as input allows for prompts to now be given within the image itself. Let’s take a look at some examples.

In this image, we write some text and draw an arrow on it. Let’s just pass this in to Claude with no accompanying text prompt.

Image(filename=here("img/misc/circle.png"))

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here("img/misc/circle.png"))}},
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

This image shows a simple geometric shape - a circle. Inside the circle, there is a straight line segment extending from the center to the edge of the circle. This line segment is labeled with the letter "r", which typically stands for radius in geometry.

The radius is a key feature of a circle, as it represents the distance from the center point to any point on the circle's circumference. It's fundamental to many calculations involving circles, such as determining the circle's area or circumference.

This diagram is commonly used in mathematics, particularly in geometry lessons, to illustrate the basic components of a circle. It's a clear, minimalist representation that focuses on the essential element of the radius without any additional complexity.

As you can see, Claude tried to describe the image as we didn’t give it a question. Let’s add a question to the image and pass it in again.

Image(filename=here("img/misc/labeled_circle.png"))

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here("img/misc/labeled_circle.png"))}},
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

To find the area of the circle, we need to use the formula:

Area = πr²

Where:
π (pi) is approximately 3.14159
r is the radius, which is given as 12

Let's plug these values into the formula:

Area = π * 12²
    = π * 144
    = 3.14159 * 144
    ≈ 452.389 square units

Rounding to a more practical number of decimal places:

Area ≈ 452.39 square units

Therefore, the area of the circle with radius 12 is approximately 452.39 square units.

We can also highlight specific parts of the image and ask questions about it.

What’s the difference between these two numbers?

Image(filename=here('img/misc/table.png'))

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here('img/misc/table.png'))}},
            {"type": "text", "text": "What’s the difference between these two numbers?"}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

The difference between the two highlighted numbers is $36,948.

The first highlighted number is $315,880, which represents the net sales for North America for the twelve months ended December 31, 2022.

The second highlighted number is $352,828, which represents the net sales for North America for the twelve months ended December 31, 2023.

To calculate the difference:

$352,828 - $315,880 = $36,948

This difference indicates an increase in net sales for North America from 2022 to 2023 over the twelve-month period.

7.3 Few-shot examples

Adding examples to prompts still improves accuracy with visual tasks as well. Let’s ask Claude to read a picture of a speedometer.

Image(filename=here('img/misc/140.png'))

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here('img/misc/140.png'))}},
            {"type": "text", "text": "What speed am I going?"}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

Based on the speedometer shown in the image, the vehicle is traveling at approximately 130 miles per hour. The yellow needle is pointing just past the 120 mph mark on the outer ring of the speedometer. This is an extremely high and dangerous speed for most roads and conditions. I would strongly advise slowing down immediately for safety reasons, as driving at such high speeds is typically illegal and poses severe risks to the driver, passengers, and others on the road.

Claude’s answer doesn’t look quite right here, it thinks we are going 140km/hour and not 140 miles/hour! Let’s try again but this time let’s add some examples to the prompt.

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here("img/misc/70.png"))}},
            {"type": "text", "text": "What speed am I going?"}
        ]
    },
    {
        "role": 'assistant',
        "content": [
            {"type": "text", "text": "You are going 70 miles per hour."}
        ]
    },
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here("img/misc/100.png"))}},
            {"type": "text", "text": "What speed am I going?"}
        ]
    },
    {
        "role": 'assistant',
        "content": [
            {"type": "text", "text": "You are going 100 miles per hour."}
        ]
    },
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here("img/misc/140.png"))}},
            {"type": "text", "text": "What speed am I going?"}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list, 
    temperature=0
)
print(response.content[0].text)

Based on the speedometer shown in the image, you are going approximately 140 miles per hour. The yellow needle is pointing very close to the 140 mph mark on the outer ring of the speedometer. This is an extremely high and dangerous speed for most road conditions. I would strongly advise reducing your speed immediately for safety reasons if this is an actual current reading.

Perfect! With those examples, Claude learned how to read the speed on the speedometer. Note though that few-shot prompting with images doesn’t always work but it is worth trying on your use case.

7.4 Multiple images as input

Claude can also accept and reason over multiple images at once within the prompt as well! For example, let’s say you had a really large image - like an image of a long receipt! We can split that image up into chunks and feed each one of those chunks into Claude.

Image(filename=here('img/misc/receipt1.png'))

Image(filename=here('img/misc/receipt2.png'))

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here('img/misc/receipt1.png'))}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image(here('img/misc/receipt2.png'))}},
            {"type": "text", "text": "Output the name of the restaurant and the total."}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

The name of the restaurant is The Breakfast Club, and the total on the receipt is $78.86.

7.5 Object identification from examples

With image input, you can pass in other images to the prompt and Claude will use that information to answer questions. Let’s see an example of this.

Suppose we were trying to identify the type of pant in an image. We can provide Claude some examples of different types of pants in the prompt.

Image(filename='../images/best_practices/officer_example.png')

message_list = [
    {
        "role": 'user',
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image("../images/best_practices/wrinkle.png")}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image("../images/best_practices/officer.png")}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image("../images/best_practices/chinos.png")}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": get_base64_encoded_image("../images/best_practices/officer_example.png")}},
            {"type": "text", "text": "These pants are (in order) WRINKLE-RESISTANT DRESS PANT, ITALIAN MELTON OFFICER PANT, SLIM RAPID MOVEMENT CHINO. What pant is shown in the last image?"}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=message_list
)
print(response.content[0].text)

The last image shows a person wearing light gray wool dress pants or trousers paired with brown leather dress shoes or loafers. Based on the texture and drape of the fabric, these appear to be the Italian Melton Officer pants that were shown in the second product image.