Multimodal Prompts

The Gemini family of models is inherently multimodal, meaning it can reason across various types of input data, including text, images, videos, and audio. This capability allows you to build applications that go beyond simple text prompts. In this section, we'll dive into how to design effective multimodal prompts to get the best results, such as asking Gemini to analyze an image and generate a related text response, or to process a combination of text and video for a more complex task.

What are Multimodal Prompts?

Multimodal prompts allow you to combine different data types—like an image and a text query—into a single request to the model. Instead of just describing what you want in words, you can show the model what you mean. This is incredibly powerful for tasks like visual question answering, where you can ask a question about the content of an image, or for creative applications where you might want to generate a story inspired by a picture.

The API is designed to handle this by accepting an array of "parts" in the request body, where each part can be a text string or inline data (like a base64-encoded image).
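
For example, the raw JSON body for a request combining one image and one question has this shape (it is the same structure the Python example below builds programmatically):

{
  "contents": [
    {
      "role": "user",
      "parts": [
        { "inlineData": { "mimeType": "image/png", "data": "<base64-encoded image bytes>" } },
        { "text": "What is shown in this image?" }
      ]
    }
  ]
}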

Example: Analyzing an Image with a Text Prompt

Let's look at a simple Python example that sends an image and a text prompt to the Gemini API to ask a question about the image's content. The code reads a local image file and base64-encodes it before sending; replace the placeholder file path with the path to your own image.

import base64
import json
import os

import requests

def get_image_base64(image_path):
    """Encodes a local image file to a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def analyze_image_with_prompt(image_path, text_prompt):
    """
    Sends a multimodal prompt to the Gemini API with an image and text.
    
    Args:
        image_path (str): The local path to the image file.
        text_prompt (str): The text question to ask about the image.
    
    Returns:
        str: The generated text response from the model.
    """
    api_key = "" # The API key will be provided automatically in the Canvas environment
    api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-05-20:generateContent?key={api_key}"

    # Get base64 encoded image data
    base64_image_data = get_image_base64(image_path)
    
    payload = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {
                        "inlineData": {
                            "mimeType": "image/jpeg",  # Change mimeType as needed
                            "data": base64_image_data
                        }
                    },
                    {
                        "text": text_prompt
                    }
                ]
            }
        ]
    }
    
    headers = {
        "Content-Type": "application/json"
    }

    try:
        response = requests.post(api_url, headers=headers, data=json.dumps(payload))
        response.raise_for_status() # Raise an exception for bad status codes
        
        result = response.json()
        
        if result and result.get('candidates'):
            return result['candidates'][0]['content']['parts'][0]['text']
        else:
            return "No text was generated."
            
    except requests.exceptions.RequestException as e:
        return f"API request failed: {e}"

# --- How to use the function ---
# Replace 'path/to/your/image.jpg' with a real image file path.
# image_file = 'path/to/your/image.jpg'
# question = 'What is the main subject of this image and what is it doing?'
# response_text = analyze_image_with_prompt(image_file, question)
# print(f"Gemini's response: {response_text}")


Best Practices for Multimodal Prompts

  • Be Specific: Just like with text-only prompts, the more detail you provide about what you want the model to do with the image, the better the result will be.
  • Place Text After Images: For a single image, it's often a good practice to place the image part before the text part in the contents array. This helps the model process the visual input first, then apply the instructions from the text.
  • Use the Right MIME Type: Ensure the mimeType in your inlineData part accurately reflects the image file type (e.g., image/jpeg, image/png, image/webp); a small helper for this is sketched below.
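
Rather than hard-coding the MIME type, you can infer it from the file extension with Python's standard mimetypes module. A minimal sketch (the image/jpeg fallback is an arbitrary default for this example, not an API requirement):

import mimetypes

def guess_image_mime_type(image_path):
    """Returns the MIME type guessed from the file extension, defaulting to JPEG."""
    mime_type, _ = mimetypes.guess_type(image_path)
    return mime_type or "image/jpeg"

# Example: guess_image_mime_type("photo.png") returns "image/png"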