Creative Content Generation: Learn to Use Gemini to Generate and Animate Images with Models like Imagen 4 and Veo 2

Welcome to the next chapter in our "Practical Applications" series! So far, we've focused on using Gemini for text-based tasks like conversation and research. Now, we'll dive into the exciting world of multimodal AI, where you can use Gemini to generate and animate visual content from simple text descriptions.

Alongside the Gemini models, Google offers dedicated generative media models for creative tasks: Imagen for high-quality image generation and Veo for high-fidelity video. This chapter focuses on image generation, with a practical code example, and covers the concepts behind animation and video generation more briefly.

Generating Images with the Gemini API and Imagen

Generating an image from a text prompt, often called "text-to-image," is a core capability of models like Imagen. The process is simple: you provide a detailed description of the image you want, and the model creates it for you. The key to getting a great result is the quality of your prompt.

Prompt Engineering for Image Generation

Just like with text generation, a good prompt for image generation is clear, descriptive, and specific. Consider including:

Subject: What is the main object or character in the image? (e.g., "A majestic golden retriever...")
Style: What artistic style should the image be in? (e.g., "...in the style of a watercolor painting.")
Setting: Where is the scene taking place? (e.g., "...running through a field of sunflowers.")
Mood/Lighting: What is the overall tone or lighting of the image? (e.g., "The sun is setting, casting a warm, golden light.")

Putting it all together, a great prompt might be: "A majestic golden retriever in the style of a watercolor painting, running through a field of sunflowers. The sun is setting, casting a warm, golden light."
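The four ingredients above can also be assembled programmatically, which helps when an app collects them from separate form fields. The following is a minimal sketch; `buildImagePrompt` and its field names are hypothetical helpers introduced here for illustration, not part of any Gemini or Imagen API.

```javascript
// Hypothetical helper: joins the prompt ingredients into one description.
// Only `subject` is required; the other parts are appended when present.
function buildImagePrompt({ subject, style, setting, mood }) {
  if (!subject || !subject.trim()) {
    throw new Error('A subject is required to build an image prompt.');
  }

  let scene = subject.trim();
  if (style && style.trim()) scene += ` in the style of ${style.trim()}`;
  if (setting && setting.trim()) scene += `, ${setting.trim()}`;
  scene += '.';

  // Mood/lighting is written as its own sentence after the scene.
  return mood && mood.trim() ? `${scene} ${mood.trim()}` : scene;
}

const examplePrompt = buildImagePrompt({
  subject: 'A majestic golden retriever',
  style: 'a watercolor painting',
  setting: 'running through a field of sunflowers',
  mood: 'The sun is setting, casting a warm, golden light.',
});
console.log(examplePrompt);
```

With these inputs the helper reproduces the example prompt given above. Leaving out `style`, `setting`, or `mood` simply yields a shorter prompt.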

Animating with Gemini (Conceptual)

Video generation does not follow the same simple request-and-response pattern used for text and images in this series, but the concept is similar. Models like Veo take a text prompt and generate a short video clip. The prompt might describe a scene, a character's actions, or a series of events.

As a developer, you can also approximate a simple animation by generating a sequence of related images and displaying them in quick succession. The result is not as seamless as the output of a true video model, but it demonstrates the underlying principle of using AI to create dynamic visual content.
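The frame-sequence idea can be sketched in two parts: deriving one prompt per frame from a shared scene description, and cycling through the resulting images on a timer. All names below (`buildFramePrompts`, `nextFrameIndex`, `playFrames`) are hypothetical, and the sketch assumes the images themselves have already been generated, one request per prompt.

```javascript
// Hypothetical helper: one prompt per frame, sharing the same scene and style
// so that consecutive images stay visually related.
function buildFramePrompts(scene, actions) {
  return actions.map(
    (action, i) =>
      `${scene}, ${action}. Frame ${i + 1} of ${actions.length}, consistent style and framing.`
  );
}

// Advances to the next frame and wraps around at the end of the sequence.
function nextFrameIndex(current, total) {
  if (total <= 0) throw new Error('At least one frame is required.');
  return (current + 1) % total;
}

// Shows the first frame immediately, then calls onFrame(url) at the given rate.
// Returns a function that stops the playback.
function playFrames(frameUrls, framesPerSecond, onFrame) {
  if (frameUrls.length === 0) throw new Error('At least one frame is required.');
  let index = 0;
  onFrame(frameUrls[index]);
  const timer = setInterval(() => {
    index = nextFrameIndex(index, frameUrls.length);
    onFrame(frameUrls[index]);
  }, 1000 / framesPerSecond);
  return () => clearInterval(timer);
}

const framePrompts = buildFramePrompts(
  'A golden retriever in a sunflower field, watercolor style',
  ['crouching before a jump', 'leaping into the air', 'landing on its front paws']
);
console.log(framePrompts);
```

In a React component, `playFrames` could be started inside a `useEffect`, with `onFrame` updating the image source held in state and the returned stop function used as the effect's cleanup. Independently generated images tend to differ in small details, so some flicker between frames should be expected.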

Building a Simple Image Generation App with React

To put this into practice, let's build a simple React application that allows a user to enter a text prompt and generate an image using the imagen-3.0-generate-002 model. The app will include a loading state and display the generated image.

import React, { useState } from 'react';

// Main App component for image generation
export default function App() {
  const [prompt, setPrompt] = useState('');
  const [imageUrl, setImageUrl] = useState(null);
  const [isLoading, setIsLoading] = useState(false);
  const [error, setError] = useState('');

  // Function to handle the image generation request
  const handleGenerateImage = async () => {
    if (!prompt.trim()) {
      setError("Please enter a prompt to generate an image.");
      return;
    }

    setIsLoading(true);
    setError('');
    setImageUrl(null);

    try {
      // The API call to the imagen-3.0-generate-002 model
      const base64Data = await callImagenAPI(prompt);
      if (base64Data) {
        setImageUrl(`data:image/png;base64,${base64Data}`);
      } else {
        setError("Failed to generate image. Please try a different prompt.");
      }
    } catch (err) {
      console.error('Image generation error:', err);
      setError("An error occurred. Please check the console for details.");
    } finally {
      setIsLoading(false);
    }
  };

  // The core function to make the API call to the Imagen model
  const callImagenAPI = async (userPrompt) => {
    // `instances` is a list of requests; each entry carries one prompt
    const payload = {
      instances: [{ prompt: userPrompt }],
      parameters: { sampleCount: 1 }
    };

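    // In the Canvas environment the key is injected automatically, so it stays empty here.
    // Anywhere else, supply your own Gemini API key and avoid shipping it in client-side code.
    const apiKey = "";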
    const apiUrl = `https://generativelanguage.googleapis.com/v1beta/models/imagen-3.0-generate-002:predict?key=${apiKey}`;

    const fetchOptions = {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(payload),
    };

    let response = null;
    let retries = 0;
    const maxRetries = 3;
    const initialDelay = 1000;

    // Exponential backoff for API calls
    while (retries < maxRetries) {
      try {
        const fetchResponse = await fetch(apiUrl, fetchOptions);
        if (!fetchResponse.ok) {
          throw new Error(`HTTP error! status: ${fetchResponse.status}`);
        }
        response = await fetchResponse.json();
        break;
      } catch (error) {
        retries++;
        if (retries >= maxRetries) {
          console.error("Max retries reached. Failed to call Gemini API.");
          throw error;
        }
        // Double the wait after each failure and add up to 1s of random jitter
        const delay = Math.round(initialDelay * Math.pow(2, retries - 1) + Math.random() * 1000);
        console.warn(`Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }

    if (!response || !response.predictions || response.predictions.length === 0) {
      return null;
    }

    return response.predictions[0].bytesBase64Encoded;
  };

  return (
    <div className="flex flex-col h-screen bg-gray-100 font-sans antialiased">
      <header className="bg-white shadow-md p-6 text-center">
        <h1 className="text-3xl font-bold text-gray-800">Creative Content Generator</h1>
        <p className="text-gray-600 mt-2">
          Generate an image from a text prompt using the Gemini API.
        </p>
      </header>

      <main className="flex-1 overflow-y-auto p-4 sm:p-6">
        <div className="max-w-3xl mx-auto">
          <div className="flex flex-col md:flex-row gap-4 mb-6">
            <input
              type="text"
              value={prompt}
              onChange={(e) => setPrompt(e.target.value)}
              placeholder="Enter a prompt (e.g., 'A futuristic city at sunset')"
              className="flex-1 p-3 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500 transition duration-200"
              disabled={isLoading}
            />
            <button
              onClick={handleGenerateImage}
              disabled={isLoading || !prompt.trim()}
              className="px-6 py-3 bg-blue-500 text-white rounded-lg font-semibold hover:bg-blue-600 transition duration-200 disabled:bg-blue-300"
            >
              {isLoading ? 'Generating...' : 'Generate Image'}
            </button>
          </div>

          {error && (
            <div className="bg-red-100 border border-red-400 text-red-700 px-4 py-3 rounded-lg mb-4" role="alert">
              <p>{error}</p>
            </div>
          )}

          <div className="flex justify-center items-center h-full">
            {isLoading && (
              <div className="animate-spin rounded-full h-16 w-16 border-t-2 border-b-2 border-blue-500"></div>
            )}
            {imageUrl && !isLoading && (
              <div className="bg-white p-4 rounded-lg shadow-lg">
                <img
                  src={imageUrl}
                  alt="Generated by Gemini"
                  className="rounded-lg max-w-full h-auto"
                />
              </div>
            )}
          </div>
        </div>
      </main>
    </div>
  );
}
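
The retry loop in `callImagenAPI` retries after every failure, including client errors such as a malformed request, which will fail again on each attempt. One possible refinement, sketched below, is to retry only on rate limiting and server-side errors, and to read the image type from the response rather than assuming PNG. The helper names are hypothetical, and the `mimeType` field on a prediction is an assumption about the response shape that should be checked against the current API reference.

```javascript
// Retry only when the failure is likely to be temporary:
// 429 (rate limited) or any 5xx (server-side) status.
function isRetryableStatus(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... plus random jitter.
// `attempt` starts at 1 for the first retry.
function backoffDelayMs(attempt, baseMs = 1000, maxJitterMs = 1000) {
  const jitter = Math.random() * maxJitterMs;
  return Math.round(baseMs * 2 ** (attempt - 1) + jitter);
}

// Builds a data URL from one prediction, falling back to PNG when the
// response does not state the image type (assumed field: mimeType).
function predictionToDataUrl(prediction) {
  if (!prediction || !prediction.bytesBase64Encoded) return null;
  const mimeType = prediction.mimeType || 'image/png';
  return `data:${mimeType};base64,${prediction.bytesBase64Encoded}`;
}

console.log(isRetryableStatus(400), isRetryableStatus(429), isRetryableStatus(503));
console.log(backoffDelayMs(3, 1000, 0));
console.log(predictionToDataUrl({ bytesBase64Encoded: 'AAAA', mimeType: 'image/jpeg' }));
```

Inside the loop, the `!fetchResponse.ok` branch could consult `isRetryableStatus(fetchResponse.status)` and rethrow immediately when it returns false, so that a bad request surfaces to the user without waiting through three attempts.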