Using Google Gemini with Vertex AI – text and image recognition made easy

This article gives a brief introduction to Google Gemini and explains how to explore AI models and use them professionally with Vertex AI.

Google Gemini – The multimodal generative AI for speech, text and image

Google Gemini was released in December 2023 as a response to OpenAI's powerful GPT models. It can be used professionally for your own applications via the AI platform Vertex AI. Gemini is also the new basis for the public chatbot Google Bard.

What can Google Gemini do?

As a multimodal generative AI model, Gemini can combine a variety of input and output formats. These include text, images, video and speech. Gemini's performance is currently being compared with OpenAI's ChatGPT in numerous benchmarks and is constantly being optimized. The available demonstration videos are impressive and make it clear that Gemini is significantly advancing the world of AI. The showcase below demonstrates how Gemini handles complex image-based tasks. To try it out directly, all you need is a Google Cloud Platform account, which is free to create.

Vertex AI provides Google Gemini and many other AI models

Vertex AI is Google’s comprehensive, cloud-based AI platform. It can be accessed via the Google Cloud Platform:

The Vertex AI platform contains a complete set of tools for using and training your own AI models. This includes the extensive “Model Garden”, which contains over 100 different AI models from Google and other providers as well as open source models. In Vertex AI, you can try out these models, adapt them and integrate them into your own applications using code. Some of the models can be trained further for your own use cases through fine-tuning, for which you can use the scalable hardware of the Google Cloud Platform.

Overview: Vertex AI is part of the Google Cloud Platform. The AI platform provides AI models that can be customized and easily integrated into your own applications.

Google’s documentation provides a good overview of Vertex AI:

Overview: Which AI models are available in Google Vertex AI?

This list is just a small selection from the Vertex AI Model Garden. Many more models are available.

  • Gemini Pro: Best performing Gemini model with features for a wide range of tasks.
  • Gemini Pro Vision: Multimodal model designed for text, images, and videos across a wide range of tasks.
  • PaLM 2 Text Bison: Fine-tuned for natural language tasks such as classification, extraction, summarization, and content generation.
  • PaLM 2 Chat Bison: Designed for conducting natural conversations, suitable for chatbot applications.
  • Llama 2: Model from Meta for fine-tuning and deployment on Vertex AI.
  • Imagen for Image Generation and Editing: Specialized in generative AI for vision.
  • Chirp: A Universal Speech Model transcribing in over 100 languages.
  • Codey: Includes models for code completion, code generation, and code-related assistance.
  • Code Llama: Large language models for coding, offering state-of-the-art performance and zero-shot instruction following for programming tasks.
  • Falcon-instruct (PEFT): Model for fine-tuning and deploying with PEFT.
  • Stable Diffusion: Text-to-image diffusion models.
  • Stable Diffusion XL: Generates high fidelity images from text.
  • BERT: Neural network-based NLP technique for creating question answering systems and more.
  • BLIP2: Used for image captioning and visual-question-answering tasks.
  • T5-FLAN: T5 model with T5-FLAN checkpoint.
  • Dolly-v2-7b: Instruction-following large language model.
  • OpenLLaMA (PEFT): Fine-tune and deploy with PEFT.
  • Mistral-7B: Engineered for superior performance and efficiency.
  • BioGPT: Domain-specific language model pre-trained on biomedical literature.
  • Vicuna: Chat assistant trained on user-shared conversations.

How to use Google Gemini in Vertex AI

You don’t need to be a cloud professional to carry out the following steps. However, the Google Cloud Platform (GCP) is rather confusing due to the variety of tools available there, so you have to be careful.

Below we test Gemini with the following tasks:

  • Task 1 (text model): List Germany’s chancellors of the last 50 years
  • Task 2 (multimodal model): Recognize images, calculate prices

Step 1: Call up Google Cloud Platform

First, go to the GCP and register; this requires a payment method (credit card, PayPal). Using Google Cloud products normally incurs costs, but new users receive a $300 voucher. Even if you have already used this up, the following steps cost less than €1. Important: delete your models and notebooks after testing so that you do not incur ongoing costs.

https://cloud.google.com

Step 2: Call Vertex AI and activate APIs

Before you can use Vertex AI for the first time, the required APIs must be activated in the Google Cloud Platform for security reasons. Google has set up a wizard for this that activates all required APIs with a single click. Simply follow the prompt when you call up Vertex AI for the first time.

Step 3: Call up Vertex AI Studio

In Vertex AI, navigate to the Vertex AI Studio. There you can simply try out AI models without any coding experience. Here you have the choice between “Multimodal” (combined AI models such as Gemini), “Language” (text-based language models), “Vision” (image) and “Speech”. First select “Language”. In the overview, you can create a new prompt, reuse existing prompts and initiate further training.

Now let’s try out Gemini’s text and multimodal capabilities one after the other.

Task 1: Google Gemini with text prompt (“Answer questions”)

  • Click on “Text Prompt” under “Generate Text” to give the AI a simple question/answer task (a so-called “completion” task).

Settings:

  • Model: Gemini Pro
  • Region: Select the desired server location here. Data protection note: your data will be sent to servers in the corresponding country. There are currently no European locations available.
  • Temperature: This parameter controls the creativity of the model. The higher the temperature, the freer the answer can be. Select 0 if you prefer factual answers without embellishment. Note: This does not protect the model from hallucinations!
  • Token limit: If you want longer inputs and outputs, you can increase the value. Longer texts require more computing power and therefore cost more.
  • Prompt: Enter the prompt, for example: “List all chancellors (Kanzler) in Germany in the last 50 years.” and click on “Submit”
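
To make the temperature setting more tangible, here is a toy sketch of how temperature typically reshapes the probabilities of candidate next tokens. It is an illustration only and not Gemini's actual implementation: the scores are invented, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if temperature == 0:
        # Greedy limit: all probability goes to the highest-scoring candidate.
        best = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Invented scores for three candidate next tokens.
scores = [2.0, 1.0, 0.5]

for temperature in (0, 0.5, 1.0, 2.0):
    probabilities = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 2) for p in probabilities])
```

The lower the temperature, the more the top candidate dominates. Even at 0, the top-scoring candidate can still be factually wrong, which is why a low temperature does not rule out hallucinations.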

Result:

Google Gemini correctly lists the German chancellors, even including the respective term of office.

Task 2: Google Gemini with multimodal prompt (“Understand images”)

  • Click on a desired sample prompt under “Multimodal” to try out the example.
  • We are testing the more complex AI use case “Image question answering” here.
  • Further possibilities: Gemini can generate ad headlines or descriptions from videos or images. Of particular interest to developers: it can answer questions about uploaded images and return the answers in JSON format so that they can be used directly in code.
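
As a rough sketch of the JSON idea: if the prompt asks the model to answer in JSON, the reply text can be parsed with Python's standard library. The reply below is invented for illustration and is not real Gemini output, and `parse_json_reply` is a hypothetical helper. It assumes that a model may wrap its JSON in a Markdown code fence and handles that case defensively.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_json_reply(reply_text):
    """Parse a model reply that is expected to contain one JSON document."""
    text = reply_text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return json.loads(text)


# Invented reply for illustration only.
sample_reply = FENCE + 'json\n{"objects": ["apple", "banana"], "count": 2}\n' + FENCE

answer = parse_json_reply(sample_reply)
print(answer["objects"], answer["count"])
```

Because `json.loads` raises an error on malformed input, it is worth wrapping the call in error handling when the reply comes from a live model.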

Settings:

  • Select Model, Region, Temperature as before
  • Accept the prompt or customize it as desired
  • Click on “Submit”

Result:

The result is impressive. Gemini correctly returns the price of the Brazil nuts and converts it from 250 g to 1 kg. The model has therefore successfully performed the following steps on its own:

  • Recognize that we are interested in the Brazil nuts from picture 1
  • Recognize them in picture 2
  • Extract the price and quantity from the image
  • Translate the language from Spanish to English
  • Convert quantities and prices
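
The final conversion step is plain arithmetic. Here is a minimal sketch that uses an invented shelf price, since the actual price from the picture is not stated here. `price_per_kilogram` is a hypothetical helper.

```python
def price_per_kilogram(price, grams):
    """Scale the price of a pack weighing `grams` grams to the price of 1,000 g."""
    if grams <= 0:
        raise ValueError("grams must be positive")
    return price * 1000 / grams


# Assumed example: a 250 g pack priced at 2.50 (the currency does not matter here).
print(price_per_kilogram(2.50, 250))  # prints 10.0
```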

Here are the prompt and the images from this example once more.

Prompt: “What is the price of this for a kilogram?”
Picture 1
Picture 2

Code example: Integrating Google Gemini into your own applications

Conveniently, Google provides code for integrating the model in many languages, including Python and Java, with a single click in the Studio (“Get Code”, top right). If you want to test this quickly, you can create a Colab Enterprise notebook in the GCP (i.e. a Jupyter Notebook in the Google Cloud) and integrate and execute the code there directly. Simply click on “Open in Notebook”.

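Note: If you build an application outside the GCP, your code must first authenticate against your Google Cloud project (for example with a service account) before it can call the model. Models that you deploy yourself from the Model Garden additionally need an endpoint in Vertex AI, which your code then calls. Here is the Python code for installing the Vertex AI library and for calling Gemini.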

Installing the library with pip:

!pip install --upgrade google-cloud-aiplatform

Python code to query Google Gemini:

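import vertexai
from vertexai.preview.generative_models import GenerativeModel

# Outside a GCP notebook, set the project and region explicitly first, e.g.
# vertexai.init(project="your-project-id", location="us-central1")  # placeholder values


def generate():
    """Send the prompt to Gemini Pro and print the streamed answer."""
    model = GenerativeModel("gemini-pro")
    responses = model.generate_content(
        """List all chancellors ("Kanzler") in Germany of the last 50 years.""",
        generation_config={
            "max_output_tokens": 2048,  # upper limit for the length of the answer
            "temperature": 0,           # 0 = least creative, most deterministic
            "top_p": 1,
        },
        stream=True,  # return the answer in chunks while it is being generated
    )

    # Each streamed chunk contains a piece of the answer text.
    for response in responses:
        print(response.candidates[0].content.parts[0].text, end="")


generate()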
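
The multimodal example from Task 2 can be called in a similar way. The following is a sketch under assumptions and not the code generated by the Studio: the `gs://` image URIs are placeholders, `ask_about_images` and `run_price_example` are hypothetical helpers, and the model name `gemini-pro-vision` is assumed to correspond to Gemini Pro Vision from the model list. The SDK import sits inside `run_price_example` so that the helper can be read and tested without cloud access.

```python
def ask_about_images(model, image_parts, question, temperature=0):
    """Send images plus a question to a multimodal model and return the answer text."""
    responses = model.generate_content(
        [*image_parts, question],
        generation_config={"max_output_tokens": 2048, "temperature": temperature},
        stream=True,
    )
    # With stream=True the answer arrives in chunks; join them into one string.
    return "".join(chunk.text for chunk in responses)


def run_price_example():
    """Requires the Vertex AI SDK and an authenticated Google Cloud project."""
    from vertexai.preview.generative_models import GenerativeModel, Part

    model = GenerativeModel("gemini-pro-vision")
    images = [
        # Placeholder URIs: replace them with your own images in Cloud Storage.
        Part.from_uri("gs://your-bucket/picture1.jpg", mime_type="image/jpeg"),
        Part.from_uri("gs://your-bucket/picture2.jpg", mime_type="image/jpeg"),
    ]
    return ask_about_images(model, images, "What is the price of this for a kilogram?")
```

In a notebook inside the GCP, `print(run_price_example())` would then show the model's answer. Passing the model in as a parameter keeps `ask_about_images` independent of the SDK, so it can be tried with a stand-in object first.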

Conclusion: Vertex AI makes Google Gemini easy to use

The results of the Google Gemini model, especially for complex multimodal tasks, are convincing. Google Gemini is a powerful AI with impressive multimodal capabilities. The professional AI platform Vertex AI makes it pleasantly easy to try out and use AI models, and AI models of all kinds can be integrated into your own applications with little effort.