Introducing Gemini 1.5, Google’s next-generation AI model

Google recently unveiled Gemini 1.5, an improved version of its Gemini 1.0 model.

What makes this model stand out is its unmatched context window, which exceeds that of any other model currently on the market. It can handle over one million tokens in a single prompt, making it an invaluable tool for analyzing and organizing large text documents.

By default, Gemini 1.5 Pro supports a 128,000-token context window; however, selected enterprise and developer clients can use it in AI Studio and Vertex AI with up to 1 million tokens, providing unprecedented depth and breadth of analysis.

Compared with the 32,000-token capacity available to GPT-4 users, or even the extended 128,000-token limit for GPT-4 Turbo users, Google is significantly expanding how much data can be fed into an LLM.

A single 1-million-token prompt can accommodate roughly:

  • 1 hour of video
  • 11 hours of audio
  • More than 30,000 lines of code
  • Over 700,000 words
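Those figures line up with a common rule of thumb for English text of roughly 0.75 words per token. The ratio below is that general heuristic, not Gemini's actual tokenizer, so treat the numbers as a back-of-the-envelope check only:

```python
# Rough heuristic: ~0.75 English words per token. This is a common
# rule of thumb, NOT the ratio of Gemini's real tokenizer.
WORDS_PER_TOKEN = 0.75

def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given word count occupies."""
    return round(words / WORDS_PER_TOKEN)

print(words_to_tokens(700_000))  # 933333, within the 1M-token window
```

Under that assumption, 700,000 words comes out to roughly 933,000 tokens, which is consistent with Google's claim that it fits in a single 1-million-token prompt.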

OpenAI, meanwhile, recently unveiled Sora, a model that automatically generates videos of up to one minute from text prompts. Clearly, the AI race for multimodal data processing is intensifying.


Where Does Gemini 1.5 Fit into the LLM Market?

Google made this announcement one week after rebranding Bard and unveiling Gemini 1.0 Ultra as its most powerful multimodal model.

Gemini 1.5 represents an enormous improvement over its predecessor, the Gemini 1.0 Pro model that powered Bard.

1.5 Pro outperformed its predecessor on 87% of the benchmarks Google uses to assess large language models (LLMs), while matching the performance of 1.0 Ultra using less compute.

Ultra itself is worthy of recognition: it was the first model to outperform human experts on the Massive Multitask Language Understanding (MMLU) benchmark.

Sundar Pichai, CEO of Google and Alphabet, explained in a post on X that the next-generation model uses an approach called Mixture-of-Experts (MoE) to make training more efficient while producing higher-quality responses.

What exactly does the MoE approach involve? According to Google DeepMind CEO Demis Hassabis, while a conventional Transformer functions as one massive neural network, an MoE model is divided into smaller "expert" neural networks that work together.

“MoE models learn to activate only relevant expert pathways within their neural network, significantly increasing efficiency,” Hassabis noted in the blog post announcing the model.
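Hassabis's description can be sketched in a few lines of Python. The layer sizes, the top-2 routing, and the single linear map per expert below are illustrative assumptions for the sake of the sketch, not details Google has published about Gemini 1.5:

```python
import math
import random

random.seed(0)
HIDDEN, N_EXPERTS, TOP_K = 8, 4, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    hi = max(xs)
    es = [math.exp(x - hi) for x in xs]
    total = sum(es)
    return [e / total for e in es]

# Each "expert" is a small feed-forward block (here just one linear map);
# the router scores how relevant each expert is to the current token.
experts = [rand_matrix(HIDDEN, HIDDEN) for _ in range(N_EXPERTS)]
router = rand_matrix(N_EXPERTS, HIDDEN)

def moe_layer(token):
    """Route the token through only the TOP_K highest-scoring experts."""
    scores = softmax(matvec(router, token))
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i])[-TOP_K:]
    total = sum(scores[i] for i in top)
    out = [0.0] * HIDDEN
    for i in top:
        weight = scores[i] / total          # renormalized gate weight
        y = matvec(experts[i], token)       # only TOP_K experts actually run
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

print(len(moe_layer([random.gauss(0, 1) for _ in range(HIDDEN)])))  # 8
```

The efficiency claim falls out of the routing step: the model's capacity scales with all N_EXPERTS networks, but per-token compute scales only with the TOP_K that the router activates.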

This design allows Gemini 1.5 to grasp complex tasks quickly and accelerates training, while its context window holds one million tokens compared with only 128,000 for GPT-4 Turbo.

Additionally, the model can evaluate video content. In one demonstration, it spotted plot points across the 45-minute running time of Buster Keaton’s silent film “Sherlock Jr.”

Machel Reid, a research scientist at Google DeepMind, stated: “During one of our test scenarios we fed it complete source code, and it generated documentation for it – this was truly remarkable!”

“[In another instance], it responded accurately to questions regarding the 1924 film ‘Sherlock Jr.’ when given access to watch it from start to finish.”

The Multimodal AI Race

Gemini 1.5’s launch underscores the rapid pace of multimodal AI competition. Rather than waiting for OpenAI’s next move, Google is acting quickly on the opportunity presented by the Gemini release.

Google has been methodically building out its AI lineup since 2023. In May of that year, Pichai introduced Gemini as a new foundation model designed from the start for multimodality.

In December 2023, Google officially unveiled Gemini in three sizes: Ultra, Pro, and Nano. Gemini Pro was initially integrated into Bard, and by February 2024 Bard itself had been rebranded as Gemini.

Google also unveiled Imagen 2 that month, a text-to-image diffusion model that later served as the basis of ImageFX; it was joined by MusicFX, which converts text into music.

OpenAI’s focus on multimodality shows a similarly rapid pace. After releasing GPT-4 in March 2023, the lab added vision capabilities to ChatGPT with GPT-4V, which launched that September.

In October, ChatGPT Plus and Enterprise subscribers gained DALL-E 3 integration for image generation.

A month later, OpenAI announced GPT-4 Turbo at its DevDay developer conference in San Francisco, alongside text-to-speech capabilities and customizable GPTs. The trend continued into 2024, with the GPT Store coming online in January and Sora arriving the following month.

The ultimate objective appears to be the gradual integration of multimodal capabilities into each vendor’s flagship models.


The Bottom Line

Continuous development is essential to keeping pace with the ever-evolving multimodal AI market and remaining at the top.

While OpenAI remains dominant, Gemini 1.0 and Gemini 1.5 are signs that it could gradually lose that position.

The ultimate victor may depend on each company’s ability to sustain an ongoing cycle of improvement while winning support from both consumers and enterprises seeking greater operational efficiency.

© 2024 Nexus Article All Rights Reserved.