When you think about generative AI models, you probably think about the large language models (LLMs) that have made such a splash in recent years. However, generative AI itself dates back many decades, and LLMs are just the latest evolution. Alongside LLMs, many other kinds of generative AI models power different tools and use cases, such as the diffusion models used for image generation.
In this article, we’ll explain what generative AI models are and how they’re developed, and we’ll take a deeper dive into some of the most common generative AI models today—enough to give you a conceptual understanding of these models that will impress your friends and colleagues, without you needing to take a college course in machine learning.
Table of contents
- What is a generative AI model?
- How generative AI models work
- How are generative AI models developed?
- Types of generative AI models
- Conclusion
What is a generative AI model?
Generative AI models are a subset of artificial intelligence systems that specialize in creating new, original content that mirrors the characteristics of their training data. Through learning from patterns and relationships in data, these models can generate outputs like text, images, sounds, or videos that resemble the style, tone, and nuances of their source material. This capability positions generative AI at the heart of innovation, allowing for creative and dynamic applications across diverse fields by interpreting and transforming input data into novel creations.
How generative AI models work
Generative AI models function by leveraging a sophisticated form of machine learning algorithm known as a neural network. A neural network comprises multiple layers of interconnected nodes, each represented by a snippet of computer code. These nodes perform minor, individual tasks but collectively contribute to making complex decisions, mirroring the neuron functionality in the human brain.
To illustrate, consider a neural network tasked with distinguishing between images of pies and cakes. The network analyzes the image at a granular level, breaking it into pixels. At a very basic level, there will be different nodes in the network dedicated to understanding different pixels and groups of pixels. Maybe some will look at whether there are layers in the dessert, while others will determine if there’s frosting or a crust. The nodes each store information about the features of what pie vs. cake looks like, and whenever a new image comes into play, it’s processed through each and every node to output a final prediction.
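To make this concrete, here’s a minimal sketch (in Python with NumPy) of a tiny feedforward network of the kind just described. The “features,” layer sizes, and weights are all made up for illustration; a real network learns its weights from thousands of labeled images.

```python
import numpy as np

# A toy two-layer feedforward network. The "features" here stand in for
# pixel-level signals the real network would learn on its own; the weights
# are random for illustration, not trained values.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical input: 4 pixel-derived features (e.g., "has layers",
# "has frosting", "has crust", "is round"), each scored 0..1.
image_features = np.array([0.9, 0.8, 0.1, 1.0])

# Layer 1: 4 inputs -> 8 hidden nodes. Each node mixes all the inputs.
W1 = rng.normal(size=(8, 4))
b1 = np.zeros(8)
hidden = relu(W1 @ image_features + b1)

# Layer 2: 8 hidden nodes -> 1 output score.
W2 = rng.normal(size=(1, 8))
b2 = np.zeros(1)
score = sigmoid(W2 @ hidden + b2)

print(f"P(cake) = {score[0]:.2f}")  # near 1 means "cake," near 0 means "pie"
```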
In the context of generative AI, this principle extends beyond simple recognition to the creation of new, original content. Instead of merely identifying features, generative models use neural networks to understand the underlying patterns and structures of the data they’re trained on. This process involves complex interactions and adjustments within the neural network, guided by algorithms designed to optimize the creativity and accuracy of the generated output.
How are generative AI models developed?
The development of generative AI models involves a series of complex and interrelated steps, typically carried out by teams of researchers and engineers. These models, such as OpenAI’s GPT (generative pre-trained transformer) and similar architectures, are designed to generate new content that mimics the distribution of the data they were trained on.
Here’s a step-by-step breakdown of that process:
1 Data collection
Data scientists and engineers first determine the goals and requirements of their project, which guides them to collect a wide and appropriate dataset. They often use public datasets, which offer vast quantities of text or images for their needs. For instance, the training of ChatGPT (GPT-3.5) involved processing 570GB of data, equivalent to 300 billion words from public internet sources, including nearly all of Wikipedia’s content.
2 Model selection
Choosing the right model architecture is a critical step in developing generative AI systems. The decision is guided by the nature of the task at hand, the type of data available, the desired quality of the output, and computational constraints. Specific architectures, including VAEs, GANs, and transformer-based and diffusion models, will be discussed in more detail later in this article. At this stage, it’s important to understand that new models often start from a preexisting architecture framework. This approach leverages proven structures as a foundation, allowing for refinements and innovations tailored to the unique requirements of the project at hand.
3 Model training
The chosen model is trained using the collected dataset from the first step. Training generative AI models often requires a large amount of computing power, using special hardware like GPUs (graphics processing units) and TPUs (tensor processing units). While the training approach varies based on the model architecture, all models go through a process called hyperparameter tuning. This is where data scientists adjust certain performance settings to achieve the best results.
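As a rough illustration of hyperparameter tuning, here’s a sketch of a simple grid search. The `train_and_score` function and the specific settings are hypothetical stand-ins for a full training-and-evaluation run.

```python
from itertools import product

# Hypothetical stand-in for a full training run that returns a validation score.
def train_and_score(learning_rate, batch_size):
    # In a real project this would train the model and score it on held-out data.
    return 1.0 - abs(learning_rate - 3e-4) * 100 - abs(batch_size - 64) / 1000

# A simple grid search over two common hyperparameters.
learning_rates = [1e-4, 3e-4, 1e-3]
batch_sizes = [32, 64, 128]

best = max(
    product(learning_rates, batch_sizes),
    key=lambda config: train_and_score(*config),
)
print(f"Best settings: learning_rate={best[0]}, batch_size={best[1]}")
```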
4 Evaluation and fine-tuning
Finally, model performance is evaluated or tested in the real world. Evaluating generative AI models is a bit different from evaluating traditional machine learning models because generative AI creates an entirely new output, and the quality of this output tends to be subjective. Metrics differ based on what the model is creating, and evaluation techniques for generative AI typically include using human raters—and may employ the strategy of having generative AI models evaluate one another. Learnings from the evaluation stage are typically applied back into fine-tuning the model or even retraining it. After the model’s performance is validated, it’s ready for production.
Types of generative AI models
Building on our foundational knowledge of generative AI models and the neural networks that power them, we’re now set to dive into specific types of model architectures that have emerged since the early 2010s. We’ll explore each model’s unique strengths and weaknesses, as well as their practical applications.
Here’s a brief overview of the models we’ll be discussing:
- Variational autoencoders (VAEs) are adept at learning complex data distributions and are often used for tasks like image generation and editing.
- Generative adversarial networks (GANs) are known for their ability to create highly realistic images and have become popular in a variety of creative applications.
- Diffusion models are a newer class of models that generate high-quality samples through a process of gradually adding and then removing noise.
- Language models excel at understanding and generating human language, making them useful for applications like chatbots and text completion.
- Transformer-based models were initially designed for natural language processing (NLP) tasks but have been adapted for use in generative models due to their powerful ability to handle sequential data.
Let’s delve deeper into each of these architectures to understand how they work and where they can be best applied.
Variational autoencoders (VAEs)
Variational autoencoders were invented by Max Welling and Diederik P. Kingma in 2013. They rely on the fact that a neural network can encode the high-level concepts the model learns during the training step. This is sometimes referred to as a “compression” or “projection” of the raw data.
If a model looks at an image of a cake, for example, it might turn that into an encoding containing all of the image’s features—sprinkles, frosting color, spongy layers, etc. This encoding looks like a set of numbers that makes sense to the model but not to humans. It can be decoded by yet another neural network to try to re-create the original image—though it will have some gaps because the encoding is a compression. This type of model, with the encoder and decoder pieces working together, is called an autoencoder.
Variational autoencoders put a spin on the autoencoder idea to generate new outputs. When generating its encodings, a VAE produces a probability distribution for each feature instead of a single fixed number. After all, does whipped cream count as frosting? Sometimes yes; sometimes no.
It turns out that if you train a neural network to create these probabilistic encodings and train another neural network to decode them, you can get some pretty interesting results. The decoder can sample points in the variational encoding “space” and create entirely new outputs that will still appear realistic because they have preserved the probabilistic relationships of the training data.
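Here’s a minimal, untrained VAE sketch in PyTorch that shows the key structural idea: the encoder outputs a mean and variance (a distribution) for each latent feature, and the decoder turns a sampled point back into data. All layer sizes are illustrative choices, not from any particular system.

```python
import torch
import torch.nn as nn

# A minimal VAE sketch (untrained, shapes only) for 28x28 grayscale images.
class TinyVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        # The encoder outputs a *distribution* per latent feature:
        # a mean and a (log-)variance, rather than a single fixed number.
        self.to_mean = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        # Reparameterization trick: sample a point from the distribution.
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        return self.decoder(z), mean, logvar

vae = TinyVAE()
# The generative trick: sample a random point in latent space and decode it.
new_image = vae.decoder(torch.randn(1, 16))
print(new_image.shape)  # torch.Size([1, 784])
```

The last two lines are where generation happens: any point sampled from the latent space decodes into a new output that never appeared in the training data.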
Advantages and disadvantages
Variational autoencoders use unsupervised learning, which means that the model learns on its own from raw data without requiring humans to label different features or outcomes. Such models are especially successful at creating content that deviates slightly from the original. Because of how they work with encodings, they can also be given specific instructions based on features of the training data: “Show me a dessert that represents the perfect midpoint between cake and pie.” That said, VAEs optimize for likely outcomes, so they are unlikely to excel at creating very original or groundbreaking content.
One common complaint about VAEs is that they can produce noisy (i.e., blurry) images due to the fact that encoding and decoding involves compression, which introduces loss of information.
Use cases
Variational autoencoders work with all kinds of data, though they are primarily used to generate images, audio, and text. One interesting application is anomaly detection: In a dataset, VAEs can find the data points that deviate the most from the norm, because those points will have the highest reconstruction error—meaning they will be the furthest from the probabilities that the VAE has encoded.
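Continuing the TinyVAE sketch from above, anomaly detection reduces to a few lines: score each data point by how poorly the model reconstructs it, then flag the outliers. The two-standard-deviation threshold here is an arbitrary illustrative choice.

```python
import torch

# Anomaly detection with the TinyVAE sketch above: points the model
# reconstructs poorly are flagged as unusual.
def reconstruction_error(vae, x):
    with torch.no_grad():
        reconstruction, _, _ = vae(x)
    # Mean squared error between each input and its reconstruction.
    return ((x - reconstruction) ** 2).mean(dim=1)

data = torch.rand(100, 784)               # stand-in dataset
errors = reconstruction_error(vae, data)  # one score per data point
anomalies = data[errors > errors.mean() + 2 * errors.std()]
print(f"Flagged {len(anomalies)} potential anomalies")
```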
Generative adversarial networks (GANs)
Generative adversarial networks were developed by Ian Goodfellow in 2014. While neural networks had been able to generate images before that, the results tended to be blurry and unconvincing. The core question (and insight) behind GANs is this: What happens if you pit two neural networks against each other? One, called the generator, is taught to generate new content, while another, called the discriminator, is trained to know the difference between real and fake content.
The generator creates candidate images and shows them to the discriminator. Based on the feedback, the generator adjusts its weights, getting better and better at “fooling” the discriminator. Once it can fool the discriminator 50% of the time (as good as a coin toss between real and fake), the feedback training loop stops. The generator part of the GAN is then ready for evaluation and production.
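Here’s a sketch of that adversarial loop in PyTorch, using toy numeric data rather than images. The network sizes, learning rates, and step counts are illustrative placeholders, but the two alternating updates are the core of how every GAN trains.

```python
import torch
import torch.nn as nn

# A minimal GAN training-loop sketch on toy data, not a production recipe.
latent_dim, data_dim = 8, 32
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

real_data = torch.randn(1000, data_dim) + 3.0  # stand-in "real" distribution

for step in range(200):
    real = real_data[torch.randint(0, 1000, (64,))]
    fake = generator(torch.randn(64, latent_dim))

    # 1) Train the discriminator to tell real from fake.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to make the discriminator answer "real."
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```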
Since 2014, hundreds of variations of GANs have been developed for different use cases and to balance the inherent advantages and disadvantages of GANs.
Advantages and disadvantages
Generative adversarial networks, along with VAEs, initially sparked a lot of buzz around the potential of generative AI. They use unsupervised learning, so the model gets better on its own without researchers needing to tell it whether its outputs are good or bad. Generative adversarial networks also learn very quickly; compared to the solutions that existed when they were first released, they could get good results with much less training data—hundreds of images rather than thousands.
However, GANs generally struggle to create content that doesn’t resemble their training data—they are impersonators, not creators. And sometimes they can “overfit” their training data, such as when GANs trained on lots of cat memes produced cat images with garbled text in them.
Training a GAN is a challenge, since two networks must be balanced against each other. Issues can arise when the discriminator is too good, leading to training cycles that never end—or when the discriminator is not good enough, which leads to poor outcomes. GANs can also suffer from what’s called mode collapse, where they fail to produce diverse outputs because the generator learns a few ways to trick the discriminator and focuses on those strategies to the exclusion of others.
Use cases
Generative adversarial networks are used primarily to generate content that is very similar to the original. For example, they can produce convincing human faces or realistic photos of interiors or landscapes for use in stock photography or video games. They can also create images that have been altered in some way, such as changing an image from color to black and white or aging a face in an image. That said, not all GANs produce images. For example, some GANs have been used to produce text-to-speech output.
Diffusion models
Diffusion models also emerged in the mid-2010s, and a series of breakthroughs brought them to much stronger performance by the early 2020s. They power image-generation tools like DALL-E, Stable Diffusion, and Midjourney.
Diffusion models work by introducing Gaussian noise to an image, distorting it in a series of steps, and then training a model to reverse these steps and transform the “noisy” image into a clear one. (“Gaussian noise” just means the noise is randomly added using a bell curve of probabilities.)
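The forward (noise-adding) half of this process is simple enough to sketch directly. Here’s an illustrative version in Python with NumPy; the noise schedule values are typical choices for this kind of sketch, not a specification of any particular model.

```python
import numpy as np

# The "forward" half of diffusion: repeatedly blending an image with Gaussian
# noise. A real model is then trained to run this process in reverse.
rng = np.random.default_rng(0)

image = rng.random((64, 64))           # stand-in for a clear training image
betas = np.linspace(1e-4, 0.02, 1000)  # how much noise to add at each step

x = image
for beta in betas:
    noise = rng.normal(size=x.shape)   # Gaussian (bell-curve) noise
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * noise

# After enough steps, x is essentially pure noise; the generative model learns
# to predict and subtract that noise step by step, going the other way.
print(f"correlation with original: {np.corrcoef(image.ravel(), x.ravel())[0, 1]:.3f}")
```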
You can think of the noisy image as being kind of like the VAE encoding, and indeed VAEs and diffusion models are related. Training-data images of, say, key lime pie, will end up with pretty similar noisy versions. But even the same noisy image won’t be “denoised” to the same thing every time, because the model is making educated guesses along the way.
You might have already figured out where the generative part comes in. If you give the model a representation of the image in the noisy space, it will be able to denoise the image and come up with an entirely new, clear picture. It’s sort of like how the decoder samples from the encoding. But there’s one important difference: There hasn’t been any compression along the way. So there’s been no real loss of data, and the resulting image will be higher-quality.
Generative AI tools that go from a text prompt to an image do that with the help of a separate model that understands how something like a “unicorn-themed birthday cake” might map to different image features. The noisy version of those features is then reversed to reveal a clear picture.
Advantages and disadvantages
Diffusion models don’t compress the training data, so they manage to create very realistic, high-quality images. However, they take significantly more resources and time to train than other models. That said, the training itself is more straightforward because they avoid the mode collapse and other drawbacks of adversarial training. They also don’t suffer from the loss of data (and resulting lower-quality outputs) that VAEs have.
Use cases
Diffusion models are primarily used for image, sound, and video generation. There’s no inherent reason that they couldn’t be used to generate text as well, but so far, transformer-based models have been more effective for natural language.
Language models
A language model is any machine learning model that represents natural language probabilistically. The most well-known type of language model today is the large language model (LLM), which is trained on massive amounts of raw data and uses a transformer-based architecture to generate text. (More on transformers in the next section.)
Before transformer-based models, most state-of-the-art language models used recurrent neural networks (RNNs). Recurrent neural networks introduce small loops in the interconnections between the nodes so that in addition to learning from the present signals, as in a traditional feedforward neural network, nodes can also learn from the recent past. This is important for processing or generating natural language, like a stream of text or a voice input. Unlike images, language is highly contextual—how we interpret it depends on what has come before.
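Here’s a minimal sketch of that recurrent loop: a single hidden state is carried forward word by word, so each new word is interpreted in light of what came before. The weights and word vectors are random for illustration; a trained RNN learns them from data.

```python
import numpy as np

# A minimal recurrent step: the hidden state carries information from
# earlier words forward, so each new word is read in context.
rng = np.random.default_rng(0)
hidden_size, embed_size = 16, 8

W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # past -> present
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # input -> present

def rnn_step(hidden, word_vector):
    # The new state mixes the previous state (the "recent past") with the new input.
    return np.tanh(W_hh @ hidden + W_xh @ word_vector)

hidden = np.zeros(hidden_size)
sentence = [rng.normal(size=embed_size) for _ in range(5)]  # stand-in word vectors
for word_vector in sentence:
    hidden = rnn_step(hidden, word_vector)
# `hidden` now summarizes the whole sequence seen so far.
```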
Advantages and disadvantages
Because “language models” refers to such a large group of models, it’s difficult to generalize about their advantages and disadvantages. The challenges of language modeling include the fact that language is so high-dimensional—there are a vast number of different words in any given language, and some combinations might never appear in the training data.
Furthermore, language depends greatly on the context of what has come before in the sequence, requiring the network to handle or represent that context in some way. The need to address this led first to RNNs with long short-term memory (LSTMs) and subsequently to transformers, which can process an entire sentence as a whole and have since emerged as the state-of-the-art architecture for language models.
Use cases
Language models can be used for translation, summarization, grammatical error correction, speech recognition, and many more tasks. They are used to generate new creative text content with many applications and are proving to be capable of advanced reasoning, such as analyzing data and solving logic puzzles. Interestingly, research has found that an emergent capability of LLMs is spatial awareness and the ability to create basic drawings, even though they are trained entirely on text.
Transformer-based models
Transformers, invented by researchers at Google and the University of Toronto in 2017, revolutionized the field of deep learning. Large language models like ChatGPT are transformer-based models, and Google search results are also powered by transformers.
A transformer-based model uses its training data to learn how different words are related. For example, it might learn that cake and pie are conceptually similar, whereas cake and cape are not directly related. It might also learn that slice can be linked to cake and pie, especially if those words occur in close proximity.
When analyzing text, the model uses this baseline understanding to construct what resembles a massive spreadsheet. It can look up any two words in the text and get an answer about how related they probably are.
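That “spreadsheet” corresponds, loosely, to the attention score matrix inside a transformer. Here’s a sketch that computes pairwise relatedness scores for a few words at once; the word vectors are random stand-ins for the learned vectors in which related words genuinely score highly together.

```python
import numpy as np

# The "spreadsheet" of word relatedness, computed as scaled dot-product
# attention scores over one vector per word.
words = ["slice", "of", "cake", "cape"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 8))  # one 8-dim vector per word

# Pairwise scores: every word against every other word, all at once.
scores = vectors @ vectors.T / np.sqrt(vectors.shape[1])

# Softmax each row so one word's scores across the text sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

for word, row in zip(words, weights):
    print(word, np.round(row, 2))
```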
By leveraging these contextual cues, a transformer model adeptly interprets language and forecasts potential continuities in a conversation. For instance, if someone mentions a cake in one segment and then shifts to discussing their birthday in the next, the model anticipates the eventual mention of candles or a party, based on the established linguistic connections.
Advantages and disadvantages
When it comes to analyzing and generating language, transformers have a few advantages over RNNs, their predecessors. They can process text in parallel across the network rather than processing each word sequentially. This makes them faster and more efficient to train on very large datasets. They can also make connections between words regardless of how far apart they are, allowing them to leverage more context from the text.
However, transformers need a lot of data to perform well, and with smaller datasets, more traditional neural network architectures may work better.
Use cases
Transformers have many generative AI applications. While transformer-based models are typically used to generate text or speech, researchers are exploring their use for image generation, as they are less computationally intensive than diffusion models.
Most famously, LLMs are transformer-based models. LLMs like GPT use only the decoder portion of the architecture. The prompt is fed into the model as an encoding—that set of numerical values, probabilities, and attention data we mentioned earlier. The model decodes the input using the self-attention mechanism, looking at all the words in the prompt in parallel. The model’s goal is to output a prediction for the next word in the sentence.
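A sketch of that generation loop makes the idea concrete: predict one next word, append it to the sequence, and repeat. The `model` function here is a hypothetical placeholder that returns random probabilities; a real transformer would return learned ones.

```python
import numpy as np

# A sketch of the decoder-only generation loop: feed in the prompt, predict
# one next word, append it, and repeat.
rng = np.random.default_rng(0)
vocabulary = ["the", "cake", "pie", "is", "delicious", "."]

def model(tokens):
    # Placeholder: a real transformer would attend over `tokens` in parallel
    # and output learned probabilities. Here we just return random ones.
    logits = rng.normal(size=len(vocabulary))
    return np.exp(logits) / np.exp(logits).sum()

tokens = ["the", "cake"]  # the prompt
for _ in range(4):
    probabilities = model(tokens)
    next_word = vocabulary[int(np.argmax(probabilities))]  # greedy decoding
    tokens.append(next_word)

print(" ".join(tokens))
```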
Transformers have many applications outside of generating text in natural language processing. In fact, they were originally conceived to translate, or transform, text from one language to another. Grammarly has contributed research toward using transformers to correct grammar mistakes.
Conclusion
Generative AI models have come a long way in the past decade. We hope that now you understand a little bit more about the evolution of these models, how they work, and how they might be applied to different use cases. This article has just scratched the surface, however, and left out many important details with the aim of providing an overview for the average reader. We encourage you to continue learning about the math and science behind these models by studying the research papers that they are based on and learning more about how they work from a probabilistic and statistical perspective.