
AI tool generates high-quality images faster than state-of-the-art approaches

Researchers fuse the best of two popular methods to create an image generator that uses less energy and can run locally on a laptop or smartphone.

Press Contact:

Melanie Grados
Phone: 617-253-1682
MIT News Office

Image: An image being put together by two sets of tweezers, one branded as autoregressive and the other as diffusion.
Caption: Researchers combined two types of generative AI models, an autoregressive model and a diffusion model, to create a tool that leverages the best of each model to rapidly generate high-quality images.
Credits: Christine Daniloff, MIT; image of astronaut on horseback courtesy of the researchers

Image: Four AI-generated images of an astronaut riding a horse.
Caption: The new image generator, called HART (short for Hybrid Autoregressive Transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster.
Credits: Courtesy of the researchers

The ability to generate high-quality images quickly is crucial for producing realistic simulated environments that can be used to train self-driving cars to avoid unpredictable hazards, making them safer on real streets.

But the generative artificial intelligence techniques increasingly being used to produce such images have drawbacks. One popular type of model, called a diffusion model, can create stunningly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce poorer-quality images that are often riddled with errors.

Researchers from MIT and NVIDIA developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture and then a small diffusion model to refine the details of the image.

Their tool, known as HART (short for hybrid autoregressive transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster.

The generation process consumes fewer computational resources than typical diffusion models, enabling HART to run locally on a commercial laptop or smartphone. A user only needs to enter one natural language prompt into the HART interface to generate an image.

HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and aiding designers in producing striking scenes for video games.

“If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART,” says Haotian Tang SM ’22, PhD ’25, co-lead author of a new paper on HART.

He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA; as well as others at MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.

The best of both worlds

Popular diffusion models, such as Stable Diffusion and DALL-E, are known to produce highly detailed images. These models generate images through an iterative process where they predict some amount of random noise on each pixel, subtract the noise, then repeat the process of predicting and “de-noising” multiple times until they generate a new image that is completely free of noise.
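A minimal sketch of that predict-and-subtract loop, assuming a stand-in `noise_predictor` in place of a trained network (the schedule and step count here are illustrative, not the actual Stable Diffusion or DALL-E internals):

```python
import numpy as np

def generate_with_diffusion(noise_predictor, shape, num_steps=30, seed=0):
    """Start from pure noise, then repeatedly predict and subtract noise.

    `noise_predictor` is a stand-in for a trained model that estimates
    the noise present in the current image at a given step.
    """
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(shape)        # begin with random noise
    for step in range(num_steps):             # 30+ passes over every pixel
        predicted_noise = noise_predictor(image, step)
        image -= predicted_noise / num_steps  # one "de-noising" step
    return image

# Dummy predictor so the sketch runs: pretends half the current image is
# noise. A real predictor is a large neural network.
demo = generate_with_diffusion(lambda img, step: 0.5 * img, shape=(64, 64))
```

Every one of the 30-plus steps touches every pixel, which is the source of both the quality (repeated chances to fix details) and the cost.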

Because the diffusion model de-noises all pixels in an image at each step, and there may be 30 or more steps, the process is slow and computationally expensive. But because the model has multiple chances to correct details it got wrong, the images are high-quality.

Autoregressive models, commonly used for predicting text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They can’t go back and correct their mistakes, but the sequential prediction process is much faster than diffusion.
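The contrast with diffusion shows up in a sketch of this sequential loop, again with a hypothetical `next_token_model` standing in for a trained transformer:

```python
def generate_autoregressively(next_token_model, num_tokens):
    """Predict image tokens one after another, left to right.

    `next_token_model` is a stand-in for a trained transformer that
    returns the next token given everything generated so far.
    """
    tokens = []
    for _ in range(num_tokens):
        # One cheap pass per token -- but once a token is emitted,
        # no later step can go back and revise it.
        tokens.append(next_token_model(tokens))
    return tokens

# Dummy model so the sketch runs: always predicts token 0.
demo_tokens = generate_autoregressively(lambda toks: 0, num_tokens=64)
```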

These models use representations known as tokens to make predictions. An autoregressive model utilizes an autoencoder to compress raw image pixels into discrete tokens, as well as to reconstruct the image from predicted tokens. While this boosts the model’s speed, the information loss that occurs during compression causes errors when the model generates a new image.
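One common way to build such discrete tokens, and a reasonable mental model here, is vector quantization: round each continuous patch encoding to its nearest entry in a learned codebook. The sketch below assumes that scheme (the article does not specify HART’s exact tokenizer); the gap between the input and the reconstruction is exactly the information that gets lost:

```python
import numpy as np

def tokenize(patch_vectors, codebook):
    """Map each continuous patch vector to the index of its nearest
    codebook entry -- the lossy compression step described above."""
    distances = np.linalg.norm(
        patch_vectors[:, None, :] - codebook[None, :, :], axis=-1)
    return distances.argmin(axis=1)

def reconstruct(token_ids, codebook):
    """Rebuild patch vectors from tokens; detail lost in rounding to the
    nearest codebook entry cannot be recovered at this stage."""
    return codebook[token_ids]

# Toy example: the gap between `patches` and the reconstruction is the
# kind of residual information HART's second stage targets.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 16))   # 512 discrete codes
patches = rng.standard_normal((64, 16))     # 64 image patches
ids = tokenize(patches, codebook)
residual = patches - reconstruct(ids, codebook)
```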

With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens. Residual tokens compensate for the model’s information loss by capturing details left out by discrete tokens.

“We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes,” says Tang.

Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, instead of the usual 30 or more a standard diffusion model requires to generate an entire image. This minimal overhead of the additional diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly enhancing its ability to generate intricate image details.
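Putting the pieces together, here is a schematic of the two-stage pipeline as the article describes it; `ar_next_token` and `residual_step` are hypothetical stand-ins, not HART’s actual models:

```python
import numpy as np

def generate_with_hart(ar_next_token, residual_step, codebook,
                       num_image_tokens, num_diffusion_steps=8):
    """Schematic of the hybrid described above (stand-in models).

    Stage 1: an autoregressive transformer quickly lays down the big
    picture as discrete tokens. Stage 2: a small diffusion model runs a
    few steps to predict the residual detail those tokens left out.
    """
    # Stage 1: fast sequential prediction, one pass per token.
    tokens = []
    for _ in range(num_image_tokens):
        tokens.append(ar_next_token(tokens))
    coarse = codebook[np.asarray(tokens)]  # decode tokens to patch vectors

    # Stage 2: diffusing only the residual means about 8 steps suffice,
    # instead of the 30+ a full diffusion model would need.
    residual = np.zeros_like(coarse)
    for step in range(num_diffusion_steps):
        residual = residual_step(residual, coarse, step)
    return coarse + residual

# Dummy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 16))
dummy_ar = lambda toks: rng.integers(0, 512)
dummy_res = lambda res, coarse, step: res + 0.01 * rng.standard_normal(coarse.shape)
image_patches = generate_with_hart(dummy_ar, dummy_res, codebook,
                                   num_image_tokens=64)
```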

“The diffusion model has an easier job to do, which leads to more efficiency,” he adds.

Outperforming larger models

During the development of HART, the researchers encountered challenges in effectively integrating the diffusion model to enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process resulted in an accumulation of errors. Instead, their final design, which applies the diffusion model to predict only residual tokens as the final step, significantly improved generation quality.

Their method, which uses a combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but it does so about nine times faster. It uses about 31 percent less computation than state-of-the-art models.

Moreover, because HART uses an autoregressive model to do the bulk of the work (the same type of model that powers LLMs), it is better suited for integration with the new class of unified vision-language generative models. In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture.

“LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities,” he says.

In the future, the researchers want to go down this path and build vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they also want to apply it to video generation and audio prediction tasks.

This research was funded, in part, by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation. The GPU infrastructure for training this model was donated by NVIDIA.
