In many ways, 2023 was the year people started to understand what AI really is and what it can do. This was the year chatbots went truly viral and the year governments started taking the risks of AI seriously. These developments were not so much new innovations as technologies and ideas that took center stage after a long gestation period.
But there have also been many new innovations. Here are three of the biggest from last year:
Multimodality
“Multimodality” may sound like jargon, but it’s worth understanding what it means: the ability of an AI system to process many different types of data, not just text but also images, video, audio, and more.
This year was the first time the public had access to powerful multimodal AI models. OpenAI’s GPT-4 was the first of these, allowing users to upload images as well as text as inputs. GPT-4 can “see” the contents of an image, which opens up all kinds of possibilities, such as asking it what to make for dinner based on a photograph of the contents of your refrigerator. In September, OpenAI rolled out the ability for users to interact with ChatGPT by voice as well as text.
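For developers, a multimodal request looks much like an ordinary text request with an image attached alongside the prompt. Below is a minimal sketch using OpenAI’s Python SDK and a vision-capable GPT-4 model; the model name, image URL, and prompt are illustrative assumptions, not details from this article.

```python
# Minimal sketch of a multimodal (text + image) request with the OpenAI Python SDK.
# The model name, image URL, and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # a vision-capable GPT-4 variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What could I make for dinner with these ingredients?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/fridge-photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```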
Google DeepMind’s latest Gemini model, announced in December, can also work with images and audio. A launch video shared by Google showed the model identifying a duck based on a line drawing on a Post-it note. In the same video, after being shown an image of pink and blue yarn and asked what it could be used to create, Gemini generated an image of a pink and blue octopus stuffed animal. (The marketing video appeared to show Gemini observing moving images and responding to audio commands in real time, but in a post on its website, Google said the video had been edited for brevity and that the model had been prompted using still images, not video, and text prompts, not audio, although the model does have audio capabilities.)
“I think the next milestone that people will look back on and remember is (AI systems) becoming much more multimodal,” Google DeepMind co-founder Shane Legg said on a podcast in October. “This transition is still in its early stages, and when you start to really digest a lot of video and things like that, these systems will start to have a much deeper understanding of the world.” In an interview with TIME in November, OpenAI CEO Sam Altman said that multimodality in the company’s new models would be one of the key things to watch for next year.
The promise of multimodality is not just that it makes models more useful. It is also that the models can be trained on vast new datasets – images, video, audio – that contain more information about the world than text alone. The belief within many leading AI companies is that this new training data will make models more capable or more powerful. It is a step on the path, many AI scientists hope, toward “artificial general intelligence,” the kind of system that could match human intellect, make new scientific discoveries, and perform economically valuable work.
Constitutional AI
One of the biggest unanswered questions in AI is how to align these systems with human values. If they become more intelligent and more powerful than humans, they could cause incalculable harm to our species – some even warn of total extinction – unless, somehow, they are constrained by rules that place human flourishing at the center of their concerns.
The process OpenAI used to align ChatGPT (and to avoid the racist and sexist behavior of earlier models) worked well, but it required a large amount of human labor, through a technique known as “reinforcement learning from human feedback,” or RLHF. Human reviewers would rate the AI’s responses and give it the computing equivalent of a dog treat if the response was helpful, harmless, and compliant with OpenAI’s list of content rules. By rewarding the AI when it was good and punishing it when it was bad, OpenAI developed an effective and relatively harmless chatbot.
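In practice, that “dog treat” is usually a scalar score from a separate reward model trained on human comparisons between two candidate responses. The sketch below illustrates that preference-learning step in PyTorch; the tiny architecture and random data are stand-ins for illustration, not OpenAI’s implementation.

```python
# Toy sketch of the reward-modeling step in RLHF (not OpenAI's actual code).
# Human raters pick the better of two responses; the reward model is trained
# so the preferred response receives the higher score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, embed_dim)  # stand-in for an LLM backbone
        self.score_head = nn.Linear(embed_dim, 1)       # outputs a scalar reward

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score_head(torch.relu(self.encoder(features))).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder embeddings for (prompt + chosen response) and (prompt + rejected response).
chosen = torch.randn(8, 768)
rejected = torch.randn(8, 768)

# Preference loss: push the human-preferred response above the rejected one.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```

The chatbot itself is then fine-tuned, typically with a reinforcement-learning algorithm, to produce responses the reward model scores highly; the human labor goes into producing the comparison labels that train the reward model in the first place.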
But because the RLHF process relies heavily on human labor, a big question mark hangs over its scalability. It is expensive. It is subject to the biases and mistakes of individual raters. The more complicated the list of rules, the more room there is for failure. And it seems unlikely to work for AI systems so powerful that they start doing things humans can’t understand.
Constitutional AI, first described by researchers at the leading AI lab Anthropic in a December 2022 paper, attempts to solve these problems by exploiting the fact that AI systems are now capable enough to understand natural language. The idea is quite simple. First, you write a “constitution” that lays out the values you want your AI to follow. Then you train the AI to score responses based on how well they conform to the constitution, and incentivize the model to produce responses that score more highly. Instead of reinforcement learning from human feedback, it is reinforcement learning from AI feedback. “These methods allow AI behavior to be controlled more precisely and with far fewer human labels,” the Anthropic researchers wrote. Constitutional AI was used to align Claude, Anthropic’s 2023 answer to ChatGPT. (Investors in Anthropic include Salesforce, where TIME co-chairman and owner Marc Benioff is CEO.)
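To make the idea more concrete, here is a rough sketch of the AI-feedback step, loosely following the paper’s description: an AI “judge” compares two candidate responses against a principle sampled from the constitution, and its verdict stands in for the human label. The constitution text, helper function, and stub judge below are hypothetical illustrations, not Anthropic’s prompts or code.

```python
# Conceptual sketch of reinforcement learning from AI feedback (RLAIF),
# loosely following the constitutional AI recipe. The constitution, prompts,
# and judge are illustrative placeholders, not Anthropic's actual system.
import random
from typing import Callable

CONSTITUTION = [
    "Choose the response that is the most helpful, honest, and harmless.",
    "Choose the response that least encourages illegal or dangerous activity.",
    "Choose the response that is least likely to be perceived as rude or offensive.",
]

def ai_preference(judge: Callable[[str], str], prompt: str,
                  response_a: str, response_b: str) -> str:
    """Ask an AI judge which response better follows a sampled principle."""
    principle = random.choice(CONSTITUTION)
    judge_prompt = (
        f"Principle: {principle}\n"
        f"User prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principle? Answer 'A' or 'B'."
    )
    return judge(judge_prompt)  # an AI-generated label, with no human rater involved

# Usage with a stub judge; a real setup would call a capable language model here.
def stub_judge(text: str) -> str:
    return "B"

label = ai_preference(
    stub_judge,
    "How do I pick a lock?",
    "Here is a step-by-step guide...",
    "I can't help with that, but a licensed locksmith can.",
)
```

These AI-labeled comparisons then feed the same reward-modeling and fine-tuning pipeline described above for RLHF, which is why the approach needs far fewer human labels.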
“With constitutional AI, you explicitly write out the normative premises with which your model should approach the world,” Jack Clark, Anthropic’s head of policy, told TIME in August. “Then the model trains on that.” There are still problems, such as the difficulty of ensuring that the AI has understood both the letter and the spirit of the rules (“you’re stacking your chips on a big, opaque AI model,” Clark says), but the technique is a promising addition to a field where new alignment strategies are rare.
Of course, constitutional AI does not answer the question of which values AI should be aligned with. But Anthropic is experimenting with making that question more democratic. In October, the lab ran an experiment asking a representative group of 1,000 Americans to help it choose rules for a chatbot, and found that while there was some polarization, it was still possible to write a workable constitution based on statements the group had reached a consensus on. Experiments like this could open the door to a future in which ordinary people have far more say in how AI is governed, compared to today, when a small number of Silicon Valley AI leaders write the rules.
Text to video
One notable result of the billions of dollars poured into AI this year has been the rapid rise of text-to-video tools. Last year, text-to-image tools were barely out of their infancy; today, several companies offer the ability to turn sentences into moving images with increasingly fine levels of accuracy.
One of these companies is Runway, a Brooklyn-based AI video startup that wants to make filmmaking accessible to everyone. Its latest model, Gen-2, allows users not only to generate video from text, but also to change the style of an existing video based on a text prompt (for example, turning a shot of cereal boxes on a kitchen table into a nighttime urban landscape), in a process it calls video-to-video.
“Our mission is to create tools for human creativity,” Runway CEO Cristobal Valenzuela told TIME in May. He acknowledges that this will affect jobs in the creative industries, where AI tools are quickly making some forms of technical expertise obsolete, but he believes the world on the other side is worth it: “Our vision is a world where human creativity is amplified and enhanced, one that depends less on the craftsmanship, budget, technical specifications, and knowledge you have, and more on your ideas.” (Investors in Runway include Salesforce, where TIME co-chairman and owner Marc Benioff is CEO.)
Another startup in the text-to-video space is Pika AI, which is reportedly being used to create millions of new videos every week. Led by two Stanford dropouts, the company launched in April but has already secured funding that values it at between $200 million and $300 million, according to Forbes. Aimed not at professional filmmakers but at everyday users, free tools like Pika are attempting to transform the user-generated content landscape. That could happen as soon as 2024, but text-to-video tools are computationally expensive to run, so don’t be surprised if they start charging for access once the venture capital runs out.