AI offers a huge opportunity for Indian languages to expand their reach, says Vishnu Vardhan, founder of SML Generative AI, the parent company of Hanooman AI, in conversation with Anshu in New Delhi. But he adds that there are also risks. Edited excerpts:
How can AI drive positive growth in regional languages and what impact could it have on them in the next decade?
AI offers a huge opportunity for regional languages, but it also poses a huge risk. In the next decade, generative AI will become the norm. If we don’t develop robust models for Indian languages, people will increasingly rely on English, which will threaten regional languages. However, if we create AI models for these languages, especially voice-based models, it could significantly expand their use in education, communication, and entertainment.
The challenge lies in the lack of data and resources. We are still at the beginning, and only a few companies are focusing on this area. Government support and open-source data are essential to foster an AI ecosystem in regional languages. Without these efforts, English could dominate, but with the right push, regional languages could thrive.
AI or Generative AI is a new thing. So when we talk about developing a chatbot or AI assistant in a regional language like Hindi, Tamil or Telugu, where does the dataset come from? Is it difficult to source the dataset?
The building blocks of these datasets are called tokens. Developing chatbots or AI assistants in regional languages like Hindi, Tamil, or Telugu is challenging because of limited datasets, or tokens. While English has abundant data, Indian languages lack large datasets, as most online content is in English.
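As a rough illustration of this token gap, the sketch below assumes the Hugging Face transformers library and uses GPT-2's byte-pair tokenizer as a stand-in for any English-centric model; it counts how many tokens a comparable sentence costs in English and in Hindi.

```python
# Rough illustration of the token gap, assuming the Hugging Face
# `transformers` package; GPT-2's byte-pair tokenizer stands in for an
# English-centric model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "Artificial intelligence can expand access to education.",
    "Hindi": "कृत्रिम बुद्धिमत्ता शिक्षा तक पहुंच बढ़ा सकती है।",
}

for language, sentence in samples.items():
    # An English-trained vocabulary splits Devanagari text into many
    # byte-level fragments, so the same idea costs far more tokens.
    print(language, len(tokenizer.tokenize(sentence)), "tokens")
```

The Hindi sentence typically comes out several times longer in tokens than its English counterpart, which is one concrete way the data and vocabulary gap shows up.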
However, the potential of this technology is growing as local media, government institutions, and social networks increasingly produce content in regional languages. To create AI models for these languages, we can leverage data from media organizations, government agencies, and public domains.
Another approach is to generate synthetic data, using tools that run on hardware such as Nvidia GPUs.
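One common way to create such synthetic data is back-translation: abundant English text is machine-translated into the target language to produce extra training material. The sketch below assumes the Hugging Face transformers library and the publicly available Helsinki-NLP English-to-Hindi model; it illustrates the general technique, not Hanooman's actual pipeline.

```python
# Sketch of synthetic data generation via back-translation, assuming the
# Hugging Face `transformers` package and the public Helsinki-NLP
# English-to-Hindi model; illustrative only.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

# Abundant English seed sentences become additional Hindi training text.
seed_sentences = [
    "The farmer checked the weather forecast before sowing.",
    "Students can now attend classes in their own language.",
]

synthetic_hindi = [out["translation_text"] for out in translator(seed_sentences)]
for source, target in zip(seed_sentences, synthetic_hindi):
    print(source, "->", target)
```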
Additionally, many Indian languages share their Sanskrit roots, which makes it possible to have some common datasets across languages. By combining these methods (public data, synthetic tokens, and shared datasets), we can develop more robust AI models for Indian languages.
What key principles do AI models use for translation, considering cultural nuances that go beyond word-for-word accuracy?
Translation by large language models is often inaccurate, which is why translated or local-language content still has relatively few users.
Most translation tools first convert the source language to English and then into the target language, which leads to a loss of context and cultural nuance, especially on technical topics. The output can end up out of context or even mean something entirely different, making such tools unreliable for, say, legal documents.
For technical precision, the solution is to build large language models in the native language using relevant datasets. For example, instead of translating, we built a model in Hindi with both English and Hindi tokens.
This allows the model to understand and generate content directly in Hindi, capturing the context and nuances of the language, including regional variations and the use of mixed languages like Hinglish. Translation tools simply cannot offer this level of accuracy, making native language models the best approach, especially for technical content.
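The structural difference he is pointing at can be sketched in a few lines. The functions below are deliberately simple stubs, not any real library's API; they only show where the extra English hop sits in a pivot pipeline compared with a native-language model.

```python
# Illustrative stubs only; a real system would call an actual machine
# translation service and an actual language model here.

def translate(text: str, src: str, tgt: str) -> str:
    # Stub standing in for a machine-translation step.
    return f"[{src}->{tgt}] {text}"

def generate(prompt: str, model: str) -> str:
    # Stub standing in for a large language model call.
    return f"[{model}] {prompt}"

def pivot_answer(question_hi: str) -> str:
    """Typical translation-tool pipeline: Hindi -> English -> Hindi.
    Each hop can drop context, idiom and technical nuance."""
    question_en = translate(question_hi, src="hi", tgt="en")
    answer_en = generate(question_en, model="english_llm")
    return translate(answer_en, src="en", tgt="hi")

def native_answer(question_hi: str) -> str:
    """Native-language pipeline: a model trained on Hindi (and Hinglish)
    tokens answers directly, with no English detour."""
    return generate(question_hi, model="hindi_llm")

question = "भारत में जीएसटी की दर क्या है?"
print(pivot_answer(question))
print(native_answer(question))
```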
What is the market size of AI-based translation tools in India?
India’s regional language internet users, numbering around 500 million, represent a massive $20 billion market opportunity for AI-powered translation tools.
E-commerce, for example, could generate $4 billion in growth, as about 20% of its market remains untapped due to language barriers. With better translation, sales could increase by up to 20%, taking the potential market to $10 billion.
Online education is another key sector, which is expected to become a $10 billion market within five years. Media translation, dubbing and subtitling is a $2-5 billion sector, while general translation services for businesses represent an additional revenue potential of $5-7 billion.
In total, the market for AI-powered translation tools is worth tens of billions of dollars. Before the advent of generative AI, existing translation solutions were less accurate, limiting their impact. Today, with advances in generative AI, tools are more accurate and offer voice translation, making them more accessible and easier to use for regional language speakers.
Currently, all AI models are loss-making. Recently, Microsoft’s CFO said that it could take up to 15 years to recoup the investment. How long will it take to build a profitable business from generative AI and other AI tools?
Yes, I totally agree with that. Current AI tools are extremely expensive due to the massive investments required to develop them, which increases their cost of use. However, we are taking a different approach with our Hanooman model. It is built in a simple and efficient way, which makes it much more cost-effective. While we have not yet finalized the cost of APIs or tokens, our pricing will be significantly lower, providing a better ROI for businesses and users of generative AI.
Unlike models built with huge budgets that take years to recoup their costs, our goal is to create a multilingual AI model, optimized for all 22 official languages of India, that delivers similar results without the significant expense. With our simplified approach, we hope to reach breakeven much faster than other AI companies.
First published: September 13, 2024 | 6:36 p.m. IST