Guest Column

AI models for India

Language-specific rules are needed to ensure that AI systems properly understand sentences in Tamil, Hindi, and other languages.

Indian enterprises have the opportunity to create domain-specific AI foundational models and build LLMs in Indian languages.

Balaraman Ravindran is Professor and Head of the Wadhwani School of Data Science & AI, IIT Madras.

Artificial intelligence (AI) is rapidly reshaping economies and societies worldwide. Nations are racing to develop their own foundational AI models: models, such as large language models (LLMs), that are trained on broad data and can be adapted to a wide range of downstream tasks. India finds itself at a crucial juncture: what should its AI strategy be when it comes to creating foundation models, AI applications, and the required infrastructure?

The IndiaAI Mission’s recent call for proposals to build a suite of foundation models that are indigenous, domain-specific, Indic language-enabled, and run in India is a step in the right direction. The emphasis is on obtaining quick wins in the next six to nine months. But we must recognise that these are the first steps in a longer journey to create Indian AI models that are cost-effective, secure, and aligned with national interests.

Krishnan Narayanan is Co-founder and President of itihaasa Research and Digital. He studies the evolution of technology domains in India.

India should build foundational models for critical areas like national security, healthcare, and governance, while relying on global models for less-sensitive sectors. Training these models requires very large datasets. Even DeepSeek, which marks a ‘Sputnik moment’ in the global AI race, has said that it needed more than a trillion tokens to train its base model. However, building such datasets will be a time-intensive and expensive process, especially when the data need to be curated and made safe for the Indian context.


ALTERNATIVE MODELS

Given the constraints in training large-scale foundation models from scratch, an immediate opportunity lies in developing highly focused smaller language and domain-specific models designed for specific applications. While the initial training process might still rely on larger global models, the resulting system can be fine-tuned to address a narrow and well-defined task. This method enables better control over AI behaviour, making it easier to implement safeguards and ethical guidelines aligned with Indian culture and regulations.

Smaller models also have practical advantages: they require significantly less computational power, allowing for faster inference and lower energy consumption. However, there is a trade-off: such models often lack the nuanced understanding and reasoning capabilities of their larger counterparts. Advancements in cost-efficient AI inference, model distillation, and adaptive learning techniques are needed to bridge this gap while maintaining efficiency.
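One of the distillation techniques mentioned above can be sketched in a few lines: a smaller "student" model is trained to match the softened output distribution of a larger "teacher". The sketch below is illustrative only, using toy logit vectors and a plain-Python softmax rather than any real model or framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution (the
    'soft targets') and the student's predictions, as in standard
    knowledge distillation. Lower loss = student tracks teacher better."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Toy example: a student whose logits track the teacher's incurs lower loss
teacher = [4.0, 1.0, 0.2]
close_student = [3.8, 1.1, 0.3]
far_student = [0.2, 1.0, 4.0]
assert distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student)
```

Raising the temperature softens the teacher's distribution, exposing the relative probabilities it assigns to wrong answers; this "dark knowledge" is what helps a small student approximate a large teacher's reasoning at a fraction of the inference cost.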

Indian enterprises, both public and private, have the opportunity to create domain-specific AI models. They can rely on open-source architectures and fine-tuning techniques, and leverage Retrieval-Augmented Generation (RAG) for integrating their domain-specific knowledge and unique, high-quality datasets. Newer fundamental research is required to develop approaches that can help us train AI models using less domain-specific data.
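The RAG pattern described above can be sketched minimally: retrieve the most relevant passages from an enterprise's own knowledge base, then prepend them to the model's prompt. The retrieval here uses simple word overlap as a stand-in for embedding-based search, and the example documents are invented for illustration.

```python
def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (a toy stand-in
    for embedding-based retrieval) and return the top k passages."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Assemble a RAG-style prompt: retrieved context first, question last,
    with an instruction to ground the answer in that context."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

# Hypothetical domain knowledge base (illustrative sentences only)
knowledge_base = [
    "PMFBY is a crop-insurance scheme covering yield losses from natural calamities.",
    "UPI is an instant real-time payment system developed by NPCI.",
]
prompt = build_prompt("Which scheme covers crop yield losses?", knowledge_base)
```

Because the domain knowledge lives in the retrieved context rather than the model's weights, the underlying foundation model can remain a general-purpose one while the enterprise's proprietary data stays in its own store and can be updated without retraining.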

INDIAN-LANGUAGE LLMS

Building LLMs in Indian languages presents its own set of challenges. The most widely discussed issue is the lack of training data for resource-poor languages, but the complexity goes beyond mere data scarcity. One of the key concerns in this space is the effectiveness of tokenisation methods. Different languages have different syntactic structures, and a one-size-fits-all tokenisation approach may not be ideal.

Let us take a simple example to understand the issue. Consider a sentence in English, Tamil, and Hindi.

  • In English, the sentence “The cat sat on the mat” can be tokenised in a straightforward manner as [‘The’, ‘cat’, ‘sat’, ‘on’, ‘the’, ‘mat’].
  • In Tamil, words change form based on tense, gender, and case. Depending on whether “book” is used as a subject, an object, or a possessive (“of the book”), the Tamil word takes different forms (புத்தகம், புத்தகத்தை, புத்தகத்தின்). So a token that just says “book” is ambiguous unless the case suffix is preserved.
  • In Hindi, meaning often comes from multi-word verb phrases. For example, “खेल रहा है” is in present continuous tense. Breaking this phrase incorrectly could lose meaning. A typical tokenisation might separate ‘खेल’ from ‘रहा है’, which would make it hard for AI models to understand that this is a continuous action.

Thus, language-specific rules are needed to ensure that AI systems properly understand sentences in Tamil, Hindi, and other languages. The right tokenisation strategy can dramatically reduce training compute costs and inference time, making AI models more efficient and accessible. More research is required in this area.
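The contrast above can be sketched in a few lines of Python. To a naive whitespace tokeniser, the three Tamil case forms of “book” look like three unrelated tokens; a suffix-aware split exposes the shared stem. The suffix list here is a toy assumption for illustration, not a real Tamil morphological analyser.

```python
def whitespace_tokenise(text):
    """Naive tokenisation: adequate for English, lossy for agglutinative languages."""
    return text.split()

def suffix_aware_tokenise(word, suffixes):
    """Illustrative morpheme split: peel off a known case suffix so that
    inflected forms share one stem token. (Toy suffix list, not a real
    morphological analyser.)"""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]

# English: whitespace tokenisation is straightforward
assert whitespace_tokenise("The cat sat on the mat") == ["The", "cat", "sat", "on", "the", "mat"]

# Tamil: three case forms of "book" look unrelated to a whitespace tokeniser...
forms = ["புத்தகம்", "புத்தகத்தை", "புத்தகத்தின்"]
# ...but a suffix-aware split recovers the single shared stem
toy_suffixes = ["த்தை", "த்தின்", "ம்"]
stems = {suffix_aware_tokenise(w, toy_suffixes)[0] for w in forms}
```

After the suffix-aware split, all three forms map to one stem token plus a short case-suffix token, so a model sees one vocabulary entry for “book” instead of three, which is exactly the kind of saving in vocabulary size and training compute that a language-specific tokenisation strategy can deliver.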

A DPI FOR AI

India has successfully implemented a Digital Public Infrastructure (DPI) approach in fintech (Unified Payments Interface) and identity (Aadhaar). The same model should now be extended to AI foundation models. Building a DPI for AI will democratise access to high-quality datasets, computational resources, and model training tools, enabling start-ups, academia, and enterprises to innovate at scale.

One opportunity is to create robust datasets for testing and benchmarking models in local Indian languages. Simple translation of existing English-based datasets into Indian languages is not an adequate solution, as it fails to capture the nuances of native speech. A model trained on poorly adapted translations will likely produce outputs that do not sound natural to a native speaker.

Factual accuracy is another pressing issue. For instance, AI systems deployed in India must be explicitly trained to recognise and respect the country’s officially recognised national boundaries to avoid generating contentious or misleading information. Ensuring factual integrity requires a more rigorous approach to dataset curation and validation, beyond just training on internet data.

To support AI development, India should also create Application Programming Interfaces (APIs), annotation platforms, and auto-labelling tools for large-scale, crowd-sourced data curation, especially for Indian languages. Additionally, an AI fine-tuning marketplace should enable developers to adapt foundation models, alongside an AI Model Registry to track and govern AI deployment across key sectors.

RESPONSIBLE AI

One of the keys to greater acceptability of LLMs is to ensure that they adhere to Responsible AI principles and can offer some assurance of safety. While discussions around fairness, bias, and digital accessibility are common in global AI policy circles, India has its own nuanced notions of fairness, its own stereotypes, and specific sociocultural considerations that necessitate tailored solutions. For instance, AI systems have to be trained not only in local languages but also on multiple regional dialects and accents. Fairness in AI must be adapted to reflect India’s diverse and complex demographic realities. AI’s impact on employment and labour markets in India also needs thorough investigation: the country’s large informal workforce and uneven digital literacy present unique challenges that global AI policies do not fully address. As part of research in Responsible AI, significant effort is needed to build and test AI models for fairness and non-prejudicial language.

Even the latest foundation models, such as DeepSeek and OpenAI’s newest releases, continue to face fundamental limitations of transformer-based architectures. These models are restricted by the data they have been trained on and struggle with truly novel questions, as seen in simple reasoning tasks. They can still perform unpredictably poorly when encountering data that fall outside their training set. India should actively fund research in improved transformer architectures, multimodal AI, efficient AI training, and reinforcement learning methodologies.

AI HARDWARE, INFRASTRUCTURE

In January 2025, the IndiaAI Mission announced the establishment of a robust AI computing infrastructure comprising over 18,000 graphics processing units (GPUs). Private firms like Jio and Tata are also investing in AI cloud solutions in collaboration with NVIDIA. In the long run, as an important step towards self-reliance, India has to develop domestic semiconductor capabilities, so that the country does not remain vulnerable to supply-chain disruptions and geopolitical constraints. The ongoing investments in India’s semiconductor industry are promising. Establishing synergy between AI research labs and semiconductor manufacturing will be crucial in achieving technological self-sufficiency.

In conclusion, India’s AI development strategy must be multi-faceted, balancing long-term foundational investments with short-term pragmatic solutions. Developing domain-specific AI applications and smaller language models can provide immediate value while the country works towards more ambitious goals. Efforts to generate high-quality Indian datasets, refine training methodologies, and advance research in Responsible AI must be accelerated. Equally important is the need to invest in indigenous computing infrastructure, ensuring that India can build and deploy AI solutions without excessive reliance on external resources.



© 2025 IIT MADRAS - All rights reserved
