India needs to develop a large language model that reflects its linguistic, cultural tapestry

  • InduQin
  • Apr 14, 2023
  • 2 min read

In the rapidly evolving world of AI, large language models (LLMs) are revolutionising the way we interact with technology, driving unprecedented transformations in communication, knowledge dissemination and global connectivity. In this landscape, the development of an Indic LLM is not just a strategic move but an urgent necessity to promote India's linguistic and cultural heritage while bolstering economic growth and innovation. It also reduces reliance on external technology, enhancing self-sufficiency and promoting technological sovereignty.


The current landscape of LLM development is dominated by US-centric models. These models, primarily designed, built and tested in English, impose a broad, American value system around the world, akin to the global impact of social media platforms like Facebook, Google and Twitter. The risk of perpetuating this US-centric paradigm is immense, threatening to eclipse the linguistic and cultural diversity that makes our world a richer and more inclusive place.


OpenAI's GPT-4 (Generative Pre-trained Transformer 4) is trained on a diverse and extensive dataset that spans multiple domains and sources. The training data includes web pages, books, articles and other text-based sources. For reference, GPT-3 was trained on 45 terabytes of text data, roughly the equivalent of 500 billion tokens, the words or sub-word units that are both the ultimate input and output of LLMs. OpenAI has invested in the neighbourhood of $10 billion to 'train' GPT-4. To train an LLM of the GPT-4 class, one would need access to a high-performance computing cluster numbering a few thousand Nvidia A100 GPUs (graphics processing units) with large-scale memory and high-speed giga-scale networking.
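The figures quoted above can be sanity-checked with a little arithmetic. This is only a back-of-envelope sketch, assuming "45 terabytes" refers to raw, unfiltered bytes of text; in practice such raw corpora are heavily filtered before tokenisation, which is why the implied ratio is far larger than the roughly 4 bytes per token typical of cleaned English text.

```python
# Back-of-envelope check of the quoted GPT-3 figures.
# Assumption: 45 TB is taken as 45e12 bytes of raw (pre-filtering) text.

RAW_BYTES = 45e12   # 45 terabytes of raw training text, as quoted
TOKENS = 500e9      # ~500 billion tokens, as quoted

bytes_per_token = RAW_BYTES / TOKENS
print(f"Implied bytes per raw-corpus token: {bytes_per_token:.0f}")  # → 90
```

The gap between 90 raw bytes per token and the ~4 bytes per token of cleaned text is a rough indication of how aggressively web-scale corpora are deduplicated and filtered before training.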


The Nilekani Centre at IIT Madras' AI4Bharat has collected over 21 billion open text tokens and 230 million open parallel sentences in various Indian languages. It has also collected 100,000 hours of YouTube videos in Indian languages that have been published under a Creative Commons (CC) 4.0 licence. However, even this massive data collection is about two orders of magnitude too small to build GPT-4-style models.
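The "two orders of magnitude" claim can be illustrated with a quick calculation. The GPT-4 training-token count is not public, so the figure below assumes a hypothetical GPT-4-class budget of a couple of trillion tokens purely for illustration.

```python
import math

# AI4Bharat's open Indic corpus, as quoted above.
indic_tokens = 21e9

# Assumed GPT-4-class training budget; the real figure is not public.
assumed_gpt4_scale_tokens = 2e12

gap = assumed_gpt4_scale_tokens / indic_tokens
orders = math.log10(gap)
print(f"Gap: ~{gap:.0f}x, i.e. ~{orders:.1f} orders of magnitude")
```

Under that assumption the shortfall comes out near 100x, consistent with the roughly two orders of magnitude cited in the article.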


There are many sources of data relevant to India that can be unlocked to help train LLMs. Text sources include documents from literature, newspapers, public sources, parliaments, state legislatures, Acts and regulations, judiciary and textbooks. Audio and video sources that can contribute to language models include TV and radio programmes in various languages, movies, podcasts and YouTube videos. Automatic transcription methods may allow such data to be used for training LLMs on a large scale. If all these sources are leveraged, the amount of data available in Indian languages could increase by one order of magnitude.


Read More at https://economictimes.indiatimes.com/opinion/et-commentary/india-needs-to-develop-a-large-language-model-that-reflects-its-linguistic-cultural-tapestry/articleshow/99325316.cms
