AI4Bharat Collecting 10 Tn Tokens To Build Next Generation Of AI


SUMMARY

Cofounder Mitesh Khapra claimed that AI4Bharat has “gone to almost every district in the country” and tried to cover almost all the 22 official languages in the past three years

Khapra added that several startups and academic institutes are using AI4Bharat’s data to build their own models to accelerate the “adoption of language technologies”

AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions

IIT Madras-incubated artificial intelligence (AI) lab, AI4Bharat, is reportedly collecting 10 Tn tokens of language data to build the “next generation of AI services”.

For context, tokens are the basic units of text that large language models (LLMs) take as input and produce as output. A token can be a word, a character or a subword.
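To make the three granularities concrete, here is a minimal Python sketch that splits the same phrase into word, character and subword tokens. The function names and the fixed three-character chunking are assumptions made purely for illustration. Production LLM tokenizers typically learn their subword vocabularies from data rather than cutting words at fixed widths, and nothing here reflects the tooling AI4Bharat actually uses.

```python
def word_tokens(text):
    """Split on whitespace: each word is one token."""
    return text.split()


def char_tokens(text):
    """Treat every non-whitespace character as one token."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(text, size=3):
    """Cut each word into fixed-width chunks.

    Illustrative stand-in only: real tokenizers learn subword
    units from a corpus instead of using a fixed width.
    """
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + size] for i in range(0, len(word), size))
    return pieces


sample = "language data"
print(len(word_tokens(sample)), word_tokens(sample))
print(len(char_tokens(sample)), char_tokens(sample))
print(len(subword_tokens(sample)), subword_tokens(sample))
```

The same phrase yields very different token counts depending on the granularity, which is why a figure such as 10 Tn tokens only has a precise meaning relative to a particular tokenizer.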

As per Economic Times, AI4Bharat cofounder Mitesh Khapra claimed that the platform has “gone to almost every district in the country” and “tried to cover almost all the 22 official languages” in the past three years.

AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions. 

Noting that the platform has built the tools required for data collection from scratch, Khapra added that several startups, academic institutes and deeptech institutes are using AI4Bharat’s data to build their own models to accelerate the “adoption of language technologies”.

“Our data, models and scripts are open sourced. You can build on top of that,” he said.

Khapra added that the data collected over the past three years will be fed into the “Ten Trillion Token” project.

“This is going to be required to make sure that we are able to build native Indic models that support Indian languages and not as an afterthought. We want to collect 10 Tn tokens in Indian languages that would be synthetic data that would be language information and cultural information,” he added. 

He also noted that the data collected as part of the project will have use cases spanning farmers, children, digital payments and agriculture.

The comments came on the sidelines of an event organised by People+ai, which is backed by Aadhaar architect Nandan Nilekani. People+ai has also undertaken a project to collect 10 Tn language tokens from sources ranging from formal government documents to conversations.

People+ai’s project aims to build datasets, which are fundamental to training AI foundation models. While English content is plentiful online (nearly 55% of all internet data), the paucity of content in local vernacular languages makes it difficult to train LLMs in them.

AI4Bharat and People+ai are looking to solve this problem by building datasets from the ground up that capture the cultural context, script and grammatical rules of these languages.

Khapra’s comments come a year after AI4Bharat launched its open-source speech dataset, called IndicVoices. Funded by the electronics and IT ministry’s Bhashini initiative and other non-profits, the dataset spans 22 Indian languages.




