AI4Bharat Collecting 10 Tn Tokens To Build Next Generation Of AI

SUMMARY

Cofounder Mitesh Khapra claimed that AI4Bharat has “gone to almost every district in the country” and tried to cover almost all the 22 official languages in the past three years

Khapra added that several startups and academic institutes are using AI4Bharat’s data to build their own models to accelerate the “adoption of language technologies”

AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions

IIT Madras-incubated artificial intelligence (AI) lab, AI4Bharat, is reportedly collecting 10 Tn tokens of language data to build the “next generation of AI services”.

For context, tokens are the basic units of input and output for large language models (LLMs). A token is a piece of text that can be a word, a character or a subword.
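To make the three granularities concrete, here is a minimal Python sketch. It is illustrative only: the function names are hypothetical, and the fixed-size subword split is a toy stand-in for the learned vocabularies (such as byte-pair encoding) that production LLM tokenizers typically use. Nothing here reflects how AI4Bharat counts tokens.

```python
def word_tokens(text):
    """Split on whitespace: each word is one token."""
    return text.split()


def char_tokens(text):
    """Treat every non-whitespace character as one token."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(text, size=3):
    """Toy subword split: cut each word into chunks of at most `size` characters.

    Assumption for illustration only; real tokenizers learn their subword
    pieces from data instead of using fixed-size chunks.
    """
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + size] for i in range(0, len(word), size))
    return pieces


sample = "language models"
print(word_tokens(sample))     # word-level tokens
print(char_tokens(sample))     # character-level tokens
print(subword_tokens(sample))  # toy subword-level tokens
```

The same sentence yields very different token counts depending on the granularity, which is why a figure such as "10 Tn tokens" only has a precise meaning relative to a particular tokenizer.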

According to The Economic Times, AI4Bharat cofounder Mitesh Khapra claimed that the platform has “gone to almost every district in the country” and “tried to cover almost all the 22 official languages” in the past three years.

AI4Bharat claims to have sourced the data from voice samples of users across several demographics and professions.

Noting that the platform has built the tools required for data collection from scratch, Khapra added that several startups, academic institutes and deeptech institutes are using the lab’s data to build their own models to accelerate the “adoption of language technologies”.

“Our data, models and scripts are open sourced. You can build on top of that,” he said.

Khapra added that the data collected over the past three years will be fed into the “Ten Trillion Token” project.

“This is going to be required to make sure that we are able to build native Indic models that support Indian languages and not as an afterthought. We want to collect 10 Tn tokens in Indian languages that would be synthetic data that would be language information and cultural information,” he added.

He also noted that the data collected as part of the project will have use cases spanning farmers, children, digital payments and agriculture.

The comments came on the sidelines of an event organised by People+ai, which is backed by Aadhaar architect Nandan Nilekani and has also undertaken a project to collect 10 Tn language tokens from sources ranging from formal government documents to conversations.

People+ai’s project aims to build datasets, which are fundamental to training AI foundation models. While there is plenty of content online in English (nearly 55% of all internet data), the paucity of content in local languages makes it difficult to train LLMs in them.

AI4Bharat and People+ai are looking to solve this problem by building datasets from the ground up that can capture the cultural context, script and grammatical rules.

Khapra’s comments come a year after AI4Bharat launched its open-source speech dataset, called IndicVoices. Funded by the electronics and IT ministry’s Bhashini initiative and other non-profits, the dataset spans 22 Indian languages.
