AI training dataset used by tech giants allegedly created by scraping YouTube videos in violation of terms

Non-profit AI research group EleutherAI scraped YouTube subtitles to create a dataset in violation of YouTube’s terms of service, ProofNews said on July 16.

The dataset, called the Pile, allegedly includes subtitles of 173,536 YouTube videos from over 48,000 channels. About 12,000 of those videos have since been deleted.

Several top tech and AI firms, including Anthropic, have since used the Pile for training. Anthropic spokesperson Jennifer Martinez said the dataset includes “a very small subset of YouTube subtitles” but declined to comment on possible violations of YouTube’s terms of service.

Business software firm Salesforce also used the dataset. Salesforce VP of AI research Caiming Xiong said the dataset was “publicly available” and that Salesforce used it for academic and research purposes. ProofNews said Salesforce eventually released the same dataset publicly.

Apple used the Pile to train OpenELM, an efficient language model for on-device AI. Nvidia, Bloomberg, and Databricks also used the Pile for AI training.

ProofNews said its list of companies that used the dataset is not comprehensive, as companies do not always disclose which datasets they use in AI training.

Dataset contains crypto channels, more

ProofNews’ search tool indicates that the Pile includes videos from crypto channels and creators, including Coinbase, Cointelegraph, Bitcoin Magazine, BitBoy Crypto, 99Bitcoins, Ivan On Tech, and Andreas Antonopoulos.

ProofNews highlighted that the dataset includes transcripts from major news channels, education channels, late-night shows, popular YouTube hosts, and other categories. The Pile dataset extends beyond YouTube to other websites and online content.

ProofNews noted an earlier report from the New York Times, which said OpenAI and Google had previously harvested YouTube text. Google, which owns YouTube, said the action was permissible due to its agreement with users. OpenAI did not confirm or deny the report.

AI copyright disputes are far-reaching. Law firm BakerHostetler lists at least fifteen lawsuits involving tech firms such as Anthropic, Meta, GitHub, Stability AI, Nvidia, and Google. OpenAI faces high-profile lawsuits from Mother Jones’ parent company and The New York Times.

The post AI training dataset used by tech giants allegedly created by scraping YouTube videos in violation of terms appeared first on CryptoSlate.

editorial staff
