Common Crawl

{{Short description|Nonprofit web crawling and archive organization}}
{{Infobox dot-com company
| name = Common Crawl
| company_type = 501(c)(3) non-profit
| logo = Common Crawl logo.svg
| traded_as =
| foundation = 2007
| dissolved =
| location = San Francisco, California; Los Angeles, California, United States
| incorporated =
| founder = Gil Elbaz
| chairman =
| president =
| CEO =
| MD = Rich Skrenta
| key_people =
| industry =
| products =
| services =
| assets =
| equity =
| owner =
| num_employees =
| parent =
| divisions =
| subsid =
| url = {{url|commoncrawl.org}}
| ipv6 =
| alexa =
| website_type =
| registration =
| num_users =
| language =
| launch_date =
| current_status =
| screenshot =
| caption =
| footnotes =
| license = Apache 2.0 (software) {{clarify|reason=dataset license?|date=November 2024}}
}}

The '''Common Crawl Foundation''' ('''Common Crawl''') is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.<ref name="latimes">{{cite news |title=Tech entrepreneur Gil Elbaz made it big in L.A.|author=Rosanna Xia|work=Los Angeles Times|date=February 5, 2012|access-date=July 31, 2014|url=https://www.latimes.com/business/la-xpm-2012-feb-05-la-fi-himi-elbaz-20120205-story.html|quote=his nonprofit Common Crawl Foundation 'seeks to make a copy of the Web accessible to a data scientist, or to a start-up or to a researcher or to an analyst that just wants to improve the world.'}}</ref><ref name="pressheretv">{{cite news |title=Gil Elbaz and Common Crawl|work=NBC News|date=April 4, 2013|access-date=July 31, 2014|url=http://www.pressheretv.com/gil-elbaz-and-common-crawl/|archive-url=https://web.archive.org/web/20130413035956/http://www.pressheretv.com/gil-elbaz-and-common-crawl/|archive-date=April 13, 2013|url-status=dead}}</ref>

Common Crawl was founded by Gil Elbaz.<ref name="latimes" /><ref name="pressheretv" /> Its data was used primarily by researchers and some startups until the 2020s, when AI companies started using it to train large language models.<ref name=":1">{{Cite web |last=Reisner |first=Alex |date=2025-11-04 |title=The Company Quietly Funneling Paywalled Articles to AI Developers |url=https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/ |access-date=2025-11-14 |website=The Atlantic |language=en}}</ref> In November 2025, an investigation by ''The Atlantic'' revealed that Common Crawl had misled publishers when it claimed that it respected paywalls in its scraping, and that it was not honoring requests from publishers to have their content removed from its databases.<ref name=":1" />

==History==
Common Crawl was founded in 2007 in San Francisco.<ref name=":3">{{Cite news |last=Knibbs |first=Kate |title=Publishers Target Common Crawl In Fight Over AI Training Data |url=https://www.wired.com/story/the-fight-against-ai-comes-to-a-foundational-data-set/ |access-date=2025-12-10 |work=Wired |language=en-US |issn=1059-1028|date=June 13, 2024}}</ref> It began publishing its crawls in 2011.<ref>{{Cite journal |last=Potts |first=Jason |last2=Torrance |first2=Andrew |last3=Harhoff |first3=Dietmar |last4=von Hippel |first4=Eric |date=March 2024 |title=Profiting from Data Commons: Theory, Evidence, and Strategy Implications |url=https://pubsonline.informs.org/doi/10.1287/stsc.2021.0080 |journal=Strategy Science |language=en |volume=9 |issue=1 |pages=1–17 |doi=10.1287/stsc.2021.0080 |issn=2333-2050}}</ref>{{Additional citation needed|date=May 2026}}

By 2013, sites such as TinEye were building their products on Common Crawl.<ref name=":4">{{Cite web |last=Brandom |first=Russell |date=2013-03-01 |title=Common Crawl: going after Google on a non-profit budget |url=https://www.theverge.com/2013/3/1/4043374/common-crawl-going-after-google-on-a-non-profit-budget |access-date=2025-12-10 |website=The Verge |language=en-US}}</ref><ref name="technologyreview" /> The crawl reduces the reliance of companies and researchers on Google, which has the biggest dataset.<ref name=":4" /><ref name="technologyreview" /> Common Crawl was designed to offer more and fresher data, in a form more efficient to analyze and use, than the Wayback Machine created by the Internet Archive.<ref name="technologyreview" /><ref name=":4" />

By 2015, Common Crawl held 1.8 billion webpages; it had started by crawling a list of URLs donated by the search engine Blekko.<ref name=":5">{{Cite journal |last=Hayes |first=Brian |date=2015 |title=Computing Science: Crawling toward a Wiser Web |url=https://www.jstor.org/stable/43707766 |journal=American Scientist |volume=103 |issue=3 |pages=184–187 |issn=0003-0996}}</ref> It uses Amazon Web Services, which provides some of its services for free, keeping computing costs to an average of $2,000–4,000 per month.<ref name=":5" /> The Common Crawl website listed 30 studies based on Common Crawl data.<ref name=":5" />

Before 2023, Common Crawl was little known outside the academic researchers who used its data.<ref name=":3" /> Common Crawl received its first requests to redact information in 2023 and increasingly started seeing its crawler, CCBot, blocked.<ref name=":3" /> In 2023, it began receiving significant financial support from AI companies, including Anthropic and OpenAI, each of which donated $250,000.<ref name=":1" /> Its data was also used to train Google DeepMind's large language model Gemini.<ref>{{cite news |last=Cuesta |first=Albert |title=L'FBI, a la caça del web arxivat que incomoda els mitjans |work=Ara |date=15 November 2025 |accessdate=2 March 2026 |url=https://www.ara.cat/media/l-fbi-caca-web-arxivat-incomoda-mitjans_1_5561784.html |archive-url=https://web.archive.org/web/20251117060008mp_/https://www.ara.cat/media/l-fbi-caca-web-arxivat-incomoda-mitjans_1_5561784.html |archive-date=17 November 2025 |url-status=live |lang=ca}}</ref> By April 2023, Common Crawl was capturing 3.1 billion webpages, with an estimated 5% of pages before 2021 containing hate speech or slurs.<ref>{{Cite journal |last=Soos |first=Carlin |last2=Haroutunian |first2=Levon |date=2024 |title=On the Question of Authorship in Large Language Models |url=https://www.imrpress.com/journal/KO/51/2/10.5771/0943-7444-2024-2-83 |journal=Knowledge Organization |volume=51 |issue=2 |pages=83–95 |doi=10.5771/0943-7444-2024-2-83 |issn=0943-7444}}</ref>

As of 2024, Common Crawl had been cited in more than 10,000 academic studies.<ref name=":2">{{Cite news |last=Roose |first=Kevin |date=July 19, 2024 |title=The Data That Powers A.I. Is Disappearing Fast |url=https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html |work=New York Times}}</ref> By 2024, The Pile and Common Crawl were the two main datasets used to train AI models.<ref name=":42">{{Cite news |last=Gilbertson|first=Annie|title=Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI|url=https://www.wired.com/story/youtube-training-data-apple-nvidia-anthropic/|access-date=2026-01-29|work=Wired|language=en-US|issn=1059-1028|date=July 16, 2024}}</ref><ref>{{Cite web |last=Anthony |first=Aubra |last2=Sharma |first2=Lakshmee |last3=Noor |first3=Elina |date=2024-04-30 |title=Advancing a More Global Agenda for Trustworthy Artificial Intelligence |url=https://carnegieendowment.org/research/2024/04/advancing-a-more-global-agenda-for-trustworthy-artificial-intelligence |access-date=2026-01-29 |website=Carnegie Endowment for International Peace |language=en}}</ref>

In November 2025, an investigation by technology journalist Alex Reisner for ''The Atlantic'' revealed that Common Crawl had misled publishers when it claimed that it respected paywalls in its scraping and when it said that it was honoring requests from publishers to have their content removed from its databases.<ref name=":1" /> The public search function on its website returned misleading results, showing no entries for websites that had requested removal of their archives when in fact those sites were still included in the scrapes used by AI companies.<ref name=":1" /> Reisner found that, as of 2025, CCBot was the bot most widely blocked by the top 1,000 websites.<ref name=":1" />

A 2026 article in LWN.net noted that services like Common Crawl can limit the costs that scraping imposes on websites, because companies and researchers can download the data from Common Crawl instead of scraping it themselves.<ref>{{Cite news |last=Alden |first=Daroc |date=2026-02-12 |title=Poisoning scraperbots with iocaine |url=https://lwn.net/Articles/1056953/ |access-date=2026-05-09 |work=LWN.net |language=en-US}}</ref>

In April 2026, Common Crawl began experimentally distributing its data through a Hugging Face Storage Bucket, in addition to its standard storage on Amazon S3.<ref>{{Cite web |title=Common Crawl - Blog - April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket |url=https://commoncrawl.org/blog/april-2026-crawl-archive-now-available-in-a-hugging-face-storage-bucket |access-date=2026-05-24 |website=commoncrawl.org |language=en}}</ref>

== Organization ==
Peter Norvig and Joi Ito have served on the advisory board.<ref name="technologyreview">{{cite news |author=Tom Simonite |date=January 23, 2013 |title=A Free Database of the Entire Web May Spawn the Next Google |url=http://www.technologyreview.com/news/509931/a-free-database-of-the-entire-web-may-spawn-the-next-google/ |url-status=dead |archive-url=https://web.archive.org/web/20140626114525/http://www.technologyreview.com/news/509931/a-free-database-of-the-entire-web-may-spawn-the-next-google/ |archive-date=June 26, 2014 |access-date=July 31, 2014 |publisher=MIT Technology Review}}</ref> Rich Skrenta is the executive director.<ref name=":1" />

It was funded almost exclusively by the Elbaz Family Foundation Trust until 2023, when it started receiving donations from the AI industry.<ref name=":1" />

== Refined versions ==
A number of organizations take raw Common Crawl data and refine it into datasets that exclude objectionable content or are otherwise higher-quality for their purposes, such as FineWeb, DCLM, and C4.<ref name=":1" />

=== Colossal Clean Crawled Corpus ===
Google's version of the Common Crawl is called the '''Colossal Clean Crawled Corpus''', or '''C4''' for short. It was constructed for the training of the T5 language model series in 2019.<ref name=":0" /> As of 2023, there were concerns over copyrighted content and racist content in C4.<ref>{{Cite news |last=Hern |first=Alex |date=2023-04-20 |title=Fresh concerns raised over sources of training material for AI systems |url=https://www.theguardian.com/technology/2023/apr/20/fresh-concerns-training-material-ai-systems-facist-pirated-malicious |access-date=2023-04-21 |work=The Guardian |language=en-GB |issn=0261-3077}}</ref><ref name=":0">{{Cite web |last=Quach |first=Katyanna |date=2023-04-20 |title=Google's C4 ML training data drew from 4chan, racist sources |url=https://www.theregister.com/software/2023/04/20/googles-c4-ml-training-data-drew-from-4chan-racist-sources/1441370 |access-date=2026-05-09 |website=theregister |language=en-US}}</ref> A 2024 study found that 45% of the content was explicitly restricted by websites' terms of service from being used for purposes such as AI training by for-profit companies.<ref name=":2" />

==References==
{{reflist}}

==External links==
* {{Official website|https://commoncrawl.org/}}
* [https://github.com/commoncrawl/ Common Crawl GitHub Repository] with the crawler, libraries and example code
* [https://projects.propublica.org/nonprofits/organizations/261635908 ProPublica Nonprofit Explorer page] - houses 990 filings

Category:Internet-related organizations
Category:Web archiving
Category:Web archiving initiatives
Category:Open-source artificial intelligence