# Datasaurus dozen

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Datasaurus_dozen
> Markdown URL: https://mediated.wiki/source/Datasaurus_dozen.md
> Source: https://en.wikipedia.org/wiki/Datasaurus_dozen
> Source revision: 1353914246
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

{{Short description|Collection of statistical data sets}}
The '''Datasaurus dozen''' comprises [thirteen](/source/Baker's_dozen) [data set](/source/data_set)s whose simple [descriptive statistics](/source/descriptive_statistics) agree to two decimal places, yet which have very different [distributions](/source/Probability_distribution) and look very different when [graphed](/source/Plot_(graphics)).<ref name="Matejka2017">{{Cite book |last1=Matejka |first1=Justin |last2=Fitzmaurice |first2=George |chapter=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing |date=2017-05-02 |title=Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems |chapter-url=https://doi.org/10.1145/3025453.3025912 |series=CHI '17 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=1290–1294 |doi=10.1145/3025453.3025912 |isbn=978-1-4503-4655-9 |archive-url=https://research.autodesk.com/app/uploads/2023/03/same-stats-different-graphs.pdf_rec2hRjLLGgM7Cn2T.pdf |archive-date=2017-05-02}}</ref> It was inspired by the smaller [Anscombe's quartet](/source/Anscombe's_quartet), created in 1973.

== Data ==
The following table contains summary statistics for all thirteen data sets.
{| class="wikitable"
!Property
!Value
!Accuracy
|-
|Number of elements
|142
|exact
|-
|[Mean](/source/Mean) of ''x''
|54.26
|to 2 decimal places
|-
|Sample standard deviation of ''x'': ''s''<sub>''x''</sub>
|16.76
|to 2 decimal places
|-
|Mean of ''y''
|47.83
|to 2 decimal places
|-
|Sample standard deviation of ''y'': ''s''<sub>''y''</sub>
|26.93
|to 2 decimal places
|-
|[Correlation](/source/Correlation) between ''x'' and ''y''
| −0.06
|to 2 decimal places
|-
|[Linear regression](/source/Linear_regression) line
|''y''&nbsp;=&nbsp;53&nbsp;−&nbsp;0.1''x''
|to 0 and 1 decimal places, respectively
|-
|[Coefficient of determination](/source/Coefficient_of_determination) of the linear regression: <math>R^2</math>
|0.004
|to 3 decimal places
|}
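
The properties listed in the table can be computed for any set of (''x'', ''y'') points with a few lines of code. The following is a minimal Python sketch, not taken from the source: the function name <code>describe</code> and the five toy points are hypothetical, and the actual Datasaurus data (142 points per set) would have to be loaded separately. The function reports both the sample variance and the sample standard deviation so that either measure of spread can be compared.

```python
import math

def describe(points):
    """Summary statistics for a list of (x, y) pairs (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample (n - 1) variances and covariance.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    r = cov / math.sqrt(var_x * var_y)   # Pearson correlation
    slope = cov / var_x                  # least-squares line y = intercept + slope * x
    intercept = mean_y - slope * mean_x
    return {
        "n": n,
        "mean_x": mean_x, "mean_y": mean_y,
        "var_x": var_x, "var_y": var_y,
        "sd_x": math.sqrt(var_x), "sd_y": math.sqrt(var_y),
        "correlation": r,
        "slope": slope, "intercept": intercept,
        "r_squared": r ** 2,             # for simple linear regression, R^2 = r^2
    }

# Toy data for illustration only; these are not Datasaurus points.
toy_points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
for name, value in describe(toy_points).items():
    print(f"{name}: {round(value, 2)}")
```

Applying such a function to each of the thirteen data sets and rounding the results is one way to check that their summaries agree while their plots do not.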
''Figure: The thirteen data sets in the Datasaurus Dozen, visualized and summarized. The numerical summaries are similar, while the graphical representations are not.''

The thirteen data sets are labeled as follows:
{{div col|colwidth=8em}}
* away
* bullseye
* circle
* dino
* dots
* h_lines
* high_lines
* slant_down
* slant_up
* star
* v_line
* wide_lines
* x_shape
{{div col end}}
Like [Anscombe's quartet](/source/Anscombe's_quartet), the Datasaurus dozen was designed to illustrate the importance of graphing a data set before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic data sets.<ref>{{cite book |last=Elert |first=Glenn |year=2021 |chapter=Linear Regression - Practice |chapter-url=http://physics.info/linear-regression/practice.shtml#4 |title=The Physics Hypertextbook}}</ref><ref>{{cite book |last=Janert |first=Philipp K. |url=https://archive.org/details/isbn_9780596802356/page/65 |title=Data Analysis with Open Source Tools |publisher=[O'Reilly Media](/source/O'Reilly_Media) |year=2010 |isbn=978-0-596-80235-6 |pages=[https://archive.org/details/isbn_9780596802356/page/65 65–66]}}</ref><ref>{{cite book |last1=Chatterjee |first1=Samprit |title=Regression Analysis by Example |last2=Hadi |first2=Ali S. |publisher=John Wiley and Sons |year=2006 |isbn=0-471-74696-7 |page=91}}</ref><ref>{{cite book |last1=Saville |first1=David J. |title=Statistical Methods: The geometric approach |last2=Wood |first2=Graham R. |publisher=[Springer](/source/Springer_Science%2BBusiness_Media) |year=1991 |isbn=0-387-97517-9 |page=418}}</ref><ref name="Matejka2017" /><ref>{{cite book |last=Tufte |first=Edward R. |url=https://archive.org/details/visualdisplayofq00tuft |title=The Visual Display of Quantitative Information |publisher=Graphics Press |year=2001 |isbn=0-9613921-4-2 |edition=2nd |location=Cheshire, CT |authorlink=Edward Tufte}}</ref>

== Creation ==
[[File:Datasaurus.png|thumb|The dinosaur data set created by [Alberto Cairo](/source/Alberto_Cairo) that inspired the creation of the Datasaurus Dozen]]The first data set, in the shape of a [Tyrannosaurus](/source/Tyrannosaurus), was constructed in 2016 by [Alberto Cairo](/source/Alberto_Cairo) and inspired the rest of the Datasaurus dozen.<ref name="Cairo">{{Cite web |last=Cairo |first=Alberto |title=Download the Datasaurus: Never trust summary statistics alone; always visualize your data |url=http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html |access-date=2024-02-01 |archive-date=2024-06-20 |archive-url=https://web.archive.org/web/20240620205540/http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html |url-status=dead }}</ref><ref>{{Cite web |last=Murtagh |first=Jack |date=2024-02-01 |title=What This Graph of a Dinosaur Can Teach Us about Doing Better Science |url=https://www.scientificamerican.com/article/what-this-graph-of-a-dinosaur-can-teach-us-about-doing-better-science/ |access-date=2024-03-08 |website=Scientific American |language=en}}</ref> Maarten Lambrechts proposed that this data set also be called "Anscombosaurus".<ref name="Cairo" />

Justin Matejka and George Fitzmaurice of [Autodesk](/source/Autodesk) then created twelve more data sets to accompany it. Unlike Anscombe's quartet, whose method of construction is not known,<ref name="ChatterjeeFirat">{{cite journal |last1=Chatterjee |first1=Sangit |last2=Firat |first2=Aykut |year=2007 |title=Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset |journal=[The American Statistician](/source/The_American_Statistician) |volume=61 |issue=3 |pages=248–254 |doi=10.1198/000313007X220057 |jstor=27643902 |s2cid=121163371}}</ref> these data sets were generated by [simulated annealing](/source/simulated_annealing): the authors repeatedly applied small random perturbations to the points, biased towards the desired shape, and kept a perturbation only if the summary statistics remained sufficiently similar to those of the starting data set. Each shape took 200,000 iterations of perturbations to complete.<ref name="Matejka2017" />

The [pseudocode](/source/pseudocode) for this algorithm is as follows:
<syntaxhighlight lang="text">
current_ds ← initial_ds
for x iterations, do:
    test_ds ← perturb(current_ds, temp)
    if similar_enough(test_ds, initial_ds):
        current_ds ← test_ds

function perturb(ds, temp):
    loop:
        test ← move_random_points(ds)
        if fit(test) > fit(ds) or temp > random():
            return test
</syntaxhighlight>

where
* <code>initial_ds</code> is the seed data set
* <code>current_ds</code> is the latest version of the data set
* <code>fit()</code> is a function used to check whether moving the points gets closer to the desired shape
* <code>temp</code> is the temperature of the simulated annealing algorithm
* <code>similar_enough()</code> is a function that checks whether the statistics for the two given data sets are similar enough
* <code>move_random_points()</code> is a function that randomly moves data points
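
The pseudocode above can be turned into a small runnable program. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the target shape (a circle whose centre and radius are derived from the initial statistics), the fit measure, the step size of 0.1, the linear cooling schedule from 0.4 to 0.01, the rounding-based similarity test, the random uniform seed data, and the helper names <code>summary</code>, <code>make_circle_fit</code> and <code>anneal</code> are all choices made for this example. It also runs far fewer iterations than the 200,000 reported for each shape.

```python
import math
import random
import statistics

def summary(ds):
    """Means, sample standard deviations and Pearson correlation of (x, y) points."""
    xs, ys = zip(*ds)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in ds) / (len(ds) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def similar_enough(test_ds, initial_ds, places=2):
    """True if all five statistics agree after rounding (assumed criterion)."""
    return all(round(a, places) == round(b, places)
               for a, b in zip(summary(test_ds), summary(initial_ds)))

def make_circle_fit(cx, cy, radius):
    """Build fit(ds): the negated mean distance of the points from a circle,
    so a larger value means the data set is closer to the target shape."""
    def fit(ds):
        return -statistics.fmean(
            abs(math.hypot(x - cx, y - cy) - radius) for x, y in ds)
    return fit

def perturb(ds, temp, fit, rng, step=0.1):
    """Nudge one random point; return the result if it fits better,
    or with probability `temp` even if it does not."""
    current_fit = fit(ds)
    while True:
        test = list(ds)
        i = rng.randrange(len(test))
        x, y = test[i]
        test[i] = (x + rng.gauss(0, step), y + rng.gauss(0, step))
        if fit(test) > current_fit or temp > rng.random():
            return test

def anneal(initial_ds, fit, iterations=2000, t_max=0.4, t_min=0.01, seed=0):
    """Main loop of the pseudocode, with an assumed linear cooling schedule."""
    rng = random.Random(seed)
    current_ds = list(initial_ds)
    for k in range(iterations):
        temp = t_max + (t_min - t_max) * k / max(1, iterations - 1)
        test_ds = perturb(current_ds, temp, fit, rng)
        if similar_enough(test_ds, initial_ds):
            current_ds = test_ds
    return current_ds

# Seed data: 142 uniformly random points (an assumption; not the dinosaur).
seed_rng = random.Random(42)
initial_ds = [(seed_rng.uniform(0, 100), seed_rng.uniform(0, 100))
              for _ in range(142)]
mx, my, sx, sy, _ = summary(initial_ds)
# A circle centred on the mean with radius sqrt(sx^2 + sy^2) is roughly
# compatible with the means and standard deviations being preserved.
fit = make_circle_fit(mx, my, math.hypot(sx, sy))
result = anneal(initial_ds, fit)
print("fit before:", fit(initial_ds), "fit after:", fit(result))
print("statistics preserved:", similar_enough(result, initial_ds))
```

Because a candidate data set is accepted only when <code>similar_enough()</code> holds, the rounded statistics of the result match those of the seed by construction; more iterations move the points closer to the target shape.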

== See also ==
* [Exploratory data analysis](/source/Exploratory_data_analysis)
* [Goodness of fit](/source/Goodness_of_fit)
* [Regression validation](/source/Regression_validation)
* [Simpson's paradox](/source/Simpson's_paradox)
* [Statistical model validation](/source/Statistical_model_validation)
* [Anscombe's quartet](/source/Anscombe's_quartet)

==References==
{{Reflist}}

==External links==
* [https://www.autodeskresearch.com/publications/samestats Animated examples from Autodesk] for the Datasaurus Dozen datasets
* [https://cran.r-project.org/package=datasauRus datasauRus], datasets from the Datasaurus Dozen in [R](/source/R_(programming_language))
* [https://www.openintro.org/data/index.php?data=datasaurus OpenIntro], the Datasaurus Dozen in [CSV](/source/Comma-separated_values) and [tab-delimited files](/source/Tab-separated_values)

Category:Misuse of statistics
Category:Statistical charts and diagrams
Category:Statistical data sets
Category:Data and information visualization

---
Adapted from the Wikipedia article [Datasaurus dozen](https://en.wikipedia.org/wiki/Datasaurus_dozen) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Datasaurus_dozen?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.