# Statistical database

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Statistical_database
> Markdown URL: https://mediated.wiki/source/Statistical_database.md
> Source: https://en.wikipedia.org/wiki/Statistical_database
> Source revision: 1268465996
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)



A **statistical database** is a [database](/source/Database) used for statistical analysis. It is an [OLAP](/source/OLAP) (online analytical processing) system rather than an [OLTP](/source/OLTP) (online transaction processing) system. Modern decision-support databases and classical statistical databases are often closer to the [relational model](/source/Relational_model) than to the [multidimensional](/source/Multidimensional_database) model commonly used in OLAP systems today.

Statistical databases typically contain parameter data and the measured data for these parameters. For example, parameter data consists of the different values for varying conditions in an experiment (e.g., temperature, time). The measured data (or variables) are the measurements taken in the experiment under these varying conditions.

Many statistical databases are sparse, with many null or zero values. It is not uncommon for a statistical database to be 40% to 50% sparse. There are two options for dealing with the sparseness: (1) keep the null values and rely on compression techniques to squeeze them out, or (2) remove the entries that contain only null values.
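As a rough illustration of the second option, the following Python sketch measures sparseness and removes entries whose measured values are all null. The record layout, the field names (`temperature`, `time`, `yield_pct`, `pressure`) and the helper functions are hypothetical and chosen only for this example; they are not taken from any particular system.

```python
from typing import Optional

# Hypothetical experiment records. "temperature" and "time" are parameter data;
# "yield_pct" and "pressure" are measured data. None marks a missing measurement.
Record = dict[str, Optional[float]]

MEASURES = ("yield_pct", "pressure")

records: list[Record] = [
    {"temperature": 20.0, "time": 1.0, "yield_pct": 41.5, "pressure": None},
    {"temperature": 20.0, "time": 2.0, "yield_pct": None, "pressure": None},
    {"temperature": 30.0, "time": 1.0, "yield_pct": 47.2, "pressure": 1.1},
    {"temperature": 30.0, "time": 2.0, "yield_pct": None, "pressure": None},
]


def sparsity(rows: list[Record]) -> float:
    """Return the fraction of measured cells that are null."""
    cells = [row[m] for row in rows for m in MEASURES]
    return sum(value is None for value in cells) / len(cells)


def drop_all_null(rows: list[Record]) -> list[Record]:
    """Remove entries in which every measured value is null."""
    return [row for row in rows if any(row[m] is not None for m in MEASURES)]


dense = drop_all_null(records)
print(f"before: {sparsity(records):.1%} sparse, {len(records)} entries")
print(f"after:  {sparsity(dense):.1%} sparse, {len(dense)} entries")
```

Entries that still contain at least one measurement are kept, so some null cells remain after the clean-up.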

Statistical databases often incorporate support for advanced statistical analysis techniques, such as correlations, which go beyond [SQL](/source/SQL). They also pose unique [security](/source/Security) concerns, which were the focus of much research, particularly in the late 1970s and early to mid-1980s.
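To give a sense of the kind of analysis meant here, the sketch below computes a Pearson correlation coefficient in plain Python. The function name `pearson` and the sample values are illustrative assumptions; a real statistical database would typically provide such a function as a built-in operation rather than leave it to client code.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical parameter (temperature) and measured value (yield).
temperature = [20.0, 25.0, 30.0, 35.0]
yield_pct = [41.5, 44.0, 47.2, 49.9]
print(round(pearson(temperature, yield_pct), 3))
```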

## Privacy in statistical databases

In a statistical database, it is often desirable to allow query access only to aggregate data, not to individual records. Securing such a database is a difficult problem, because a determined user can combine several aggregate queries to derive information about a single individual.
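A minimal sketch of how such an inference can work is shown below. The data, the `sum_salary` and `count` query functions, and the attacker's background knowledge are all hypothetical. Each query is an aggregate over several records, yet the difference between two of them exposes one person's value.

```python
from typing import Callable

Row = dict[str, object]

# Hypothetical records; the attacker can only issue aggregate queries.
EMPLOYEES: list[Row] = [
    {"name": "A", "dept": "R&D", "age": 34, "salary": 52_000},
    {"name": "B", "dept": "R&D", "age": 41, "salary": 58_000},
    {"name": "C", "dept": "R&D", "age": 63, "salary": 91_000},
    {"name": "D", "dept": "Sales", "age": 29, "salary": 47_000},
    {"name": "E", "dept": "R&D", "age": 38, "salary": 55_000},
]


def sum_salary(where: Callable[[Row], bool]) -> int:
    """Aggregate query: SUM(salary) over rows matching the predicate."""
    return sum(row["salary"] for row in EMPLOYEES if where(row))


def count(where: Callable[[Row], bool]) -> int:
    """Aggregate query: COUNT(*) over rows matching the predicate."""
    return sum(1 for row in EMPLOYEES if where(row))


# Assumed background knowledge: the target is the only R&D employee aged 60+.
def in_rnd(row: Row) -> bool:
    return row["dept"] == "R&D"


def in_rnd_under_60(row: Row) -> bool:
    return row["dept"] == "R&D" and row["age"] < 60


# Both query sets contain several records, so neither looks suspicious alone.
print(count(in_rnd), count(in_rnd_under_60))

# Subtracting the two sums isolates the target's salary.
inferred_salary = sum_salary(in_rnd) - sum_salary(in_rnd_under_60)
print(inferred_salary)
```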

Some common approaches are:

- allowing only aggregate queries (SUM, COUNT, AVG, STDEV, etc.)

- returning only the partition that a sensitive value such as income belongs to (e.g. 35k-40k), rather than the exact value

- returning imprecise counts (e.g. indicating that 130-150 records matched a query, rather than exactly 141)

- disallowing overly selective WHERE clauses

- auditing all user queries, so that users who misuse the system can be investigated

- using intelligent agents to detect inappropriate use of the system automatically
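Three of the approaches listed above (partitioned values, imprecise counts, and rejecting overly selective predicates) can be sketched in Python as follows. The threshold `MIN_QUERY_SET_SIZE`, the function names, and the bucket widths are assumptions made for illustration, not values prescribed by any system.

```python
from typing import Callable

Row = dict[str, int]

MIN_QUERY_SET_SIZE = 3  # hypothetical threshold for "overly selective"


def guarded_count(rows: list[Row], where: Callable[[Row], bool]) -> int:
    """COUNT(*) that refuses to answer when too few rows match."""
    n = sum(1 for row in rows if where(row))
    if n < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small")
    return n


def count_range(n: int, step: int = 10) -> tuple[int, int]:
    """Report a count as a range around the nearest multiple of `step`."""
    center = (n + step // 2) // step * step
    return (max(center - step, 0), center + step)


def income_partition(income: int, width: int = 5_000) -> str:
    """Report only the partition an income falls into, not its exact value."""
    low = income // width * width
    return f"{low // 1000}k-{(low + width) // 1000}k"


rows = [{"income": 37_000}, {"income": 52_000}, {"income": 38_500}, {"income": 61_000}]

print(income_partition(37_000))   # 35k-40k
print(count_range(141))           # (130, 150)
print(guarded_count(rows, lambda r: r["income"] < 60_000))
```

A predicate that matches fewer than `MIN_QUERY_SET_SIZE` rows, such as `lambda r: r["income"] > 60_000` on the sample data, makes `guarded_count` raise `PermissionError` instead of returning a result.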

For many years, research in this area was stalled. The prevailing view in 1980 was summed up as follows:

> The conclusion is that statistical databases are almost always subject to compromise. Severe restrictions on allowable query set sizes will render the database useless as a source of statistical information but will not secure the confidential records.[1]

But in 2006, [Cynthia Dwork](/source/Cynthia_Dwork) defined the field of [differential privacy](/source/Differential_privacy), building on work that started appearing in 2003. While showing that some semantic security goals, related to the work of [Tore Dalenius](/source/Tore_Dalenius), were impossible, this research identified new techniques for limiting the increased privacy risk that results from including private data in a statistical database. These techniques make it possible in many cases to provide very accurate statistics from the database while still ensuring high levels of privacy.[2][3]
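One standard technique from the differential privacy literature, not described in this article itself, is the Laplace mechanism: noise drawn from a Laplace distribution is added to a query result, with the noise scale set to the query's sensitivity divided by the privacy parameter ε. The sketch below applies it to a counting query, whose sensitivity is 1. The function names and sample data are hypothetical, and the floating-point sampling shown here is for illustration only, not a hardened implementation.

```python
import random
from typing import Callable

Row = dict[str, int]


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    # expovariate takes a rate, so rate 1/scale gives exponentials with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(
    rows: list[Row],
    where: Callable[[Row], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Noisy COUNT(*): true count plus Laplace noise of scale 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for row in rows if where(row))
    # Adding or removing one record changes a count by at most 1 (sensitivity 1).
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: 100 records with ages 20..69, two records per age.
rows = [{"age": 20 + i % 50} for i in range(100)]
rng = random.Random(0)  # seeded only to make the example repeatable

noisy = private_count(rows, lambda r: r["age"] >= 60, epsilon=0.5, rng=rng)
print(f"noisy count: {noisy:.1f} (true count is 20)")
```

Smaller values of `epsilon` add more noise and give stronger privacy; larger values give more accurate answers.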

## References

1. Dorothy E. Denning, Peter J. Denning, and Mayer D. Schwartz, "The Tracker: A Threat to Statistical Database Security", *ACM Transactions on Database Systems (TODS)*, volume 4, issue 1 (March 1979), pages 76-96, [doi](/source/Doi_(identifier)):[10.1145/320064.320069](https://doi.org/10.1145%2F320064.320069).

2. Hilton, Michael. ["Differential Privacy: A Historical Survey"](https://web.archive.org/web/20170301180826/https://pdfs.semanticscholar.org/4c99/097af05e8de39370dd287c74653b715c8f6a.pdf) (PDF). [S2CID](/source/S2CID_(identifier)) [16861132](https://api.semanticscholar.org/CorpusID:16861132). Archived from [the original](https://pdfs.semanticscholar.org/4c99/097af05e8de39370dd287c74653b715c8f6a.pdf) (PDF) on 2017-03-01.

3. Dwork, Cynthia (2008-04-25). ["Differential Privacy: A Survey of Results"](https://www.microsoft.com/en-us/research/publication/differential-privacy-a-survey-of-results/). In Agrawal, Manindra; Du, Dingzhu; Duan, Zhenhua; Li, Angsheng (eds.). *Theory and Applications of Models of Computation*. Lecture Notes in Computer Science. Vol. 4978. Springer Berlin Heidelberg. pp. 1–19. [doi](/source/Doi_(identifier)):[10.1007/978-3-540-79228-4_1](https://doi.org/10.1007%2F978-3-540-79228-4_1). [ISBN](/source/ISBN_(identifier)) [9783540792277](https://en.wikipedia.org/wiki/Special:BookSources/9783540792277) – via Microsoft.

## Further reading

An important series of conferences in this field:

- [Statistical and Scientific Database Management (SSDBM)](http://www.informatik.uni-trier.de/~ley/db/conf/ssdbm/)

Some key papers in this field:

1. Dorothy E. Denning, "Secure statistical databases with random sample queries", *ACM Transactions on Database Systems (TODS)*, volume 5, issue 3 (September 1980), pages 291-315, [doi](/source/Doi_(identifier)):[10.1145/320613.320616](https://doi.org/10.1145%2F320613.320616).

2. Wiebren de Jonge, "Compromising statistical databases responding to queries about means", *ACM Transactions on Database Systems (TODS)*, volume 8, issue 1 (March 1983), pages 60-80, [doi](/source/Doi_(identifier)):[10.1145/319830.319834](https://doi.org/10.1145%2F319830.319834).

3. Dorothy E. Denning and Jan Schlörer, "A fast procedure for finding a tracker in a statistical database", *ACM Transactions on Database Systems (TODS)*, volume 5, issue 1 (March 1980), pages 88-102, [doi](/source/Doi_(identifier)):[10.1145/320128.320138](https://doi.org/10.1145%2F320128.320138).

4. A. Shoshani, "Statistical Databases: Characteristics, Problems, and some Solutions", in *Proceedings of the 8th International Conference on Very Large Data Bases*, San Francisco, CA, USA, 1982, pages 208-222.


---
Adapted from the Wikipedia article [Statistical database](https://en.wikipedia.org/wiki/Statistical_database) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Statistical_database?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.
