Data orientation

'''Data orientation''' is the representation of tabular data in a linear memory model such as in-disk or in-memory. The two most common representations are '''column-oriented''' (columnar format) and '''row-oriented''' (row format).<ref name="vs2008">{{cite book|doi=10.1145/1376616.1376712|title=Proceedings of the 2008 ACM SIGMOD international conference on Management of data |chapter=Column-stores vs. Row-stores: How different are they really? |date=2008 |last1=Abadi |first1=Daniel J. |last2=Madden |first2=Samuel R. |last3=Hachem |first3=Nabil |pages=967–980 |isbn=978-1-60558-102-6 }}</ref><ref name="OLTPvsOLAP2012">{{cite journal| title=Compacting Transactional Data in Hybrid OLTP&OLAP Databases| doi=10.14778/2350229.2350258| journal=Proceedings of the VLDB Endowment| date=2012| last1=Funke| first1=Florian| last2=Kemper| first2=Alfons| last3=Neumann| first3=Thomas| volume=5| issue=11| pages=1424–1435}}</ref>

The choice of data orientation is a trade-off and an architectural decision in databases, query engines, and numerical simulations.<ref name="vs2008"/> As a result of these tradeoffs, row-oriented formats are more commonly used in online transaction processing (OLTP) and column-oriented formats are more commonly used in online analytical processing (OLAP).<ref name="OLTPvsOLAP2012"/>

Examples of column-oriented formats include Apache ORC,<ref name="orc">{{cite web|title=Apache ORC|url=https://orc.apache.org/|access-date=2024-05-21}}</ref> Apache Parquet,<ref name="parquet">{{cite web|title=Apache Parquet|url=https://parquet.apache.org/|access-date=2024-05-21}}</ref> Apache Arrow,<ref name="arrow">{{cite web|title=Apache Arrow|url=https://arrow.apache.org/|access-date=2024-05-21}}</ref> formats used by BigQuery, Amazon Redshift and Snowflake. Predominant examples of row-oriented formats include CSV, formats used in most relational databases (Oracle, MySQL etc.), the in-memory format of Apache Spark, and Apache Avro.<ref name="avro">{{cite web|title=Apache Avro|url=https://avro.apache.org/|access-date=2024-05-21}}</ref>

== Description ==

Tabular data is two dimensional — data is modeled as rows and columns. However, computer systems represent data in a linear memory model, both in-disk and in-memory.<ref>{{cite journal|doi=10.1016/j.diin.2014.05.011|journal=Digital Investigation|title=In lieu of swap: Analyzing compressed RAM in Mac OS X and Linux|date=2014 |doi-access=free |last1=Richard |first1=Golden G. |last2=Case |first2=Andrew |volume=11 |pages=S3–S12 }}</ref><ref name=design_book>{{cite book|title=Principles of Computer System Design|first=Jerome H. Saltzer|last=M. Frans Kaashoek|date=2009 |publisher=Morgan Kaufmann |isbn=978-0-12-374957-4}}</ref><ref>{{cite web|title=Chapter 4 Process Address Space (Linux kernel documentation)|url=https://www.kernel.org/doc/gorman/html/understand/understand007.html|access-date=2024-05-21}}</ref> Therefore, a table in a linear memory model requires mapping its two-dimensional scheme into a one-dimensional space. Data orientation is to the decision taken in this mapping. There are two prominent mappings: row-oriented and column-oriented.<ref name="vs2008"/><ref name="OLTPvsOLAP2012"/>

== Row-oriented ==

In a row-oriented database, also known as a rowstore, the elements of the table

{| class="wikitable" ! column 1 !! column 2 || column 3 |- | {{font color|black|yellow|item 11}} || {{font color|white|blue|item 12}} || {{font color|white|red|item 13}} |- | {{font color|black|yellow|item 21}} || {{font color|white|blue|item 22}} || {{font color|white|red|item 23}} |}

are stored linearly as

{| class="wikitable" | {{font color|black|yellow|item 11}} || {{font color|white|blue|item 12}} || {{font color|white|red|item 13}} || {{font color|black|yellow|item 21}} || {{font color|white|blue|item 22}} || {{font color|white|red|item 23}} |}

I.e. each row of the table is located one after the other. In this orientation, values in the same row are close in space (e.g. similar address in an addressable space).

=== Examples ===

* CSV * Postgres in-disk and in-memory formats * Apache Spark in-memory format * Apache Avro * MySQL

== Column-oriented ==

In a column-oriented database, also known as a columnstore, the elements of the table

are stored linearly as

{| class="wikitable" | {{font color|black|yellow|item 11}} || {{font color|black|yellow|item 21}} || {{font color|white|blue|item 12}} || {{font color|white|blue|item 22}} || {{font color|white|red|item 13}} || {{font color|white|red|item 23}} |}

I.e. each column of the table is located one after the other. In this orientation, values on the same column are close in space (e.g. similar address in an addressable space).

=== Examples ===

* BigQuery's in-memory and storage formats * Apache Parquet * Apache ORC * Apache Arrow * DuckDB in-memory format * Pandas in-memory format * R dataframes<ref>{{Cite web |title=R Coding Basics - 9 Data Frames |url=https://www.gastonsanchez.com/R-coding-basics/4-02-data-frames.html |access-date=2025-01-19 |website=www.gastonsanchez.com}}</ref>

See list of column-oriented DBMSes for more examples.

== Tradeoff ==

Data orientation is an important architectural decision of systems handling data because it results in important tradeoffs in performance and storage.<ref name=design_book/> Below are selected dimensions of this tradeoff.

=== Random access ===

Row-oriented benefits from fast random access of rows. Column-oriented benefits from fast random access of columns. In both cases, this is the result of fewer page or cache misses when accessing the data.<ref name="design_book"/>

=== Insert ===

Row-oriented benefits from fast insertion of a new row. Column-oriented benefits from fast insertion of a new column.

This dimension is an important reason why row-oriented formats are more commonly used in online transaction processing (OLTP), as it results in faster transactions in comparison to column-oriented.<ref name="OLTPvsOLAP2012"/>

=== Conditional access ===

Row-oriented benefits from fast access under a filter. Column-oriented benefits from fast access under a projection.<ref name=parquet/><ref name=orc/>

=== Compute performance ===

Column-oriented benefits from fast analytics operations. This is the result of being able to leverage SIMD instructions.<ref name=arrow/>

=== Uncompressed size ===

Column-oriented benefits from smaller uncompressed size. This is the result of the possibility that this orientation offers to represent certain data types with dedicated encodings.<ref name=parquet/><ref name=orc/>

For example, a table of 128 rows with a Boolean column requires 128 bytes in a row-oriented format (one byte per Boolean) but 128 bits (16 bytes) in a column-oriented format (via a bitmap). Another example is the use of run-length encoding to encode a column.

=== Compressed size ===

Column-oriented benefits from smaller compressed size. This is the result of a higher homogeneity within a column than within multiple rows.<ref name=parquet/><ref name=orc/>

== Conversion and interchange ==

Because both orientations represent the same data, it is possible to convert a row-oriented dataset to a column-oriented dataset and vice versa at the expense of compute. In particular, advanced query engines often leverage each orientation's advantages, and convert from one orientation to the other as part of their execution. As an example, an Apache Spark query may # Read data from Apache Parquet (column-oriented) # Load it into the Spark internal in-memory format (row-oriented) # Convert it to Apache Arrow for a specific computation (column-oriented) # Write it to Apache Avro for streaming (row-oriented)

== References == {{reflist}}

{{DEFAULTSORT:Data Orientation}} Category:Database models