{{Short description|Diff and merge files on computers}}
{{About|data object, text, and file comparisons in computing||Comparison (disambiguation){{!}}Comparison}}
[[File:Kompare.png|thumb|The KDE diff tool ''Kompare'']]
In computing, '''file comparison''' is the calculation and display of the differences and similarities between data objects, typically text files such as source code.
The methods, implementations, and results are typically called a '''diff''',<ref>[http://catb.org/jargon/html/D/diff.html "diff", The Jargon File].</ref> after the Unix <code>diff</code> utility. The output may be presented in a graphical user interface or used as part of larger tasks in networks, file systems, or revision control.
Some widely used file comparison programs are diff, cmp, FileMerge, WinMerge, Beyond Compare, and File Compare.
Many text editors and word processors perform file comparison to highlight the changes to a file or document.
== Method types ==
Most file comparison tools find the longest common subsequence between two files. Any data not in the longest common subsequence is presented as an insertion, a deletion, or a change.
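The longest-common-subsequence approach can be illustrated with a minimal sketch. It assumes the textbook dynamic-programming formulation, which uses time and memory proportional to the product of the two file lengths; production tools typically use more economical algorithms. The function names <code>lcs_table</code> and <code>line_diff</code> are hypothetical and chosen only for this illustration.

```python
def lcs_table(a, b):
    """Build a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(old_lines, new_lines):
    """Return (tag, line) pairs: ' ' for common, '-' for deleted, '+' for inserted."""
    table = lcs_table(old_lines, new_lines)
    i = j = 0
    result = []
    while i < len(old_lines) and j < len(new_lines):
        if old_lines[i] == new_lines[j]:
            # The line belongs to the common subsequence.
            result.append((" ", old_lines[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping the old line keeps the longest remaining match.
            result.append(("-", old_lines[i]))
            i += 1
        else:
            result.append(("+", new_lines[j]))
            j += 1
    # Whatever is left on either side lies outside the common subsequence.
    result.extend(("-", line) for line in old_lines[i:])
    result.extend(("+", line) for line in new_lines[j:])
    return result


if __name__ == "__main__":
    for tag, line in line_diff(["a", "b", "c", "d"], ["a", "c", "d", "e"]):
        print(tag, line)
```

A deletion immediately followed by an insertion at the same position is what a display layer would usually present as a change.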
In 1978, Paul Heckel published an algorithm that identifies most moved blocks of text.<ref>{{Citation|first1=Paul|last1=Heckel|year=1978|title=A Technique for Isolating Differences Between Files|journal=Communications of the ACM|volume=21|issue=4|pages=264–268|url=http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf|access-date=2011-12-04|doi=10.1145/359460.359467|s2cid=207683976}}</ref> The algorithm is used in the IBM History Flow tool.<ref>{{Citation|last1=Viégas|first1=Fernanda B.|last2=Wattenberg|first2=Martin|last3=Dave|first3=Kushal|year=2004|title=Studying Cooperation and Conflict between Authors with history flow Visualizations|publisher=CHI|volume=6|pages=575–582|publication-place=Vienna|url=http://domino.watson.ibm.com/cambridge/research.nsf/58bac2a2a6b05a1285256b30005b3953/53240210b04ea0eb85256f7300567f7e/$FILE/TR2004-19.pdf|access-date=2011-12-01}}</ref> Other file comparison programs also find block moves.{{Clarify|date=January 2012}}
Some specialized file comparison tools find the longest increasing subsequence between two files.<ref name="PatentUS7031972B2">{{cite web |author1=Liwei Ren |author2=Jinsheng Gu |author3=Luosheng Peng |title=Algorithms for block-level code alignment of software binary files |url=https://patents.google.com/patent/US7031972 |website=Google Patents |publisher=USPTO |access-date=10 May 2019 |date=18 April 2006}}</ref> The rsync protocol uses a rolling hash function to compare two files on two remote computers with low communication overhead.
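The idea behind a rolling hash can be sketched as follows. This is an illustration modelled on the two-part weak checksum commonly described for rsync, not the protocol's actual implementation; the strong hash that rsync pairs with the weak checksum is replaced here by a direct byte comparison, and the names <code>weak_checksum</code>, <code>roll</code>, and <code>matching_offsets</code> are hypothetical.

```python
MOD = 1 << 16  # both checksum halves are kept modulo 2**16


def weak_checksum(block):
    """Compute the two checksum halves (a, b) of a block from scratch."""
    a = 0
    b = 0
    size = len(block)
    for offset, byte in enumerate(block):
        a = (a + byte) % MOD
        # The first byte has weight `size`, the last byte has weight 1.
        b = (b + (size - offset) * byte) % MOD
    return a, b


def roll(a, b, outgoing, incoming, window):
    """Slide the window one byte to the right in constant time."""
    a = (a - outgoing + incoming) % MOD
    b = (b - window * outgoing + a) % MOD
    return a, b


def matching_offsets(data, block):
    """Return every offset in data at which block occurs."""
    size = len(block)
    if size == 0 or len(data) < size:
        return []
    target = weak_checksum(block)
    a, b = weak_checksum(data[:size])
    matches = []
    for start in range(len(data) - size + 1):
        # The cheap checksum filters candidates; the byte comparison
        # stands in for the strong hash and rules out collisions.
        if (a, b) == target and data[start:start + size] == block:
            matches.append(start)
        if start + size < len(data):
            a, b = roll(a, b, data[start], data[start + size], size)
    return matches


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog"
    print(matching_offsets(text, b"the "))
```

Because each step updates the checksum from the byte that leaves the window and the byte that enters it, a block can be searched for at every offset without rehashing the whole window each time.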
File comparison in word processors is typically at the word level, while comparison in most programming tools is at the line level. Byte or character-level comparison is useful in some specialized applications.
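The effect of granularity can be shown by running one comparison routine over the same text split into lines, words, and characters. The sketch below uses <code>difflib.SequenceMatcher</code> from the Python standard library, whose matching heuristic is not a strict longest-common-subsequence algorithm; the helper name <code>changed_units</code> and the sample text are assumptions made for the example.

```python
import difflib


def changed_units(old_units, new_units):
    """Return (tag, old_slice, new_slice) for every differing region."""
    matcher = difflib.SequenceMatcher(a=old_units, b=new_units, autojunk=False)
    return [
        (tag, old_units[i1:i2], new_units[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "The quick brown fox\njumps over the dog\n"
new = "The quick brown fox\njumps over the lazy dog\n"

# Same text, three granularities: lines, whitespace-separated words, characters.
by_line = changed_units(old.splitlines(), new.splitlines())
by_word = changed_units(old.split(), new.split())
by_char = changed_units(old, new)

if __name__ == "__main__":
    print(by_line)
    print(by_word)
    print(by_char)
```

At the line level the whole second line is reported as replaced, while the word-level comparison isolates the single inserted word, which is why word processors tend to prefer the finer unit.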
== Display ==
File comparisons are displayed in two main ways: showing the two files side by side, or showing a single file with markup that indicates the changes from one file to the other. In either case, and particularly in side-by-side views, code folding or text folding may be used to hide unchanged portions of the file so that only the changed portions are shown.{{Clarify|date=January 2012}}
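Both display styles can be approximated in a few lines. In this sketch the single-file view uses <code>difflib.unified_diff</code> from the Python standard library, with its context parameter <code>n</code> standing in for folding of unchanged lines; the side-by-side layout, the helper names <code>single_view</code> and <code>side_by_side</code>, and the column width are assumptions made for illustration.

```python
import difflib


def single_view(old_lines, new_lines, context=1):
    """One stream with -/+ markup; only `context` unchanged lines surround a change."""
    return list(
        difflib.unified_diff(
            old_lines, new_lines,
            fromfile="old", tofile="new",
            n=context, lineterm="",
        )
    )


def side_by_side(old_lines, new_lines, width=12):
    """Two columns; a '|' in the gutter marks rows that differ."""
    rows = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = old_lines[i1:i2]
        right = new_lines[j1:j2]
        marker = " " if tag == "equal" else "|"
        # Pad the shorter side with blanks so both columns stay aligned.
        for k in range(max(len(left), len(right))):
            left_cell = left[k] if k < len(left) else ""
            right_cell = right[k] if k < len(right) else ""
            rows.append(f"{left_cell:<{width}} {marker} {right_cell}")
    return rows


old = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
new = ["alpha", "beta", "gamma", "DELTA", "epsilon", "zeta"]

if __name__ == "__main__":
    print("\n".join(single_view(old, new)))
    print()
    print("\n".join(side_by_side(old, new)))
```

With one line of context, the single-file view omits the unchanged lines far from the edit, which is the same effect that folding achieves interactively in a graphical tool.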
== Reasoning ==
Comparison tools are used for various reasons. Binary files are usually best compared at the byte level, whereas text files and computer programs are usually best compared side by side in a visual display.<ref>{{Cite book |last1=MacKenzie |first1=David |url=https://books.google.com/books?id=oIINAAAACAAJ |title=Comparing and Merging Files with Gnu Diff and Patch |last2=Eggert |first2=Paul |last3=Stallman |first3=Richard |date=2003 |publisher=Network Theory |isbn=978-0-9541617-5-0 |language=en}}</ref> A side-by-side view lets the user decide which file to retain, whether to merge the files into one that contains all the differences,<ref>{{Cite web |title=File comparison software: vc-dwim and vc-chlog |url=http://www.gnu.org/software/vc-dwim/vc-dwim.html |access-date=2023-04-16 |website=www.gnu.org}}</ref> or whether to keep both as they are for later reference under some form of version control.
File comparison is an important, and arguably integral, part of file synchronization and backup. Data corruption is a central concern in backup: it typically occurs without warning and goes unnoticed until it is too late to recover the missing parts. Often a corrupted file is discovered only when it is next opened or used. Short of that, a comparison tool is needed to detect that a difference has arisen. File synchronization and backup programs therefore need to include file comparison if they are to be useful and trusted.<ref>{{Cite web |title=SystemRescue - System Rescue Homepage |url=https://www.system-rescue.org/ |access-date=2023-04-16 |website=www.system-rescue.org}}</ref>
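One way a backup check might detect such silent differences is to compare cryptographic digests of the source and the copy. The sketch below is only an illustration: the choice of SHA-256, the chunk size, the helper names <code>file_digest</code> and <code>same_content</code>, and the throwaway demonstration files are all assumptions, and actual synchronization and backup programs differ in how they compare files.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=65536):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Report whether two files have identical digests."""
    return file_digest(path_a) == file_digest(path_b)


# Demonstration with throwaway files: a faithful copy and a copy
# whose first byte has been altered to simulate corruption.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.bin")
    good_copy = os.path.join(workdir, "good_copy.bin")
    bad_copy = os.path.join(workdir, "bad_copy.bin")
    payload = bytes(range(256)) * 16
    samples = (
        (original, payload),
        (good_copy, payload),
        (bad_copy, b"\xff" + payload[1:]),
    )
    for path, data in samples:
        with open(path, "wb") as handle:
            handle.write(data)
    good_ok = same_content(original, good_copy)
    bad_ok = same_content(original, bad_copy)

if __name__ == "__main__":
    print("faithful copy matches:", good_ok)
    print("altered copy matches:", bad_ok)
```

A digest mismatch shows only that the files differ, not where; locating the difference still requires a byte-level or line-level comparison.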
== Historical uses ==
Before file comparison software, machines existed to compare magnetic tapes or punched cards. The IBM 519 Card Reproducer could determine whether two decks of punched cards were equivalent. In 1957, John Van Gardner developed a system to compare the checksums of loaded sections of Fortran programs to debug compilation problems on the IBM 704.<ref>{{cite web |url=http://www.softwarepreservation.org/projects/FORTRAN/paper/John%20Van%20Gardner%20-%20Fortran%20And%20The%20Genesis%20Of%20Project%20Intercept.pdf |title=Fortran And The Genesis Of Project Intercept |author=John Van Gardner |access-date=2011-12-06}}</ref>
== See also ==
* {{Annotated link |Comparison of file comparison tools}}
* {{Annotated link |Computer-assisted reviewing}}
* {{Annotated link |Data differencing}}
* {{Annotated link |Delta encoding}}
* {{Annotated link |Document comparison}}
* {{Annotated link |Edit distance}}
* {{Annotated link |String metric}}
* Ramseyer Rule - standard format for amending legal text
== References ==
{{Reflist}}
== External links ==
{{commons category|File comparison}}
{{Computer files}} {{Version control software}}
[[Category:File comparison tools]]
[[Category:Data differencing]]
[[Category:Utility software types]]