Truth discovery

{{Short description|Process of choosing the actual true value for a data item}} '''Truth discovery''' (also known as '''truth finding''') is the process of choosing the actual ''true value'' for a data item when different data sources provide conflicting information on it.

Several algorithms have been proposed to tackle this problem, ranging from simple methods like majority voting to more complex ones able to estimate the trustworthiness of data sources.<ref name=":3">{{Cite journal|last1=Li|first1=Yaliang|last2=Gao|first2=Jing|author2-link=Jing Gao (computer scientist)|last3=Meng|first3=Chuishi|last4=Li|first4=Qi|last5=Su|first5=Lu|last6=Zhao|first6=Bo|last7=Fan|first7=Wei|last8=Han|first8=Jiawei|date=2016-02-25|title=A Survey on Truth Discovery|journal=ACM SIGKDD Explorations Newsletter|language=en|volume=17|issue=2|pages=1–16|doi=10.1145/2897350.2897352|s2cid=9060471 }}</ref>

Truth discovery problems can be divided into two sub-classes: single-truth and multi-truth. In the first case only one true value is allowed for a data item (e.g birthday of a person, capital city of a country). While in the second case multiple true values are allowed (e.g. cast of a movie, authors of a book).<ref name=":0">{{Cite book|last1=Wang|first1=Xianzhi|last2=Sheng|first2=Quan Z.|last3=Fang|first3=Xiu Susie|last4=Yao|first4=Lina|last5=Xu|first5=Xiaofei|last6=Li|first6=Xue|title=Proceedings of the 24th ACM International on Conference on Information and Knowledge Management |chapter=An Integrated Bayesian Approach for Effective Multi-Truth Discovery |date=2015|chapter-url=http://dl.acm.org/citation.cfm?doid=2806416.2806443|language=en|location=Melbourne, Australia|publisher=ACM Press|pages=493–502|doi=10.1145/2806416.2806443|isbn=9781450337946|hdl=2440/110033|s2cid=16207808 |hdl-access=free}}</ref><ref name=":4">{{Cite journal|last1=Lin|first1=Xueling|last2=Chen|first2=Lei|title=Domain-aware multi-truth discovery from conflicting sources |journal=Proceedings of the VLDB Endowment |date=2018|volume=11|issue=5|pages=635–647|doi=10.1145/3187009.3177739}}</ref>

Typically, truth discovery is the last step of a data integration pipeline, when the schemas of different data sources have been unified and the records referring to the same data item have been detected.<ref name=":1">{{Cite journal|last1=Dong|first1=Xin Luna|author1-link=Xin Luna Dong|last2=Srivastava|first2=Divesh|date=2015-02-15|title=Big Data Integration|journal=Synthesis Lectures on Data Management|language=en|volume=7|issue=1|pages=1–198|doi=10.2200/S00578ED1V01Y201404DTM040|issn=2153-5418|doi-access=free}}</ref>

== General principles == The abundance of data available on the web makes more and more probable to find that different sources provide (partially or completely) different values for the same data item. This, together with the fact that we are increasing our reliance on data to derive important decisions, motivates the need of developing good truth discovery algorithms.<ref name=":2">{{Cite journal|last1=Li|first1=Xian|last2=Dong|first2=Xin Luna|author2-link=Xin Luna Dong|last3=Lyons|first3=Kenneth|last4=Meng|first4=Weiyi|last5=Srivastava|first5=Divesh|date=2012-12-01|title=Truth finding on the deep web: is the problem solved?|journal=Proceedings of the VLDB Endowment|language=en|volume=6|issue=2|pages=97–108|doi=10.14778/2535568.2448943|arxiv=1503.00303|s2cid=3133027 }}</ref>

Many currently available methods rely on a voting strategy to define the true value of a data item. Nevertheless, recent studies, have shown that, if we rely only on majority voting, we could get wrong results even in 30% of the data items.<ref name=":2" />

The solution to this problem is to assess the trustworthiness of the sources and give more importance to votes coming from trusted sources.<ref name=":1" /><ref name=":2" />

Ideally, supervised learning techniques could be exploited to assign a reliability score to sources after hand-crafted labeling of the provided values; unfortunately, this is not feasible since the number of needed labeled examples should be proportional to the number of sources, and in many applications the number of sources can be prohibitive.<ref name=":0" /><ref>{{Cite journal|last1=Ng|first1=Andrew Y|last2=Jordan|first2=Michael I.|date=2001|title=On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes|url=http://dl.acm.org/citation.cfm?id=2980539.2980648|journal=Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic|pages=841–848}}</ref>

== Single-truth vs multi-truth discovery == Single-truth and multi-truth discovery are two very different problems.<ref name=":0" />

Single-truth discovery is characterized by the following properties:

* only one true value is allowed for each data item; * different values provided for a given data item oppose to each other; * values and sources can either be correct or erroneous.

While in the multi-truth case the following properties hold:

* the truth is composed by a set of values; * different values could provide a partial truth; * claiming one value for a given data item does not imply opposing to all the other values; * the number of true values for each data item is not known ''a priori.''

Multi-truth discovery has unique features that make the problem more complex and should be taken into consideration when developing truth-discovery solutions.<ref name=":0" />

The examples below point out the main differences of the two methods. Knowing that in both examples the truth is provided by source 1, in the single truth case (first table) we can say that sources 2 and 3 oppose to the truth and as a result provide wrong values. On the other hand, in the second case (second table), sources 2 and 3 are neither correct nor erroneous, they instead provide a subset of the true values and at the same time they do not oppose the truth. {| class="wikitable" |+When was George Washington born? !Source !Name !Birthdate ! |- |S1 |George Washington |1732-02-22 |Correct |- |S2 |George Washington |1738-09-17 |Erroneous |- |S3 |George Washington |1734-10-23 |Erroneous |} {| class="wikitable" |+Who wrote “The nature of space and time”? !Source !Title !Authors ! |- |S1 |The nature of space and time |Stephen Hawking, Roger Penrose |Correct |- |S2 |The nature of space and time |Stephen Hawking |Partial truth |- |S3 |The nature of space and time |Roger Penrose |Partial truth |- |S4 |The nature of space and time |J. K. Rowling |Erroneous |}

== Source trustworthiness == The vast majority of truth discovery methods are based on a voting approach: each source votes for a value of a certain data item and, at the end, the value with the highest vote is select as the true one. In the more sophisticated methods, votes do not have the same weight for all the data sources, more importance is indeed given to votes coming from trusted sources.<ref name=":2" />

Source trustworthiness usually is not known ''a'' ''priori'' but estimated with an iterative approach. At each step of the truth discovery algorithm the trustworthiness score of each data source is refined, improving the assessment of the true values that in turn leads to a better estimation of the trustworthiness of the sources. This process usually ends when all the values reach a convergence state.<ref name=":2" />

Source trustworthiness can be based on different metrics, such as accuracy of provided values, copying values from other sources and domain coverage.<ref name=":3" />

Detecting copying behaviors is very important, in fact, copy allows to spread false values easily making truth discovery very hard, since many sources would vote for the wrong values. Usually systems decrease the weight of votes associated to copied values or even don’t count them at all.<ref name=":5">{{Cite journal|last1=Dong|first1=Xin Luna|author1-link=Xin Luna Dong|last2=Berti-Equille|first2=Laure|last3=Srivastava|first3=Divesh|date=2009-08-01|title=Integrating conflicting data: the role of source dependence|journal=Proceedings of the VLDB Endowment|language=en|volume=2|issue=1|pages=550–561|doi=10.14778/1687627.1687690|s2cid=9664056 |url=https://inria.hal.science/hal-01855870 }}</ref>

== Single-truth methods == Most of the currently available truth discovery methods have been designed to work well only in the single-truth case.<ref name=":3" /><ref name=":4" />

Below are reported some of the characteristics of the most relevant typologies of single-truth methods and how different systems model source trustworthiness.<ref name=":2" />

=== Majority voting === Majority voting is the simplest method, the most popular value is selected as the true one. Majority voting is commonly used as a baseline when assessing the performances of more complex methods.

=== Web-link based === These methods estimate source trustworthiness exploiting a similar technique to the one used to measure authority of web pages based on web links. The vote assigned to a value is computed as the sum of the trustworthiness of the sources that provide that particular value, while the trustworthiness of a source is computed as the sum of the votes assigned to the values that the source provides.<ref name=":2" /><ref>{{Cite journal|last=Kleinberg|first=Jon M.|date=1999-09-01|title=Authoritative sources in a hyperlinked environment|journal=Journal of the ACM|volume=46|issue=5|pages=604–632|doi=10.1145/324133.324140|s2cid=221584113 |doi-access=free}}</ref>

=== Information-retrieval based === These methods estimate source trustworthiness using similarity measures typically used in information retrieval. Source trustworthiness is computed as the cosine similarity (or other similarity measures) between the set of values provided by the source and the set of values considered true (either selected in a probabilistic way or obtained from a ground truth).<ref name=":2" /><ref>{{Cite book|last1=Galland|first1=Alban|last2=Abiteboul|first2=Serge|last3=Marian|first3=Amélie|last4=Senellart|first4=Pierre|title=Proceedings of the third ACM international conference on Web search and data mining |chapter=Corroborating information from disagreeing views |date=2010|chapter-url=http://portal.acm.org/citation.cfm?doid=1718487.1718504|language=en|location=New York, New York, USA|publisher=ACM Press|pages=131–140|doi=10.1145/1718487.1718504|isbn=9781605588896|s2cid=1761360 |url=https://hal.inria.fr/inria-00429546/file/document.pdf }}</ref>

=== Bayesian based === These methods use Bayesian inference to define the probability of a value being true conditioned on the values provided by all the sources.

where <math>\textstyle v</math> is a value provided for a data item <math>\textstyle o</math> and <math>\textstyle \psi(o)</math> is the set of the observed values provided by all the sources for that specific data item.

The trustworthiness of a source is then computed based on the accuracy of the values that provides.<ref name=":5" /><ref>{{Cite journal|last1=Xiaoxin Yin|last2=Jiawei Han|last3=Yu|first3=P.S.|date=2008|title=Truth Discovery with Multiple Conflicting Information Providers on the Web|journal=IEEE Transactions on Knowledge and Data Engineering|volume=20|issue=6|pages=796–808|doi=10.1109/TKDE.2007.190745|bibcode=2008ITKDE..20..796Y |issn=1041-4347}}</ref> Other more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness.<ref name=":5" />

== Multi-truth methods == Due to its complexity, less attention has been devoted to the study of the multi-truth discovery<ref name=":0" /><ref name=":4" />

Below are reported two typologies of multi-truth methods and their characteristics.

=== Bayesian based === These methods use Bayesian inference to define the probability of a group of values being true conditioned on the values provided by all the data sources. In this case, since there could be multiple true values for each data item, and sources can provide multiple values for a single data item, it is not possible to consider values individually. An alternative is to consider mappings and relations between set of provided values and sources providing them. The trustworthiness of a source is then computed based on the accuracy of the values that provides.<ref name=":0" />

More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness.<ref name=":0" /><ref name=":4" />

=== Probabilistic Graphical Models based === These methods use probabilistic graphical models to automatically define the set of true values of given data item and also to assess source quality without need of any supervision.<ref>{{Cite journal|last1=Zhao|first1=Bo|last2=Rubinstein|first2=Benjamin I. P.|last3=Gemmell|first3=Jim|last4=Han|first4=Jiawei|date=2012-02-01|title=A Bayesian approach to discovering truth from conflicting sources for data integration|journal=Proceedings of the VLDB Endowment|language=en|volume=5|issue=6|pages=550–561|doi=10.14778/2168651.2168656|arxiv=1203.0058|s2cid=8837716 }}</ref>

== Applications == Many real-world applications can benefit from the use of truth discovery algorithms. Typical domains of application include: healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction and knowledge base construction.<ref name=":3" />

Truth discovery algorithms could be also used to revolutionize the way in which web pages are ranked in search engines, going from current methods based on link analysis like PageRank, to procedures that rank web pages based on the accuracy of the information they provide.<ref>{{Cite web|url=https://www.washingtonpost.com/news/energy-environment/wp/2015/03/11/the-huge-implications-of-googles-idea-to-rank-sites-based-on-their-accuracy/|title=The huge implications of Google's idea to rank sites based on their accuracy|date=2015|website=www.washingtonpost.com}}</ref>

== See also ==

* Data Integration * Information Integration * Data Fusion (data integration) * Data Quality

== References == <references />

Category:Databases