{{Short description|Function that counts distinct factors of a string}} In computer science, the '''complexity function''' of a ''word'' or ''string'' (a finite or infinite sequence of symbols from some alphabet) is the function that counts the number of distinct ''factors'' (substrings of consecutive symbols) of that string. More generally, the complexity function of a formal language (a set of finite strings) counts the number of distinct words of given length.
==Complexity function of a word== Let ''u'' be a (possibly infinite) sequence of symbols from an alphabet. Define the function ''p''<sub>''u''</sub>(''n'') of a positive integer ''n'' to be the number of different factors (consecutive substrings) of length ''n'' from the string ''u''.<ref name=L7>Lothaire (2011) p.7</ref><ref name=L46/><ref name=PF3>Pytheas Fogg (2002) p.3</ref><ref name=BLRS82>Berstel et al (2009) p.82</ref><ref name=AS298>Allouche & Shallit (2003) p.298</ref>
For a string ''u'' of length at least ''n'' over an alphabet of size ''k'' we clearly have :<math> 1 \le p_u(n) \le k^n \ , </math> the bounds being achieved by the constant word and a disjunctive word,<ref name=Bug91>Bugeaud (2012) p.91</ref> for example, the Champernowne word respectively.<ref name=CN165>Cassaigne & Nicolas (2010) p.165</ref> For infinite words ''u'', we have ''p''<sub>''u''</sub>(''n'') bounded if ''u'' is ultimately periodic (a finite, possibly empty, sequence followed by a finite cycle). Conversely, if ''p''<sub>''u''</sub>(''n'') ≤ ''n'' for some ''n'', then ''u'' is ultimately periodic.<ref name=PF3/><ref name=AS302>Allouche & Shallit (2003) p.302</ref>
An '''aperiodic sequence''' is one which is not ultimately periodic. An aperiodic sequence has strictly increasing complexity function (this is the '''Morse–Hedlund theorem'''),<ref name=BR166/><ref name=CN166>Cassaigne & Nicolas (2010) p.166</ref> so ''p''(''n'') is at least ''n''+1.<ref name=L22>Lothaire (2011) p.22</ref>
A set ''S'' of finite binary words is ''balanced'' if for each ''n'' the subset ''S''<sub>''n''</sub> of words of length ''n'' has the property that the Hamming weight of the words in ''S''<sub>''n''</sub> takes at most two distinct values. A '''balanced sequence''' is one for which the set of factors is balanced.<ref name=AS313>Allouche & Shallit (2003) p.313</ref> A balanced sequence has complexity function at most ''n''+1.<ref name=L48>Lothaire (2011) p.48</ref>
A Sturmian word over a binary alphabet is one with complexity function ''n'' + 1.<ref name=PF6>Pytheas Fogg (2002) p.6</ref> A sequence is Sturmian if and only if it is balanced and aperiodic.<ref name=L46>Lothaire (2011) p.46</ref><ref name=AS318>Allouche & Shallit (2003) p.318</ref> An example is the Fibonacci word.<ref name=PF6/><ref name=deluca1995>{{cite journal | journal=Information Processing Letters | year=1995 | pages=307–312 | doi=10.1016/0020-0190(95)00067-M | title=A division property of the Fibonacci word | first=Aldo | last=de Luca | volume=54 | issue=6 }}</ref> More generally, a Sturmian word over an alphabet of size ''k'' is one with complexity ''n''+''k''−1. An Arnoux-Rauzy word over a ternary alphabet has complexity 2''n'' + 1:<ref name=PF6/> an example is the Tribonacci word.<ref name=PF368>Pytheas Fogg (2002) p.368</ref>
For recurrent words, those in which each factor appears infinitely often, the complexity function almost characterises the set of factors: if ''s'' is a recurrent word with the same complexity function as ''t'' are then ''s'' has the same set of factors as ''t'' or δ''t'' where δ denotes the letter doubling morphism ''a'' → ''aa''.<ref name=BLRS84>Berstel et al (2009) p.84</ref>
==Complexity function of a language== Let ''L'' be a language over an alphabet and define the function ''p''<sub>''L''</sub>(''n'') of a positive integer ''n'' to be the number of different words of length ''n'' in ''L''<ref name=BR166>Berthé & Rigo (2010) p.166</ref> The complexity function of a word is thus the complexity function of the language consisting of the factors of that word.
The complexity function of a language is less constrained than that of a word. For example, it may be bounded but not eventually constant: the complexity function of the regular language <math>a(bb)^*a</math> takes values 3 and 4 on odd and even ''n''≥2 respectively. There is an analogue of the Morse–Hedlund theorem: if the complexity of ''L'' satisfies ''p''<sub>''L''</sub>(''n'') ≤ ''n'' for some ''n'', then ''p''<sub>''L''</sub> is bounded and there is a finite language ''F'' such that<ref name=BR166/>
:<math>L \subseteq \{ x y^k z : x,y,z \in F,\ k \in \mathbb{N} \} \ . </math>
A '''polynomial''' or '''sparse language''' is one for which the complexity function ''p''(''n'') is bounded by a fixed power of ''n''. A regular language which is not polynomial is ''exponential'': there are infinitely many ''n'' for which ''p''(''n'') is greater than ''k''<sup>''n''</sup> for some fixed ''k'' > 1.<ref name=BR136>Berthé & Rigo (2010) p.136</ref>
==Related concepts== {{see also|Linguistic sequence complexity}} The ''topological entropy'' of an infinite sequence ''u'' is defined by
:<math>H_{\mathrm{top}}(u) = \lim_{n \rightarrow \infty} \frac{\log p_u(n)}{n \log k} \ . </math>
The limit exists as the logarithm of the complexity function is subadditive.<ref name=PF4>Pytheas Fogg (2002) p.4</ref><ref name=AS303>Allouche & Shallit (2003) p.303</ref> Every real number between 0 and 1 occurs as the topological entropy of some sequence is applicable,<ref name=CN169>Cassaigne & Nicolas (2010) p.169</ref> which may be taken to be uniformly recurrent<ref name=BR391>Berthé & Rigo (2010) p.391</ref> or even uniquely ergodic.<ref name=BR169>Berthé & Rigo (2010) p.169</ref>
For ''x'' a real number and ''b'' an integer ≥ 2 then the complexity function of ''x'' in base ''b'' is the complexity function ''p''(''x'',''b'',''n'') of the sequence of digits of ''x'' written in base ''b''. If ''x'' is an irrational number then ''p''(''x'',''b'',''n'') ≥ ''n''+1; if ''x'' is rational then ''p''(''x'',''b'',''n'') ≤ ''C'' for some constant ''C'' depending on ''x'' and ''b''.<ref name=Bug91>Bugeaud (2012) p.91</ref> It is conjectured that for algebraic irrational ''x'' the complexity is ''b''<sup>''n''</sup> (which would follow if all such numbers were normal) but all that is known in this case is that ''p'' grows faster than any linear function of ''n''.<ref name=BR414>Berthé & Rigo (2010) p.414</ref>
The '''abelian complexity function''' ''p''<sup>ab</sup>(''n'') similarly counts the number of occurrences of distinct factors of given length ''n'', where now we identify factors that differ only by a permutation of the positions. Clearly ''p''<sup>ab</sup>(''n'') ≤ ''p''(''n''). The abelian complexity of a Sturmian sequence satisfies ''p''<sup>ab</sup>(''n'') = 2.<ref>{{cite book | chapter=On the Asymptotic Abelian Complexity of Morphic Words | title=Developments in Language Theory. Proceedings, 17th International Conference, DLT 2013, Marne-la-Vallée, France, June 18-21, 2013 | pages=94–105 | year=2013 | doi=10.1007/978-3-642-38771-5_10 | isbn=978-3-642-38770-8 | series=Lecture Notes in Computer Science | volume=7907 | issn=0302-9743 | publisher=Springer-Verlag | location=Berlin, Heidelberg | editor1-first=Marie-Pierre | editor1-last=Béal | editor2-first=Olivier | editor2-last=Carton | author1-first=Francine | author1-last=Blanchet-Sadri | author2-first=Nathan | author2-last=Fox }}</ref>
==References== {{reflist}} *{{cite book | last1 = Allouche | first1 = Jean-Paul | last2 = Shallit | first2 = Jeffrey | author2-link = Jeffrey Shallit | isbn = 978-0-521-82332-6 | publisher = Cambridge University Press | title = Automatic Sequences: Theory, Applications, Generalizations | year = 2003 | zbl=1086.11015 }} * {{cite book | last1=Berstel | first1=Jean | last2=Lauve | first2=Aaron | last3=Reutenauer | first3=Christophe | last4=Saliola | first4=Franco V. | title=Combinatorics on words. Christoffel words and repetitions in words | series=CRM Monograph Series | volume=27 | location=Providence, RI | publisher=American Mathematical Society | year=2009 | isbn=978-0-8218-4480-9 | url=https://www.ams.org/bookpages/crmm-27 | zbl=1161.68043 }} * {{cite book | editor1-last=Berthé | editor1-first=Valérie|editor1-link= Valérie Berthé | editor2-last=Rigo | editor2-first=Michel | title=Combinatorics, automata, and number theory | series=Encyclopedia of Mathematics and its Applications | volume=135 | location=Cambridge | publisher=Cambridge University Press | year=2010 | isbn=978-0-521-51597-9 | zbl=1197.68006 }} *{{cite book | last=Bugeaud | first=Yann | title=Distribution modulo one and Diophantine approximation | series=Cambridge Tracts in Mathematics | volume=193 | location=Cambridge | publisher=Cambridge University Press | year=2012 | isbn=978-0-521-11169-0 | zbl=1260.11001}} * {{cite book | last1=Cassaigne | first1=Julien | last2=Nicolas | first2=François | chapter=Factor complexity | pages=163–247 | editor1-last=Berthé | editor1-first=Valérie|editor1-link= Valérie Berthé | editor2-last=Rigo | editor2-first=Michel | title=Combinatorics, automata, and number theory | series=Encyclopedia of Mathematics and its Applications | volume=135 | location=Cambridge | publisher=Cambridge University Press | year=2010 | isbn=978-0-521-51597-9 | zbl=1216.68204 }} * {{cite book | last=Lothaire | first=M. | authorlink=M. Lothaire | title=Algebraic combinatorics on words | others=With preface by Jean Berstel and Dominique Perrin | edition=Reprint of the 2002 hardback | series=Encyclopedia of Mathematics and Its Applications | volume=90| publisher=Cambridge University Press | year=2011 | isbn=978-0-521-18071-9 | zbl=1221.68183 }} * {{cite book | last=Pytheas Fogg | first=N. | editor1=Berthé, Valérie|editor1-link=Valérie Berthé|editor2=Ferenczi, Sébastien|editor3=Mauduit, Christian|editor4=Siegel, A. | title=Substitutions in dynamics, arithmetics and combinatorics | series=Lecture Notes in Mathematics | volume=1794 | location=Berlin | publisher=Springer-Verlag | year=2002 | isbn=3-540-44141-7 | zbl=1014.11015 }}
Category:Theoretical computer science