.fp 4 SC .nr Pt 1 .EQ delim $$ gsize 12 define scrC % font 4 size +2 C % .EN .am DE .S 12 18 .sp .. .nr W 6.5i .ll 6.5i .ds HP 12 12 12 12 12 .ds HF 3 3 3 3 3 .S 12 .ls 1 .ce 99 .B ENUMERATION OF STRINGS .R .sp .I A. M. Odlyzko .R AT&T Bell Laboratories Murray Hill, New Jersey 07974 USA .sp 1.5 .I ABSTRACT .R .ce 0 .sp .S 12 18 .P A survey is presented of some methods and results on counting words that satisfy various restrictions on subwords (i.e., blocks of consecutive symbols). Various applications to comma-free codes, games, pattern matching, and other subjects are indicated. The emphasis is on the unified treatment of those topics through the use of generating functions. .S 12 18 .H 1 "Introduction" Enumeration of words that satisfy restrictions on their subwords is a very satisfying subject to deal with. Not only are there many applications for results in this area, but in addition one can use a wide range of mathematical techniques which lead to a variety of elegant results. These methods are usually not applicable when instead of subwords (i.e., blocks of consecutive symbols) one studies general subsequences. .P We will consider strings (words and patterns will be used synonymously) over a fixed finite alphabet $OMEGA$ of size $q ~>=~ 2$. The basic results, from which most of the others are derived, enumerate strings of a given length that contain no elements of some fixed set of strings as subwords. The enumeration does not yield simple explicit formulas for the number of such strings, as no such formulas exist. Instead, formulas are obtained for the generating functions of these quantities. These generating functions are rational, and their special form leads to numerous results, both in asymptotic enumeration and in more algebraic and combinatorial applications. .P In Section 2 we briefly review how rational generating functions can be used in both asymptotic enumeration and other settings. In Section 3, the important notion of a \f2correlation\f1 of two strings is defined. Using that notion, Section 4 then presents the main formulas for the generating functions that arise in the enumeration of strings. Section 5 discusses some applications of these generating functions to the study of worst-case behavior of pattern matching algorithms, recognizing unavoidable sets of words, and nontransitive games. Section 6 discusses some of the main applications of these generating functions to asymptotic enumeration, especially to the study of prefix-synchronized codes and occurrences of long repetitive patterns in strings. .P This paper is based largely on the author's joint work with L. J. Guibas [11-15] and B. Richmond [21]. Only a small selection of results are presented, and they are chosen so as to emphasize how generating functions can be used as a unifying theme in a variety of contexts. Many of those results can also be obtained by other methods. .H 1 "Rational generating functions" If $f(0) , ~f(1) ^,...$ is any sequence, we define its generating function $F(z)$ to be .DS 3 .EQ (2.1) F(z) ~=~ sum from n=0 to inf ~f(n) z sup -n ^, .EN .DE where $z$ is an indeterminate. We use the form (2.1), instead of the more common form $F(z sup -1 )$, which is a sum of terms of the form $f(n) z sup n$, since (2.1) yields somewhat more elegant expressions for the generating functions arising in string enumeration. .P In essentially all of the cases we will be concerned with, the generating function $F(z)$ will be rational, so that .DS 3 .EQ (2.2) F(z) ~=~ {P(z)} over {Q(z)} ~, .EN .DE where $P(z)$ and $Q(z)$ are polynomials, which we may assume satisfy gcd$(P(z), ~Q(z) ) ~=~ 1$, $Q(z) ~!=~ 0$. Several conclusions follow when $F(z)$ is rational. First of all, the $f(n)$ satisfy a linear recurrence with constant coefficients; if we write .DS 3 .EQ (2.3) Q(z) ~=~ sum from j=0 to d ~a sub j ~z sup q-j ^, .EN .DE where $d$ is the degree of $Q (z)$, so that $a sub 0 ~!=~ 0$, then (2.1) and (2.2) show that .DS 3 .EQ (2.4) f(n) ~=~ - a sub 0 sup -1 ~sum from j=1 to d ~a sub j f(n-j) ~~~roman for ~~ n ~>~ d ~. .EN .DE This linear recurrence can already be used to prove nontrivial results. For example, when coupled with some information about the coefficients $a sub j$, it can be used to obtain lower bounds on the worst case performance of string searching algorithms (see Section 5). .P Perhaps the most important consequence of the rationality of a generating function is that it yields significant information about the asymptotic behavior of the sequence being enumerated. If $F(z)$ has the form (2.2), and in addition $Q(z)$ has the form .DS 3 .EQ (2.4) Q(z) ~=~ a sub 0 ~prod from j=1 to d ~(z- rho sub j ) ^, .EN .DE where the $rho sub j$ are distinct, then the partial fraction expansion of $F(z)$ is .DS 3 .EQ (2.5) F(z) ~=~ c sub 0 ~+~ sum from j=1 to d ~{c sub j} over {z- rho sub j} ^, .EN .DE where $c sub 0$ is some constant (there is no polynomial part to this expansion, since deg $P(z) ~<=$ deg $Q(z)$ by (2.1)), and .DS 3 .EQ (2.6) c sub j ~=~ {P( rho sub j )} over {Q prime ( rho sub j )} ~,~~~ 1 <= j <= d ~. .EN .DE If the $rho sub j$ are not all distinct, then (2.5) has to be replaced by a slightly more complicated expression, but we will not need to use it. The partial fraction expansion (2.5) shows that .DS 3 .EQ F(z) ~mark =~ c sub 0 ~+~ sum from j=1 to d ~{c sub j z sup -1} over {1- rho sub j z sup -1} .EN .sp .EQ (2.7) ~lineup =~ c sub 0 ~+~ sum from j=1 to d ~c sub j z sup -1 ~sum from n=0 to inf ~rho sub j sup n z sup -n .EN .sp .EQ ~lineup =~ c sub 0 ~+~ sum from n=1 to inf ~z sup -n ~sum from j=1 to d ~c sub j rho sub j sup n-1 ~. .EN .DE This shows that .DS 3 .EQ (2.8) f(n) ~=~ sum from j=1 to d ~c sub j rho sub j sup n-1 ~~,~~ n >= 1 ~. .EN .DE In particular, the expansion (2.8) shows that asymptotically, the growth rate of $f(n)$ is determined by the largest of the $rho sub j$; .DS 3 .EQ (2.9) | f(n) | ~<=~ c ~max from {1 <= j <= d} ~ | rho sub j | sup n .EN .DE for some constant $c$. For finer details of the asymptotic behavior of $f(n)$, Eq.\ (2.8) by itself is often insufficient. The large zeros (those for which $| rho sub j |$ is maximal) can combine so as to make $f(n)$ small for some values of $n$, as happens, for example, for $F(z) ~=~ z sup 2 / (z sup 2 -1)$, where $f(n) ~=~ 0$ if $n$ is odd, and $f(n) ~=~ 1$ if $n$ is even. Some very sophisticated methods have been developed to deal with problems of this nature, as can be seen in [18] and the references cited there. For our purposes, though, such methods are not suitable. .P Another source of difficulty with the use of expansions of the form (2.8) is that even when there is only one $rho sub j$ with maximal absolute value, the other $rho sub j$ can be important for small $n$. To obtain uniform estimates from (2.8), it becomes necessary to estimate the sizes of the $c sub j$ as well as of the $rho sub j$. .P Fortunately for us, in string enumeration we are able to use a simple approach that lets us bypass the difficulties that arise in using the full partial fraction expansion. We have the following well-known general result. .sp \f2Lemma 2.1\f1. .I Suppose that $F(z)$ is analytic in $| z | ~>=~ r ~>~ 0$ with the exception of $z ~=~ rho , ~ | rho | ~>~ r$, where it has a first order pole with residue $alpha$, and that .DS 3 .EQ (2.10) | F(z) | ~<=~ C ~~italic "for"~~ | z | ~=~ r ~. .EN .DE Then, if .DS 3 .EQ F(z) ~=~ sum from n=0 to inf ~ f sub n ~z sup -n ~, .EN .DE we have .DS 3 .EQ (2.11) | f sub n ~-~ alpha rho sup n-1 | ~<=~ r sup n (C + | alpha | ( | rho | - r ) sup -1 ) ~~italic "for" ~~ n >= 1 ~. .EN .DE .R \f2Proof\f1. We use Cauchy's theorem. Since .DS 3 .EQ F(z) ~-~ alpha over {z- rho} ~=~ f sub 0 ~+~ ~sum from n=1 to inf ~( f sub n ~-~ alpha rho sup n-1 ) z sup -n .EN .DE and is analytic in $| z | ~>=~ r$, we have for $n >= 1$ .DS 3 .EQ f sub n ~-~ alpha rho sup n-1 ~=~ 1 over {2 pi i} ~int from {| z | =r} ~ z sup n-1 left { F(z) ~-~ alpha over {z- rho} right } ~dz, .EN .DE and so .DS 3 .EQ | f sub n ~-~ alpha rho sup n-1 | ~mark <=~ r sup n ~max from {| z | =r} ~ | F(z) ~-~ alpha over {z- rho} | .EN .sp .EQ ~lineup <=~ r sup n ~left ( C ~+~ {| alpha |} over {| rho | -r} right ) ^, .EN .DE which was to be proved. \(sq .P In the applications to string enumeration, the rational functions we will deal with satisfy the conditions of the lemma above. They have a single pole (which is real) and no other singularities nearby. Lemma 2.1 enables us to obtain very sharp asymptotic estimates without having to worry about existence of higher order poles or sizes of the residues other than at the dominant singularity. Thus the basic method of asymptotic estimation is very simple, especially when compared to some of the other asymptotic enumeration methods that are used in other situations (cf. [4,20]). The main difficulty is to obtain the generating functions for the quantities that are of interest and show that they satisfy the conditions of Lemma 2.1. As the following sections show, this is possible in a surprising variety of cases, but not always. .H 1 "Correlations of strings" In order to describe the generating functions that arise in string enumeration, we introduce notation (from [13]) that expresses the way that two strings can overlap each other. If $X$ and $Y$ are two words, possibly of different lengths, over some alphabet, the \f2correlation\f1 of $X$ and $Y$, denoted by $XY$, is a binary string of the same length as $X$. The $i$-th bit (from the left) of $XY$ is determined by placing $Y$ under $X$ so that the leftmost character of $Y$ is under the $i$-th character (from the left) of $X$, and checking whether the characters in the overlapping segments of $X$ and $Y$ are identical. If they are identical, the $i$-th bit of $XY$ is set equal to 1, and otherwise it is set equal to 0. For example, if $X ~=~ cabcabc$ and $Y ~=~ abcabcde$, then $XY ~=~ 0100100$, as depicted below: .DS .TS center; c1 c1 c1 c1 c1 c1 c1 c1 c1 c1 c1 c1 c1 c1 c4 n. X: c a b c a b c Y: a b c a b c d e 0 a b c a b c d e 1 a b c a b c d e 0 a b c a b c d e 0 a b c a b c d e 1 a b c a b c d e 0 a b c a b c d e 0 .TE .DE Note that in general $YX ~!=~ XY$ (they can even be of different lengths), and in the example above $YX ~=~ 00000000$. The \f2autocorrelation\f1 of a word $X$ is just $XX$, the correlation of $X$ with itself. In the example above, $XX ~=~ 1001001$. .P It is often convenient to interpret a correlation $XY$ as a number in some base $t$, or else as a polynomial in the variable $t$, with the most significant bits being the ones on the left. Thus in the example above, .DS 3 .EQ XY sub t ~=~ t sup 5 ~+~ t sup 2 , ~~~XY sub 2 ~=~ 36 ~. .EN .DE .P String correlations are important because they provide very compact descriptions of how two words can overlap. They are also fascinating subjects of study in their own right, and characterizations of strings that are correlations and enumeration of strings that have a given correlation are contained in [14] (see also [10]). .H 1 "Basic generating functions" Now that correlations of strings have been defined, we can present the basic generating function results. Suppose that $A$ is a word over some finite alphabet $OMEGA$ of $q ~>=~ 2$ letters, and let $f(n)$ be the number of strings of length $n$ over that alphabet that contain no occurrences of $A$ as a subword. We set $f(0) ~=~ 1$, and define, as in (2.1), .DS 3 .EQ (4.1) F(z) ~=~ sum from n=0 to inf ~f(n) z sup -n ~. .EN .DE Then the basic result is that .DS 3 .EQ (4.2) F(z) ~=~ {z ~A A sub z} over {1+ (z-q) A A sub z} ~. .EN .DE This elegant formula, which apparently was first published in a somewhat different form by Solov'ev [26], is quite easy to prove, as we will now demonstrate. First define $f sub A (n)$ to be the number of strings over $OMEGA$ of length $n$ that end with $A$ (at the right end) but contain no other occurrence of $A$ as a subword anyplace else. Set .DS 3 .EQ (4.3) F sub A (z) ~=~ sum from n=0 to inf ~f sub A (n) z sup -n ~. .EN .DE .P To prove the formula (4.2) for $F(z)$, we will show that .DS 3 .EQ (4.4a) (z-q) ~F(z) ~mark +~ z F sub A (z) ~=~ z ^, .EN .sp .EQ (4.4b) F(z) ~lineup =~ z A A sub z F sub A (z) ^, .EN .DE from which (4.2) and .DS 3 .EQ (4.5) F sub A (z) ~=~ 1 over {1+ (z-q) A A sub z} .EN .DE follow immediately. To prove (4.4a), consider any of the $f(n)$ strings of length $n$ that do not contain $A$ as a subword, and adjoin to it (at the right end) any one of the letters from $OMEGA$. In this way we obtain $qf(n)$ distinct strings of length $n+1$. Any one of them either contains an appearance of $A$ or not, but if it does, then that appearance has to be in the rightmost position. Therefore we obtain .DS 3 .EQ (4.6) qf(n) ~=~ f(n+1) ~+~ f sub A (n+1) ^, .EN .DE and this is valid for all $n ~>=~ 0$. Multiplying (4.6) by $z sup -n$ and summing on $n ~>=~ 0$, we easily obtain (4.4a). .P To obtain (4.4b), consider any one of the $f(n)$ strings of length $n$ that do not contain $A$, and append $A$ at its right end. Suppose that the first (from the left) appearance of $A$ in this string has its rightmost character in position $n+r$. Then $0 ~<~ r ~<=~ | A |$. Furthermore, since $A$ also appears in the last $| A |$ positions, the prefix of $A$ of length $r$ has to be equal to the suffix of $A$ of length $r$, and so the $r$-th entry in $AA$ (counted from the right) is 1, which we write as $r ~epsilon~ AA$. The string of length $n+r$ is then counted by $f sub A (n+r)$. Conversely, given any string counted by $f sub A (n+r)$, $r ~epsilon~ AA$, there is a unique way to extend it to a string of length $n + | A |$ in which the last $| A |$ characters equal $A$. Therefore for $n ~>=~ 0$ we have .DS 3 .EQ (4.7) f(n) ~=~ sum from {r epsilon AA} ~f sub A (n+r) ~. .EN .DE Multiplying (4.7) by $z sup -n$ and summing on $n$ yields (4.4b). .P The main result about string enumeration is proved by methods that are only slightly more elaborate than those shown above, and so we will only state it. First we introduce the notion of a \f2reduced\f1 set of words. The set $"{" A,B ^,...,^ T "}"$ is called reduced if no word in this set contains any other as a subword. Since we are primarily interested in strings that contain none of a set $"{" A ^,...,^ T "}"$ of words, we can reduce to the case of a reduced set of words by eliminating from that set any words that contain any others as subwords, since if $G$ is a subword of $Q$, say, strings not containing $G$ also fail to contain $Q$. Given a reduced set $"{" A,B ^,...,^ T "}"$ of words, we let $f(n)$ denote the number of strings of length $n$ over the alphabet $OMEGA$ that contain none of $A,B ^,...,^ T$, with $f(0) ~=~ 1$. We also let $f sub H (n)$ denote the number of strings of length $n$ over $OMEGA$ that contain none of $A,B ^,...,^ T$, except for a single appearance of $H$ at the right end of the string. We define the generating functions $F(z)$ of $f(n)$ and $F sub H (z)$ of $f sub H (n)$ by (4.1) and (4.3). .sp .I Theorem 4.1 ([13,25]). If $"{" A ^,...,^ T "}"$ is a reduced set of patterns, the generating functions $F(z) , ~F sub A (z) ^,...,^ F sub T (z)$ satisfy the following system of linear equations: .DS 3 .EQ (z-q) F(z) ~mark +~ z F sub A (z) ~+~ z F sub B (z) ^+...+^ z F sub T (z) ~=~ z .EN .sp .EQ F(z) ~lineup -~ z A A sub z F sub A (z) ~-~ z B A sub z F sub B (z) ^-...-^ z TA sub z F sub T (z) ~=~ 0 .EN .sp .EQ (4.8) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~... .EN .sp .EQ F(z) ~lineup -~ z AT sub z F sub A (z) ~-~ z B T sub z F sub B (z) ^-...-^ z TT sub z F sub T (z) ~=~ 0 ~. .EN .DE .R .P It is easy to show ([13]) that the system (4.8) is nonsingular, and therefore determines $F(z)$, $F sub A (z) ^,...,^ F sub T (z)$. Moreover, Cramer's rule shows that these generating functions are all rational. When there is only a single excluded word, $A$, the system (4.8) reduces to (4.4). When more than a single word is excluded, the expressions become much more complicated than the simple forms of (4.2) and (4.5), a topic we will return to later. .P Of the many other rational generating functions that arise in string enumeration, we will discuss just one more. Suppose that $A$ is a word over some alphabet $OMEGA$ of $q$ letters, and let $g(n)$ denote the number of strings of length $n$ over $OMEGA$ that contain two distinct appearances of $A$ at the beginning and at the end of the string, but no other occurrences of $A$ as a subword anyplace else. (In particular, $g(n) ~=~ 0$ for $n ~<=~ | A |$.) Then the generating function .DS 3 .EQ (4.9) G(z) ~=~ sum from n=0 to inf ~g(n) z sup -n .EN .DE satisfies .DS 3 .EQ (4.10) G(z) ~=~ z sup {- | A |} ~+~ {(q-z) z sup -1} over {1+ (z-q) AA sub z} ~. .EN .DE The proof of this result, which will be used in the discussion of prefix-synchronized codes in Section 6, is very similar to the proof of the other results mentioned above and can be found in [11]. Estimates for $g(n)$ are also central to the work described in [1,6]. .P There are many other rational generating functions that occur in string enumeration. In fact, essentially all generating functions that enumerate strings which satisfy restrictions on the number of appearances of particular subwords are rational, even when those generating functions are multivariate (cf. [9]). We will not discuss them any further, however, because they usually cannot be utilized profitably in asymptotic enumeration. Even Theorem 4.1 above has its limitations in this regard, as we will show. .H 1 "Elementary applications of generating functions" In this section we discuss some applications of the generating functions presented in Section 4 to subjects other than asymptotic enumeration. .H 2 "Coin-tossing games" The first application is to a rather amusing coin-tossing game. Suppose that $A$ is some sequence of heads or tails, and that we toss a fair coin until $A$ appears as a sequence of consecutive coin tosses (subword). What is the expected number of coin tosses until this happens? We can answer that question very easily using the generating functions of Section 4 (cf. [7,19]). The probability that $A$ does not appear in the first $n$ coin tosses is just $f(n) 2 sup -n$, where $f(n)$ denotes the number of strings of heads or tails of length $n$ that do not contain $A$ as a subword. Hence the expected number of coin tosses until $A$ appears is .DS 3 .EQ sum from n=0 to inf ~f(n) 2 sup -n ~=~ F(2) ^, .EN .DE and by (4.2) this is $2 AA sub 2$. In particular, this waiting time is strongly dependent on the periods of $A$. .P W. Penney [22] has invented a penny-tossing game that exhibits the somewhat paradoxical feature of having a nontransitive dominance relation. In Penney's game, two players agree on some integer $k ~>=~ 2$ at the start of the game. Player I then selects a pattern $A$ of heads and tails of length $k$. Player II is shown $A$, and then proceeds to select another pattern $B$ of $k$ heads and tails. A fair coin is then tossed until either $A$ or $B$ appears as a subword of the sequence of outcomes. If $A$ appears first, Player I wins, and if $B$ appears first, Player II wins. .P There are several possible reactions to Penney's game. One is to say that clearly the game is fair as everything is open. Another is to note that if the two players made their choices of $A$ and $B$ independently of each other (with the event $A ~=~ B$ resulting in a draw) the game would obviously be fair because of the symmetry of the players' positions, and so Penney's game ought to be favorable to Player II since that player receives additional information, namely the choice of Player I. A third reaction (usually restricted to those who know the waiting time result discussed above) is to say that Player I should have the edge in Penney's game, since he can choose $A$ to have very small waiting time. .P The most interesting feature of Penney's game is that if $k ~>=~ 3$, then no matter what $A$ is, Player II can choose $B$ so as to obtain probability of winning strictly greater than 1/2. (If $k ~=~ 2$ and $A ~=~ HT$, say, then Player II can only draw.) Thus this game is nontransitive in that for any $A$, one can find $B$ that beats it, but then one can find $C$ that beats $B$, etc. We do not have space to discuss the theory of Penney's game to any great extent (see [13]), but we indicate how the generating functions of Section 4 can be used in its analysis. .P Let $"{" A, B "}"$ be the set of excluded patterns. Then solving the linear system of Theorem 4.1 gives .DS 3 .EQ (5.1) F sub A (z) ~=~ {BB sub z ~-~ BA sub z} over {(z-2) (AA sub z ~cdot~ BB sub z - AB sub z ~cdot~ BA sub z ) ~+~ AA sub z + BB sub z - AB sub z - BA sub z}~. .EN .DE Now $f sub A (n) 2 sup -n$ is the probability that Player I wins exactly on the $n$-th throw. Hence Player I wins with probability .DS 3 .EQ (5.2) sum from n=0 to inf ~f sub A (n) 2 sup -n ~=~ F sub A (2) ~=~ {BB sub 2 ~-~ BA sub 2} over {AA sub 2 + BB sub 2 ~-~ AB sub 2 - BA sub 2}~. .EN .DE Similarly the probability that Player II wins is given by a formula like (5.2) but with $A$ and $B$ interchanged. Hence the odds of Player II winning are given by .DS 3 .EQ (5.3) {AA sub 2 ~-~ AB sub 2} over {BB sub 2 ~-~ BA sub 2}~. .EN .DE This elegant formula, first proved by Conway by a much more complicated method, can be used to analyze Penney's game, and to show that Player II always has a winning strategy. Together with some information about possible sets of periods of strings, it can also be used to determine the optimal choice for $B$. Details can be found in [13]. .H 2 "Worst-case behavior of pattern-matching algorithms" The next application of the generating function of Section 4 is to the study of pattern-matching algorithms. We consider the problem of deciding whether a given word $A$ of length $k$ occurs in a string $S$ of length $n$, with the cost of the algorithm being measured by the number of characters of $S$ that the algorithm looks at. Ever since the appearance of [17], it has been known that one can decide whether $A$ appears in $S$ in $O(n)$ character comparisons, and a lot of work has been done on deriving improved algorithms. Furthermore, starting with the Boyer-Moore algorithm [3], methods have been known that on average look at only $alpha n$ of the characters of $S$ for some $alpha ~<~ 1$ (which depends on the alphabet and on the algorithm). This situation gave rise to the question whether there is any pattern-matching algorithm that is sublinear in the worst case; i.e., which never looks at more than $beta n$ characters of $S$ for $n ~>=~ n sub 0$, where $beta ~<~ 1$. This question was answered in the negative by Rivest [24]. We present a simplified version of Rivest's proof which relies on the explicit form of the generating function (4.2) [13]. The problem is not trivial, since in many cases it is possible to decide whether $A$ appears in $S$ without looking at all the characters of $S$. For example, if $A ~=~ 00$ and $n ~=~ 4$, so $S ~=~ s sub 1 s sub 2 s sub 3 s sub 4$, we can query whether $s sub 2 ~=~ 0$. If the answer is that $s sub 2 ~!=~ 0$, then there will be no need to even look at $s sub 1$. Hence we may assume that the answer is that $s sub 2 ~=~ 0$. But then we can query whether $s sub 3 ~=~ 0$. If the answer is that $s sub 3 ~=~ 0$, then $A ~=~ s sub 2 ~s sub 3$, and we do not need to look at $s sub 1$ and $s sub 4$. If the answer is that $s sub 3 ~!=~ 0$, though, we do not need to look at $s sub 4$. Thus we can always decide whether $A$ is a subword of $S$ by looking at $<= 3$ characters of $S$. Note that we have phrased the question in terms of deciding whether $A$ occurs as a subword of $S$, instead of asking for the determination of all such occurrences of $A$. In the latter situation the problem is indeed trivial (at least in general), since if $A ~=~ 0...0$, and the answer to a query about any character is that it's 0, any algorithm will clearly have to look at all characters of $S$ to determine all the $n-k+1$ occurrences of $A$. .P To prove that there is no sublinear worst-case algorithm for determining whether $A$ is a subword of $S$, suppose that there is an algorithm which, given any string $S ~=~ s sub 1 ^...^ s sub n$ of length $n$, can determine whether $A$ is a subword of $S$ by looking at $<= ~n-1$ characters of $S$. Suppose that for a particular string $S$, the algorithm looks only at the characters in positions $i sub 1 ^,...,^ i sub r$. Then, for any of the $q sup n-r$ strings $S prime ~=~ s prime sub 1 ^...^ s prime sub n$ which have $s prime sub {i sub j} ~=~ s sub {i sub j}$, $1 <= j <= r$, the algorithm will examine the same characters as it does in examining $S$, and will come to the same conclusion as to whether $A$ is a subword of $S$ or not. Since $r ~<=~ n-1$, $q ~ | ~ q sup n-r$, and so the total number of strings that do not contain $A$ as a subword is divisible by $q$. This can of course happen for some values of $n$. The generating function (4.2) shows that this cannot happen for too many values of $n$. .P Let $f(n)$ be the number of strings of length $n$ that do not contain $A$ as a subword, with $f(0) ~=~ 1$. Then (4.2) implies that if .DS 3 .EQ (5.4) 1+ (z-q) ~AA sub z ~=~ z sup k ~-~ sum from j=0 to k-1 ~h sub j z sup j ^, .EN .DE then .DS 3 .EQ f(n) ~=~ sum from j=0 to k-1 ~h sub j f(n-k+j) ~~~roman for ~~n >= k ~. .EN .DE But by (5.4), $h sub 0 ~==~ 1 ~( roman mod ~q)$, so .DS 3 .EQ (5.5) f(n-k) ~==~ f(n) ~-~ sum from j=1 to k-1 ~h sub j f(n-k+j) ~~~( roman mod ~q) .EN .DE for $n ~>=~ k$. If for some $n$ we had $f(n) ~==~ f(n-1) ^ == ... == ^ f(n-k+1) ~==~ 0 ~( roman mod ~q)$, then by (5.5) we would have $f(n-k) ~==~ 0 ~( roman mod ~q)$, and by induction $f(m) ~==~ 0 ~( roman mod ~q)$ for all $m$, $0 <= m <= n$. But that contradicts $f(0) ~=~ 1$. .P The conclusion to be drawn from the above discussion is that among any $k$ consecutive values of $n$, there is at least one for which $f(n)$ is not divisible by $q$. For such a value of $n$, any algorithm will have to examine all $n$ characters of some string $S$ to determine whether $A$ is a subword of $S$ or not. Since the number of characters that have to be examined in the worst case is a monotone function of $n$, this also shows that for any $n$, at least $n-k+1$ characters have to be examined in some string $S$, and so there is no algorithm that is sublinear in the worst case. .H 2 "Unavoidable sets of words" The final application to be discussed in this section is to the problem of determining when a set of words is \f2unavoidable\f1; i.e., any sufficiently long string over the given alphabet contains at least one of these words as a subword. As an example, over the binary alphabet, the set $"{" 000, ~1111,~ 01 "}"$ is unavoidable. How can we recognize when a set is unavoidable? One way is to use Theorem 4.1. If $"{" A,B ^,...,^ T "}"$ is the set in question, we can use the system (4.8) to determine $F(z)$, the generating function for the number of strings that contain none of $"{" A ^,...,^ T "}"$ as subwords. This set of words is then unavoidable if and only if $F(z)$ is a polynomial in $z sup -1$. .P Another way to determine unavoidability of a set of words is through the construction of a finite-state automaton recognizing the words $A ^,...,^ T$. The two approaches seem to be of comparable complexity, but the one presented above might be somewhat simpler to implement. .P The main reason for discussing the problem of determining unavoidability here is that it points out some of the limitations of the use of generating functions in enumeration. Unavoidable sets $"{" A ^,...,^ T "}"$ lead to the collapse of the generating functions $F(z)$ to polynomials in $z sup -1$. Yet almost all the basic asymptotic enumeration results, such as those of Section 6, depend on the use of Lemma 2.1, which depends on the generating function having a single dominant singularity away from 0. Thus we cannot expect to be able to apply the simple method of Lemma 2.1 even to all the generating functions determined by Theorem 4.1. .H 1 "Asymptotic enumeration of strings" The end of the preceding section concluded with remarks about the limitations on the use of generating functions in asymptotic string enumeration. This section demonstrates that in spite of these limitations, numerous interesting asymptotic estimates can be obtained. The basic method relies on nothing more sophisticated than Lemma 2.1 and Rouche's theorem. However, usually substantial work is needed to apply these basic tools. .P Perhaps the simplest case to consider is that of counting the number of strings, $f(n)$, of length $n$ over an alphabet of size 2 that do not contain some fixed pattern $A$ as a subword. (The restriction to $q = 2$ is imposed here for the sake of clarity, to make it easier to compare various quantities.) By (4.2), the generating function of $f(n)$ has the form .DS 3 .EQ (6.1) F(z) ~=~ {z AA sub z} over {1+ (z-2) AA sub z} ~. .EN .DE To apply Lemma 2.1, we need to show that $F(z)$ has a single dominant pole. To do this, we need to study the zeros of the polynomial in the denominator on the right side of (6.1). As it turns out, the only feature of the denominator polynomial that we will use is that $AA sub z$ has all its coefficients 0 and 1. It is easy to obtain the following result (cf. Lemma 2 of [11]). .sp \f2Lemma 6.1\f1. .I There is a constant $c sub 1 ~>~ 0$ such that if $u(z)$ is a polynomial, .DS 3 .EQ (6.2) u(z) ~=~ z sup k-1 ~+~ sum from j=2 to k ~u sub j z sup k-j ~~,~~ u sub j ~=~ 0 ~italic or ~1 ^, .EN .DE then for $| z | ~>=~ 1.7$, .DS 3 .EQ (6.3) | u (z) | ~>~ c sub 1 ~(1.7) sup k ~. .EN .DE .R .P Using Lemma 6.1, it is easy to show that $F(z)$ has a single dominant singularity (cf. Lemma 3 of [11]). .sp \f2Lemma 6.2\f1. .I There is a constant $c sub 2 ~>~ 0$ such that if $u(z)$ satisfies the conditions of Lemma 6.1 and $k ~>=~ c sub 2$, the $1+(z-2) ^ u(z)$ has exactly one zero $rho$ (of multiplicity one) with $| rho | ~>=~ 1.7$. .R .sp \f2Proof\f1. On the circle $| z | ~=~ 1.7$, .DS 3 .EQ | z-2 | ^ | u(z) | ~>~ 0.3 c sub 1 (1.7) sup k ~>~ 1 ~~roman for ~~ k ~>=~ c sub 2 ~. .EN .DE Hence by Rouche's theorem, $(z-2) ^ u(z)$ and $1+(z-2) ^ u(z)$ have exactly the same number of zeros in $| z | ~>=~ 1.7$, namely one. \(sq .P It is also possible to localize the dominant zero $rho$ quite precisely. In a neighborhood of $z=2$, $u(z)$ is essentially $u(2)$, so we expect the zero $rho$ to be close to the solution of $1+(z-2) ^ u(2) ~=~ 0$; i.e., $rho ~approx~ 2 ~-~ u(2) sup -1$. This can be shown to be true, and we obtain the following result. (A more precise estimate is contained in Lemma 4 of [11].) .sp \f2Lemma 6.3\f1. .I If $u(z)$ satisfies the conditions of Lemma 6.1, and $k ~>=~ c sub 2$, then the single zero $rho$ of $1+ (z-2) ^ u(z)$ in $| z | ~>=~ 1.7$ satisfies .DS 3 .EQ rho ~=~ 2 ~-~ 1 over {u(2)} ~-~ {u prime (2)} over {u(2) sup 3} ~+~ O( k sup 2 2 sup -3k ) ~~~italic as ~~ k ~->~ inf ~. .EN .DE .R .P On the circle $| z | ~=~ 1.7$, the generating function $F(z)$ of (6.1) is $O(1)$ by Lemma 6.1. Hence Lemma 2.1 applies directly, and we find that (cf. Eq.\ (7.3) of [13]) .DS 3 .EQ (6.4) f(n) ~=~ {rho sup n} over {1-(2- rho ) sup 2 u prime ( rho )} ~+~ O( (1.7) sup n ) ~~~roman as ~~ n ~->~ inf ^, .EN .DE where $u(z) ~=~ AA sub z$. .P Estimates of the form (6.4) can be used in a variety of applications. We will discuss only the case of prefix-synchronized codes. A code $scrC$ of block length $n$ (over the binary alphabet, say) is said to be \f2comma-free\f1 if for every pair $( a sub 1 ^,...,^ a sub n )$, $(b sub 1 ^,...,^ b sub n )$ of codewords in $scrC$, no subword of length $n$ of their concatenation $( a sub 1 ^,...,^ a sub n , ~b sub 1 ^,...,^ b sub n )$, other than the first and last blocks of length $n$, is a codeword of $scrC$. When such a code is used in a communication system, the receiver can always recover synchronization by examining the transmitted signal until a codeword is encountered. It is known [5,16] that at least for odd $n$, comma-free codes of length $n$ can be found which are of size .DS 3 .EQ (6.5) n sup -1 ~2 sup n (1+ o(1)) ~~roman as ~~ n ~->~ inf ~. .EN .DE One disadvantage of general comma-free codes, though, is that it is usually hard to decide whether a word is in the code. In order to overcome this disadvantage, E. N. Gilbert invented \f2prefix-synchronized codes\f1, a special class of comma-free codes. Such a code is characterized by a prefix $A$ of length $k ~<~ n$. Every codeword has $A$ as a prefix, and the remaining $n-k$ characters of the codewords are chosen so that in any concatenation of codewords, $A$ occurs only at the beginnings of codewords. Synchronization is then very easy, since all that is needed is to scan the incoming stream of characters until the prefix $A$ is encountered, and that can be done very efficiently using string searching algorithms. The question then arises as to the efficiency of prefix-synchronized codes. Gilbert [8] showed that one can construct such codes of size .DS 3 .EQ n sup -1 ~2 sup n-1 (1+ o(1)) ~ log~ 2 ~wig~ (0.346...) n sup -1 2 sup n ~~roman as ~~ n ~->~ inf ^, .EN .DE which is not much less than the bound (6.5) attainable (at least for odd $n$) by maximal comma-free codes. Gilbert also conjectured that optimal prefixes can be chosen to be of the form $A ~=~ 11...110$ for an appropriate length $k$. This conjecture has been shown to be true, and a more precise estimate of the size of the largest prefix-synchronized code has been obtained [11]. .sp \f2Theorem 6.1\f1. .I Given a positive integer $n$, choose the unique integer $k ~=~ k(n)$ so that $beta ~=~ n 2 sup -k$ satisfies .DS 3 .EQ log ~2 ~<=~ beta ~<~ 2 ~ log ~2 ~. .EN .DE Then the maximal prefix-synchronized code of length $n$ has cardinality .DS 3 .EQ n sup -1 2 sup n ~beta e sup {- beta} ~(1+ o(1)) ~~roman as ~~ n ~->~ inf ~. .EN .DE Moreover, if $n$ is large enough, maximal prefix-synchronized codes can be constructed with prefixes $A ~=~ 11...10$ of length $k-1, ~k$, or $k+1$. .R .P The results of Theorem 6.1 can be generalized (see [11]) to non-binary alphabets. However, if the alphabet is of size $q ~>=~ 5$, then it is not always possible to construct maximal codes using prefixes of the form $A ~=~ 11...10$. For all alphabets, the phenomenon of asymptotic oscillation holds, so that the size of the maximal prefix-synchronized codes is not asymptotic to a constant multiple of $n sup -1 q sup n$. For example, Theorem 6.1 shows that for $q =2$, the ratio of the maximal code size to $n sup -1 2 sup n$ oscillates between $( log ~2 )/2 ~=~ 0.3465...$ and $e sup -1 ~=~ 0.3678...$. .P The proof of Theorem 6.1 relies on the generating function formula (4.10) and the stronger forms of lemmas 6.1-6.3 that are presented in [11]. Note first of all that given a prefix $A$, the allowable codewords are precisely those strings of length $n$ which, when $A$ is adjoined to them at the right end, have $A$ as a prefix and a suffix, but do not contain $A$ as a subword elsewhere. Hence, in the notation of Section 4, the maximal size of a code of length $n$ with prefix $A$ is just $g(n+ | A | )$, for which we have the generating function (4.10). In particular, this formula shows that the size of the maximal code determined by a prefix $A$ depends only on the autocorrelation of $A$, and not on $A$ itself. The proof of Theorem 6.1 is quite involved, but in principle quite straightforward, involving very careful analysis. .P So far we have discussed asymptotic enumeration results only in cases where we were basically excluding a single word $A$, and so the generating functions ((4.2) and (4.10)) had denominators of the simple form $1+ (z-q) AA sub z$. When we count strings that exclude larger numbers of words, we cannot expect our denominator to be this nice and easy to deal with. The unavoidable sets of words discussed at the end of Section 5 show that in fact quite pathological behavior can occur. However, in some cases it is possible to obtain sharp asymptotic estimates from Theorem 4.1. That is the case with repetitive runs, which were studied extensively in [15]. For simplicity we will again consider only the binary alphabet. Let $B ~=~ b sub 1 ^...^ b sub n$ be a word that is nonperiodic; i.e., $B$ cannot be written as $B ~=~ C C ^...^ C$ for any word $C$ shorter than $B$. (As an example, 010 is nonperiodic while 0101 is not.) A \f2B-run\f1 of length $k$ is a word $A ~=~ BB ^...^ B B prime$ of length $k$, where $B prime$ is a prefix of $B$. For example, if $B ~=~ 010$, then 01001 is a $B$-run of length 5. We will discuss the appearance of long $B$-runs in random sequences, when $B$ is held fixed. In many situations, a much more interesting question to study is that of appearance of $B sup *$-runs. A $~B sup *$\f2-run\f1 of length $k$ is a word $A ~=~ B {prime prime} BB ^...^ B B prime$, where $B prime$ is a prefix of $B$ and $B {prime prime}$ is a suffix of $B$. Thus, if $B ~=~ 010$, then 001001 is a $B sup *$-run of length 6. .P The theory of $B$-runs follows easily from the results we have already discussed. If $A ~=~ BB ^...^ B B prime , ~ | A | ~=~ k$, and $f(n)$ denotes the number of binary strings of length $n$ that do not contain $A$, then the generating function for $f(n)$ is given by (4.2). Moreover, since $B$ is nonperiodic, it is easy to show that .DS 3 .EQ AA sub z ~=~ z sup k-1 ~+~ z sup k-m-1 ~+~ z sup k-2m-1 ^+...+^ z sup k-rm-1 ~+~ p(z) ^, .EN .DE where $r ~=~ [k/m]-1$ and $roman deg ~ p(z) ~<~ m$, so we obtain the following result (Theorem 2.2 of [15]). .sp \f2Theorem 6.2\f1. .I If $B$ is a nonperiodic pattern of length $m$, and $A$ is a B-run of length $k$, then the generating function of $f(n)$, the number of strings of length $n$ that do not contain $A$, is of the form .DS 3 .EQ F(z) ~=~ {zQ(z)} over {1+(z-2) Q(z)} ~, .EN .DE where .DS 3 .EQ Q(z) ~=~ {z sup k+m-1} over {z sup m -1} ~+~ q(z) over {z sup m -1} ~, .EN .DE and $q(z)$ is a polynomial of degree $< 2m$ with coefficients bounded by 2 in absolute value. .R .P The form of $F(z)$ given by Theorem 6.2 is sufficiently explicit to obtain very good estimates for the dominant singularity $rho$ and the residue at $rho$ so that .DS 3 .EQ (6.6) f(n) ~approx~ 2 sup n ~exp (- n (1- 2 sup -m ) 2 sup -k ) .EN .DE is a very good approximation. (See [15] for more precise estimates.) .P The theory of $B sup *$-runs is considerably more complicated. The number of strings of length $n$ that do not contain a $B sup *$-run of length $k$ is the number $f(n)$ of Theorem 4.1 when we exclude the $m$ words $A(1) ^,...,^ A(m)$ of length $k$ given by .DS 3 .EQ A(r) ~=~ b sub m-r+1 ~b sub m-r+2 ^...^ b sub m ~BB ^...^ B B prime ^, .EN .DE where $B prime ~=~ B prime (r)$ is a prefix of $B$ of the appropriate length. The denominator of the generating function $F(z)$ of $f(n)$ is then the determinant of an $m+1$ by $m+1$ matrix. The entries of that matrix are very strongly constrained; for example, if $1 <= r <= s <= m$, then it can be shown that .DS 3 .EQ A(r) A(s) sub z ~=~ {z sup k-1-s+r ~+~ g sub s,r (z)} over {z sup m -1} ~, .EN .DE where $g sub s,r (z)$ is a polynomial of degree $< 2m$ with coefficients that are $<= 2$ in absolute value. Therefore by going through some very involved algebra, one can obtain a relatively explicit form for $F(z)$ (Theorem 2.3 of [15]). .sp \f2Theorem 6.3\f1. .I If $B$ is a nonperiodic word of length $m$, then the generating function $F(z)$ of $f(n)$, the number of binary strings of length $n$ that contain no $B sup *$-run of length $k ~>=~ 3m$ has the form .DS 3 .EQ F(z) ~=~ {z sup (k+1)m+1 ~(z sup m -1) sup m-1 ~+~ g sub 1 (z)} over {(z-2) z sup (k+1)m ( z sup m -1) sup m-1 ~+~ g sub 2 (x)} ~. .EN .DE Here $g sub 1 (z)$ and $g sub 2 (z)$ are polynomials of degrees $<=~ k(m-1) + c sub 1$ with coefficients that are $<= c sub 2$ in absolute value, and $c sub 1$ and $c sub 2$ are constants that depend only on $m$, but not on $k$. Furthermore, .DS 3 .EQ g sub 1 (2) ~mark =~ O(2 sup k(m-1) )^, .EN .sp .EQ g sub 2 (2) ~lineup =~ m 2 sup km-k+m (2 sup m -1) sup m-1 ~+~ O( 2 sup k(m-2) ) ^, .EN .DE where the constants implied by the O-notation depend only on $m$. .R .P Once Theorem 6.3 is available, it is not too difficult to show that $F(z)$ has a dominant singularity and to estimate its location and residue. One finds that .DS 3 .EQ (6.7) f(n) ~approx~ 2 sup n ~exp ( - nm 2 sup -k-1 ) .EN .DE is a very good approximation in this case (cf. (6.6), which corresponds to $B$-runs). .P Once estimates like (6.6) and (6.7) (or rather, the more precise estimates established in [15]) are obtained, one can obtain a variety of results from them. For example, of $E sub B (n)$ denotes the expected length of a $B$-run in a random string of length $n$, and $E sub B sup * (n)$ the expected length of a $B sup *$-run, then [15] .DS 3 .EQ E sub B (n) ~mark =~ roman lg ~ n ~-~ 1/2 ~+~ gamma ( log ~2 ) sup -1 ~+~ roman lg (1- 2 sup -m ) .EN .sp .EQ ~lineup +~ v( roman lg ~ n ~+~ roman lg (1- 2 sup -m )) ~+~ o(1) ~~roman as ~~ n ~->~ inf ^, .EN .DE .DS 3 .EQ E sub B sup * (n) ~mark =~ roman lg ~n ~-~ 3/2 ~+~ gamma ( log ~2 ) sup -1 ~+~ roman lg ~m .EN .sp .EQ ~lineup +~ v( roman lg ~n ~+~ roman lg ~m ) ~+~ o(1) ~~roman as ~~ n ~->~ inf ^, .EN .DE where $roman lg ~x$ is the logarithm of $x$ to base 2, $gamma$ is Euler's constant (0.577...), and $v(x)$ is a certain nonconstant continuous periodic function of period 1 and mean 0. (There is a mistake in [15] in the statement of the result about $E sub B (n)$, with -1/2 incorrectly replaced by -5/2.) When $B$ is a single letter, such results were obtained first by Boyd [2]. .PH "" .bp .ce .B References .R .sp .AL .LI A. Apostolico and A. S. Fraenkel, Fibonacci representations of strings of varying length using binary separators, to be published. .LI D. W. Boyd, Losing runs in Bernoulli trials, unpublished manuscript (1972). .LI R. S. Boyer and J. S. Moore, A fast string searching algorithm, \f2Comm. ACM\f1 \f220\f1 (1977), 762-772. .LI N. G. de Bruijn, \f2Asymptotic Methods in Analysis\f1, North-Holland 1958. .LI W. L. Eastman, On the construction of comma-free codes, \f2IEEE Trans. Information Theory\f1 \f2IT-11\f1 (1965), 263-266. .LI A. S. Fraenkel, The use and usefulness of numeration systems, to be published. .LI H. U. Gerber and S. R. Li, The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain, \f2Stoch. Proc. Appl. 11\f1 (1981), 101-108. .LI E. N. Gilbert, Synchronization of binary messages, \f2IRE Trans. Inform. Theory\f1 \f2IT-6\f1 (1960), 470-477. .LI I. P. Goulden and D. M. Jackson, \f2Combinatorial Enumeration\f1, Wiley 1983. .LI L. J. Guibas, Periodicities in strings, to be published in this volume. .LI L. J. Guibas and A. M. Odlyzko, Maximal prefix-synchronized codes, \f2SIAM J. Appl. Math.\f1 \f235\f1 (1978), 401-418. .LI L. J. Guibas and A. M. Odlyzko, A new proof of the linearity of the Boyer-Moore string searching algorithm, pp. 189-195 in Proc. 18th IEEE Found. Comp. Sci. Symposium, 1977. Revised version in \f2SIAM J. Comput.\f1 \f29\f1 (1980), 672-682. .LI L. J. Guibas and A. M. Odlyzko, String overlaps, pattern matching, and nontransitive games, \f2J. Comb. Theory (A)\f1 \f230\f1 (1981), 183-208. .LI L. J. Guibas and A. M. Odlyzko, Periods in strings, \f2J. Comb. Theory (A)\f1 \f230\f1 (1981), 19-42. .LI L. J. Guibas and A. M. Odlyzko, Long repetitive patterns in random sequences, \f2Z. Wahrscheinlichkeits. v. Geb.\f1 \f253\f1 (1980), 241-262. .LI B. H. Jiggs, Recent results in comma-free codes, \f2Canad. J. Math.\f1 \f215\f1 (1963), 178-187. .LI D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast pattern matching in strings, \f2SIAM J. Comput.\f1 \f26\f1 (1977), 323-350. .LI K. K. Kubota, On a conjecture of Morgan Ward, I and II, \f2Acta Arith. 33\f1 (1977), 27-48. .LI P. T. Nielsen, On the expected duration of a search for a fixed pattern in random data, \f2IEEE Trans. Inform Theory IT-19\f1 (1973), 702-704. .LI A. M. Odlyzko, Some new methods and results in tree enumeration, \f2Congressus Numerantium\f1 \f242\f1 (1984), 27-52. .LI A. Odlyzko and B. Richmond, On the compositions of an integer, pp. 199-210 in \f2Combinatorial Mathematics VII\f1, Proc. Seventh Austr. Conf. Comb. Math., R. W. Robinson, G. W. Southern, and W. D. Wallis, eds., Lecture Notes in Mathematics #829, Springer 1980. .LI W. Penney, Problem: penney-ante, \f2J. Recreational Math.\f1 \f22\f1 (1969), 241. .LI P. Re\*'ve\*'sz, Strong theorems on coin tossing, pp. 749-754 in \f2Proc. 1978 Int'l. Congress of Mathematicians\f1, Helsinki, 1980. .LI R. L. Rivest, On the worst-case behavior of string-searching algorithms, \f2SIAM J. Comput.\f1 \f26\f1 (1977), 669-674. .LI S. W. Roberts, On the first occurrence of any of a selected set of outcome patterns in a sequence of repeated trials, unpublished Bell Laboratories report (1963). .LI A. D. Solov'ev, A combinatorial identity and its application to the problem concerning the first occurrences of a rare event, \f2Theory Prob. Appl.\f1 \f211\f1 (1966), 276-282. .LE