Periodicity of strings

This is a quick introduction to several properties related to the periodicity of strings. Some of these concepts can be useful in understanding the performance of string algorithms, particularly being used to prove that the Boyer-Moore algorithm runs in linear time.

  • Definition 1. A string SS is periodic if S=BBBBB..BS = BBBBB..B for some shorter string BB. The length of BB is called the period of the string.

  • Definition 2. A string SS is suffix semiperiodic if S=BBBB..BαS = BBBB..B\alpha for some prefix α\alpha of BB.

  • Definition 3. A string SS is prefix semiperiodic if S=αBBBB..BS = \alpha BBBB..B for some suffix α\alpha of BB.

It may not be immediately obvious but any string that is prefix semiperiodic is also suffix semiperiodic with the same period. To prove this statement, let's assume S=αBBBBS = \alpha BBBB where α\alpha is a suffix of B. Thus BB can be written as B=βαB = \beta \alpha, and the full string can be rewritten as: S=αβαβαβαβαS=\alpha\beta\alpha\beta\alpha\beta\alpha\beta\alpha. By defining C=αβC=\alpha\beta (the string BB with its prefix and suffix swapped around), we can rewrite S=CCCCαS = CCCC\alpha, i.e., SS is suffix semiperiodic. The period is obviously the same since the lengths of BB and CC are identical (α+β|\alpha| + |\beta|).

It is also possible to show that if a string is semiperiodic with two different periods of lengths p1p_1 and p2p_2 it is also semiperiodic with the greatest common denominator of the two periods GCD(p1,p2)GCD(p_1, p_2).

Question: Prove the claim made in the previous paragraph: A string that is semiperiodic with two periods, it's also semiperiodic with the greatest common denominator of the two periods.

So, what does periodicity have to do with sequence matching? Assume that a pattern PP of length mm matches at two overlapping positions in the text, t1t_1 and t2t_2, and that these positions are spaced by less than half the length of the pattern (t1t2<m2|t_1 - t_2| < \lfloor {m \over 2 } \rfloor ). In this case, we can show that PP is semiperiodic with period t1t2|t_1 - t_2|. To prove this claim you can draw two instances of the pattern aligned to each other such that they overlap by more than 50% of the pattern's length, then carefully map which segments of the pattern match each other, as inferred from the overlapping region.

Question: Prove the claim made in the previous paragraph, rephrased as: If a string overlaps itself by more than half of its length (a prefix of the string longer than half its length perfectly matches a corresponding suffix), then the string is semiperiodic with a period equal to the length of the string not contained in the overlap.

This property can be useful when reasoning about the behavior of string matching algorithm, and is used, for example, to prove that the Boyer-Moore algorithm runs in linear time.

Last updated