Quantcast
Viewing all articles
Browse latest Browse all 426

Finding all Palindromes Contained in Strings with SQL

SQL is a really cool language. I can write really complex business logic with this logic programming language. I was again thrilled about SQL recently, at a customer site:

But whenever I tweet something like the above, the inevitable happened. I was nerd sniped. Oleg Šelajev from ZeroTurnaround challenged me to prove that SQL is so awesome:

Given a string, find all substrings from that string, which are palindromes. Challenge accepted! (For the moment, let’s forget about algorithmic complexity.)

Here’s the Full Query

Spoiler first.

Bear with me, I’ll explain it step by step afterwards. Here’s with PostgreSQL syntax:

WITH RECURSIVE
  words (word) AS (
    VALUES
      ('pneumonoultramicroscopicsilicovolcanoconiosis'),
      ('pseudopseudohypoparathyroidism'),
      ('floccinaucinihilipilification'),
      ('antidisestablishmentarianism'),
      ('supercalifragilisticexpialidocious'),
      ('incomprehensibilities'),
      ('honorificabilitudinitatibus'),
      ('tattarrattat')
  ),
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  ),
  palindromes (word, palindrome, start, length) AS (
    SELECT word, substring(word, start, x), start, x
    FROM starts CROSS JOIN (VALUES(0), (1)) t(x)
    UNION ALL
    SELECT word, palindrome, start, length + 2
    FROM (
      SELECT
        word,
        substring(word, start - length / 2, length) AS palindrome,
        start, length
      FROM palindromes
    ) AS p
    WHERE start - length / 2 > 0
    AND start + (length - 1) / 2 <= length(word)
    AND substring(palindrome, 1, 1) =
        substring(palindrome, length(palindrome), 1)
  )
SELECT DISTINCT
  word,
  trim(replace(word, palindrome, ' ' || upper(palindrome) || ' '))
    AS palindromes
FROM palindromes
WHERE length(palindrome) > 1
ORDER BY 2

(You can run it yourself on SQLFiddle)

The result being:

word                                          |palindromes
----------------------------------------------|-----------------------------------------------
antidisestablishmentarianism                  |ant  IDI  sestablishmentarianism
antidisestablishmentarianism                  |antidi  SES  tablishmentarianism
floccinaucinihilipilification                 |flo  CC  inaucinihilipilification
floccinaucinihilipilification                 |floccinauc  INI  hilipilification
floccinaucinihilipilification                 |floccinaucin  IHI  lipilification
floccinaucinihilipilification                 |floccinaucinih  ILI  p  ILI  fication
floccinaucinihilipilification                 |floccinaucinih  ILIPILI  fication
floccinaucinihilipilification                 |floccinaucinihi  LIPIL  ification
floccinaucinihilipilification                 |floccinaucinihil  IPI  lification
floccinaucinihilipilification                 |floccinaucinihilipil  IFI  cation
honorificabilitudinitatibus                   |h  ONO  rificabilitudinitatibus
honorificabilitudinitatibus                   |honor  IFI  cabilitudinitatibus
honorificabilitudinitatibus                   |honorificab  ILI  tudinitatibus
honorificabilitudinitatibus                   |honorificabilitud  INI  tatibus
honorificabilitudinitatibus                   |honorificabilitudin  ITATI  bus
honorificabilitudinitatibus                   |honorificabilitudini  TAT  ibus
incomprehensibilities                         |incompr  EHE  nsibilities
incomprehensibilities                         |incomprehens  IBI  lities
incomprehensibilities                         |incomprehensib  ILI  ties
incomprehensibilities                         |incomprehensibil  ITI  es
pneumonoultramicroscopicsilicovolcanoconiosis |pneum  ONO  ultramicroscopicsilicovolcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopics  ILI  covolcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilic  OVO  lcanoconios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolca  NOCON  ios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolcan  OCO  nios
pneumonoultramicroscopicsilicovolcanoconiosis |pneumonoultramicroscopicsilicovolcanoconio  SIS
pseudopseudohypoparathyroidism                |pseudopseudohy  POP  arathyroidism
pseudopseudohypoparathyroidism                |pseudopseudohypop  ARA  thyroidism
pseudopseudohypoparathyroidism                |pseudopseudohypoparathyro  IDI  sm
supercalifragilisticexpialidocious            |supercalifrag  ILI  sticexpialidocious
tattarrattat                                  |t  ATTA  rr  ATTA  t
tattarrattat                                  |t  ATTARRATTA  t
tattarrattat                                  |ta  TT  arra  TT  at
tattarrattat                                  |ta  TTARRATT  at
tattarrattat                                  |tat  TARRAT  tat
tattarrattat                                  |TAT  tarrat  TAT
tattarrattat                                  |tatt  ARRA  ttat
tattarrattat                                  |tatta  RR  attat
tattarrattat                                  |TATTARRATTAT

This query uses a couple of nice features:

Common Table Expressions

They’re the only way to declare variables in SQL – i.e. you can “store” a query in such a table expression, assign a name to it, and reuse it several times. This is done with the WITH clause. The nice thing about common table expressions is that they are allowed to be RECURSIVE (depending on the database, the RECURSIVE keyword may be required / optional / not available).

VALUES() clause

The VALUES() clause is a very handy tool to create ad-hoc data in the form of tables. We did this to create a table called WORDS, which contains a couple of words inside of which we’d like to look for palindromes. In some databases (including DB2 and PostgreSQL), it’s totally possible to just use the VALUES() clause as a standalone clause instead of SELECT:

VALUES
  ('pneumonoultramicroscopicsilicovolcanoconiosis'),
  ('pseudopseudohypoparathyroidism'),
  ('floccinaucinihilipilification'),
  ('antidisestablishmentarianism'),
  ('supercalifragilisticexpialidocious'),
  ('incomprehensibilities'),
  ('honorificabilitudinitatibus'),
  ('tattarrattat')

Recursive Generation of Integers

In order to look for palindromes, the algorithm used here lists, for each word, the character index of each individual character in the word:

WITH RECURSIVE
  ...
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  ),
  ...

For example, if we used the word “word”…

WITH RECURSIVE
  words (word) AS (VALUES('word')),
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  )
SELECT *
FROM starts

We’d get the following result:

word |start
-----|------
word |1
word |2
word |3
word |4

The idea of the algorithm is that we start from a given character and “fan out” in both directions of a character, again recursively. The characters to the left and to the right of such a character must be the same and so on.

If you’re interested in the syntax, stay tuned. There will be a blog post coming up soon.

Using CROSS JOIN to run the Algorithm Twice

There are two types of palindromes:

  • Those with an even amount of characters: “”, “aa”, “abba”, “babbab”
  • Those with an odd amount of characters: “b”, “aba”, “aabaa”

In our algorithm, we’re treating them in almost the same way by running the algorithm twice. Whenever you think “hey, I need to run this twice in SQL”, CROSS JOIN could be a good candidate, as we’re creating a cartesian product between:

  • The previous “starts” table, giving the starting character index
  • A table containing values 0 (palindromes with even amounts of letters) and 1 (palindromes with odd amounts of letters)

For more information about CROSS JOIN, read our article about JOINs.

Run this query for illustration:

WITH RECURSIVE
  words (word) AS (VALUES('word')),
  starts (word, start) AS (
    SELECT word, 1 FROM words
    UNION ALL
    SELECT word, start + 1 FROM starts WHERE start < length(word)
  )
SELECT *
FROM starts CROSS JOIN (VALUES(0),(1)) AS t(x)

Output:

word |start |x
-----|------|--
word |1     |0  <-- Even length palindromes centering around position 1 length 0
word |1     |1  <-- Odd length palindromes centering around position 1 length 1
word |2     |0
word |2     |1
word |3     |0
word |3     |1
word |4     |0
word |4     |1

Trivially, a palindrome of length 0 (empty string) or length 1 (“w”, “o”, “r”, “d”) is acceptable in principle, but boring. We’ll filter them out later, but keep them in the algorithm as this simplifies the algorithm. If you want to tune the query, you could prevent generating them in the first place.

The Palindrome Algorithm

Thus far, we’ve just prepared utility data to calculate palindromes:

  • The words inside of which to search for palindromes
  • The individual character indexes to start fanning out from, in each individual word
  • The minimum palindrome length 0 (even) or 1 (odd)

Now comes the interesting part. Recursively fanning out from a starting character to find more palindromes, stopping the fanning out as soon as the new candidate is no longer a plaindrome:

WITH RECURSIVE
  ...
  palindromes (word, palindrome, start, length) AS (

    -- This part just creates the even/odd semantics
    SELECT word, substring(word, start, x), start, x
    FROM starts CROSS JOIN (VALUES(0), (1)) t(x)
    UNION ALL

    -- This part recurses by "fanning out"
    SELECT word, palindrome, start, length + 2
    FROM (
      SELECT
        word,
        substring(word, start - length / 2, length) AS palindrome,
        start, length
      FROM palindromes
    ) AS p
    WHERE start - length / 2 > 0
    AND start + (length - 1) / 2 <= length(word)
    AND substring(palindrome, 1, 1) =
        substring(palindrome, length(palindrome), 1)
  )
  ...

It isn’t really so hard in fact. The recursion part selects recursively from the PALINDROMES table (the previously calculated palindromes). That table has 4 columns:

  • WORD: The word we’re looking for palindromes in. This is always the same per recursion
  • PALINDROME: The palindrome, i.e. the substring inside of a word. This changes per recursion
  • START: The start character index from which we started fanning out. This is always the same per recursion
  • LENGTH: The palindrome length. This increases by 2 per recursion

Let’s look at the result for “floccinaucinihilipilification”:

flo  CC  inaucinihilipilification
floccinauc  INI  hilipilification
floccinaucin  IHI  lipilification
floccinaucinih  ILI  p  ILI  fication
floccinaucinih  ILIPILI  fication
floccinaucinihi  LIPIL  ification
floccinaucinihil  IPI  lification
floccinaucinihilipil  IFI  cation

There are a total of 8 distinct palindromes contained in this word (and believe it or not, the word exists, too. Pronunciation is another thing).

Let’s “debug” the algorithm to get to this list (remember, SQL indexes are 1 based):

Start 1-4: No palindromes
Start 5: Even palindrome [4:5]
  flo  CC  inaucinihilipilification

Start 6-11: No palindromes (I wont' repeat this further down)
Start 12: Odd palindrome [11:13]
  floccinauc  INI  hilipilification

Start 14: Odd palindrome [13:15]
  floccinaucin  IHI  lipilification

Start 16: Odd palindrome [15:17]
  floccinaucinih  ILI  p  ILI  fication

Start 18: Odd palindrome [17:19], [16:20], [15:21] (Fanning out 3 times)
  floccinaucinihil  IPI  lification
  floccinaucinihi  LIPIL  ification
  floccinaucinih  ILIPILI  fication

Start 20: Odd palindrome [19:17] (already found this)
  floccinaucinih  ILI  p  ILI  fication

Start 22: Odd palindrome [21:23]
floccinaucinihilipil  IFI  cation

The IPI, LIPIL, ILIPILI chain is the most interesting. We’ve succeeded to fan out 3 times adding new characters from WORD on both sides of the initial character.

When do we stop fanning out? Whenever one of these conditions hold true:

    WHERE start - length / 2 > 0
    AND start + (length - 1) / 2 <= length(word)
    AND substring(palindrome, 1, 1) =
        substring(palindrome, length(palindrome), 1)

I.e.

  • When we’ve reached the beginning of WORD (no more characters to the left)
  • When we’ve reached the end of WORD (no more characters to the right)
  • When the letter to the left of the palindrome of the previous recursion doesn’t match the letter to the right of the palindrome

That last predicate could also simply read:

    AND palindrome = reverse(palindrome)

But that might be a bit slower as it compares something that we’ve already proven to be true in the previous recursion.

Finally, formatting the result

The final part isn’t too interesting anymore:

SELECT DISTINCT
  word,
  trim(replace(word, palindrome, ' ' || upper(palindrome) || ' '))
    AS palindromes
FROM palindromes
WHERE length(palindrome) > 1
ORDER BY 2

We’ll simply:

  • Select DISTINCT results only, as palindromes might appear several times in a word
  • Replace the palindrome substring by its upper case version and add some whitespace to better visualise it. That’s totally optional, of course
  • Remove the trivial palindromes of length 0 and 1

And we’re done! Again, feel free to play around with this on SQLFiddle, or, much better, provide a cool palindrome algorithm implementation in your favourite language in the comments section, or on Twitter:

More beautiful SQL in these articles here:


Filed under: sql Tagged: Common Table Expressions, Palindromes, Recursive SQL, sql Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.
Image may be NSFW.
Clik here to view.

Viewing all articles
Browse latest Browse all 426

Trending Articles