Find Consecutive Alphabetical Characters in String — From SQL to SPL #28

esProcesProc
2 min read

Problem description & analysis:

The field of a certain database table is strings.

Task: Now we need to find the strings that contain at least 5 consecutive letters sorted alphabetically, or in other words, the longest substring in a continuous ascending order with a length greater than or equal to 5. Note to exclude non-letter characters.

Code comparisons:

SQL

SELECT value
FROM   table_name t
       CROSS JOIN LATERAL (
         SELECT MIN(lvl) AS start_pos
         FROM   (
           SELECT LEVEL AS lvl,
                  SUBSTR(t.value, LEVEL, 1) AS ch
           FROM   DUAL
           CONNECT BY LEVEL <= LENGTH(t.value)
         )
         MATCH_RECOGNIZE(
           ORDER BY lvl
           MEASURES
             FIRST(lvl) AS lvl
           PATTERN (first_row consecutive{3,})
           DEFINE first_row   AS (ch BETWEEN 'A' AND 'Z' OR ch BETWEEN 'a' AND 'z'),
                  consecutive AS (ch BETWEEN 'A' AND 'Z' OR ch BETWEEN 'a' AND 'z')
                                 AND ASCII(PREV(ch)) + 1 = ASCII(ch)
         )
         HAVING MIN(lvl) > 0
       )

SQL needs to use nested subqueries and CONNECT BY to simulate sequences, and then use MATCH_RECOGNIZE to process sequences. The code is cumbersome and difficult to understand.

SPL: SPL directly provides sequence calculation functions:

👉🏻 Try.DEMO

A1: Load data.

A2: Split the string into a sequence by character, group the sequence, and start a new group when the current member is less than or equal to the previous member or the previous member is not a letter. Select the string with the longest group that has a length greater than or equal to 5. The function group can keep the grouped subsets for subsequent processing after grouping, rather than immediately aggregating them. @i represents grouping according to conditions.


Get Started with esProc SPL — esProc SPL FREE Download.

0
Subscribe to my newsletter

Read articles from esProc directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

esProc
esProc

esProc SPL is a JVM-based programming language designed for structured data computation, serving as both a data analysis tool and an embedded computing engine. FREE download👉🏻: https://www.esproc.com/download-esproc