Jay Grossman
Published on: 14 Sep 2017

Simulating Vertica's conditional_change_event

Lately my team has spent a bunch of time migrating our data warehouse from Vertica to Snowflake. While Snowflake has excellent support for analytic functions, Vertica has some functions that no other columnar database supports.

The conditional change event function "assigns an event window number to each row, starting from 0, and increments by 1 when the result of evaluating the argument expression on the current row differs from that on the previous row".

CONDITIONAL_CHANGE_EVENT ( expression ) OVER ( 
... [ window_partition_clause ] 
... window_order_clause  )

An example using the conditional_change_event function

So let's say we have the following table CCE_demo:

sf_vertica_1

This table contains chronological visits by users using a single browser. There were 2 visits from uid=10137196, then 2 visits by uid=15479000 and then 2 more visits by uid=10137196.

We are interested in identifying each of these 3 groups of visits. A simple "GROUP BY uid" would result in 2 groups.

SELECT
    browser_id
    , browser_visit_start
    , uid
    , (conditional_change_event(uid) 
        over (partition by browser_id order by browser_visit_start asc) + 1) 
        as cluster_id
FROM CCE_Demo

The above SQL in Vertica would result in the following:

sf_vertica_2

How we can do this without conditional_change_event function

TLDR; we will use a series of window functions.

1) we can define the first element in each group of visits with cluster_start=1 (using the LAG analytic function):

-- identify the first record in each group
SELECT
    browser_id
    , browser_visit_start
    , uid
    , CASE
        WHEN
            LAG(uid) over (partition by browser_id 
                order by browser_visit_start asc) IS NULL THEN 1
        WHEN
            LAG(uid) over (partition by browser_id 
                order by browser_visit_start asc) != uid THEN 1
    ELSE 0
    END  as cluster_start
FROM CCE_Demo

sf_vertica_3

2) Identify the unique groups:

-- identify the first record in each group
WITH a as (
    SELECT
        browser_id
        , browser_visit_start
        , uid
        , CASE
            WHEN
                LAG(uid) over (partition by browser_id 
                    order by browser_visit_start asc) IS NULL THEN 1
            WHEN
                LAG(uid) over (partition by browser_id 
                    order by browser_visit_start asc) != uid THEN 1
        ELSE 0
        END  as cluster_start
    FROM CCE_DEMO
)
-- get the unique groups
SELECT
        browser_id
    , browser_visit_start
    , uid
        , cluster_start
    , ROW_NUMBER() over (partition by browser_id 
        order by browser_visit_start) as cluster_id
FROM a
WHERE  cluster_start=1
ORDER BY browser_id, browser_visit_start, uid 

sf_vertica_4

3) We can create date windows for each cluster since we know when each one starts:

-- identify the first record in each group
WITH a as (
    SELECT 
        browser_id
        , browser_visit_start
        , uid 
        , CASE 
            WHEN 
                LAG(uid) over (partition by browser_id 
                    order by browser_visit_start asc) IS NULL THEN 1 
            WHEN
                LAG(uid) over (partition by browser_id 
                    order by browser_visit_start asc) != uid THEN 1 
        ELSE 0 
        END  as cluster_start
    FROM CCE_Demo
),
-- get the unique groups
b as (
    SELECT 
            browser_id
        , browser_visit_start
        , uid 
            , cluster_start
        , ROW_NUMBER() over (partition by browser_id 
            order by browser_visit_start) as cluster_id
    FROM a
    WHERE  cluster_start=1
    ORDER BY browser_id, browser_visit_start, uid  
)
-- assign end dates to the groups, set last group's end date to 2100-01-01 
SELECT
        browser_id
    , browser_visit_start
    , uid
    , cluster_id
    , COALESCE(
            LEAD(browser_visit_start) over
            (partition by browser_id order by browser_visit_start asc),
            '2100-01-01') as end_date
        FROM b

sf_vertica_5

4) Use the date ranges (between browser_visit_start and end_date) to designate the group:

-- identify the first record in each group
WITH a as (
    SELECT 
        browser_id
        , browser_visit_start
        , uid 
        , CASE 
            WHEN 
                LAG(uid) over (partition by browser_id 
                    order by browser_visit_start asc) IS NULL THEN 1 
            WHEN 
                LAG(uid) over (partition by browser_id 
                    order by browser_visit_start asc) != uid THEN 1 
        ELSE 0 
        END  as cluster_start
    FROM CCE_Demo
),
-- get the unique groups
b as (
    SELECT 
            browser_id
        , browser_visit_start
        , uid 
            , cluster_start
        , ROW_NUMBER() over (partition by browser_id 
            order by browser_visit_start) as cluster_id
    FROM a
    WHERE  cluster_start=1
    ORDER BY browser_id, browser_visit_start, uid  
),
-- assign end dates to the groups, set last group's end date to 2100-01-01 
c as
(
SELECT 
        browser_id
    , browser_visit_start
    , uid
    , cluster_id
    , COALESCE(
            LEAD(browser_visit_start) over 
            (partition by browser_id order by browser_visit_start asc),
            '2100-01-01') as end_date 
        FROM b
)
-- use the window between browser_visit_start and end_date to assign the correct group to each record
SELECT
    d.browser_id
    , d.browser_visit_start
    , d.uid
    , cluster_id
FROM CCE_Demo d
LEFT OUTER JOIN c
    ON d.browser_id=c.browser_id
    AND d.uid=c.uid
    AND d.browser_visit_start>=c.start_date
    AND d.browser_visit_start<c.end_date
ORDER BY 
    d.browser_id, d.browser_visit_start 

sf_vertica_6

Other things we could have done

The real data we have in our database that this example was based off of a table with 250 million rows. So taking advantage of scalable query processing in an MPP database was desirable.

1) Build a custom user defined function (using javascript and a FOR loop).
2) Export it to a language like Python or Java and use a FOR loop.

14 Sep 2017

« Snowflake's lateral flatten function on variant data type #1 news story on espn.com details something that happened on SportsCollectors.Net »

JayGrossman.com

Simulating Vertica's conditional_change_event

Popular Tags →