Simulating Vertica's conditional_change_event

14. September 2017 20:19 by Jay Grossman

Lately my team has spent a bunch of time migrating our data warehouse from Vertica to Snowflake. While Snowflake has excellent support for analytic functions, Vertica has some functions that most other columnar databases don't support.

The CONDITIONAL_CHANGE_EVENT function "assigns an event window number to each row, starting from 0, and increments by 1 when the result of evaluating the argument expression on the current row differs from that on the previous row".

CONDITIONAL_CHANGE_EVENT ( expression ) OVER (
    [ window_partition_clause ]
    window_order_clause )
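Since the function's semantics are easy to misread, here is a minimal plain-Python sketch of the same rule (the sample uid values are made up to match the example below):

```python
def conditional_change_event(values):
    # Event number starts at 0 and increments whenever the current
    # value differs from the previous row's value.
    event, prev, out = 0, None, []
    for i, v in enumerate(values):
        if i > 0 and v != prev:
            event += 1
        out.append(event)
        prev = v
    return out

# Two visits by one uid, two by another, then two more by the first:
print(conditional_change_event(
    [10137196, 10137196, 15479000, 15479000, 10137196, 10137196]))
# [0, 0, 1, 1, 2, 2] -- three distinct event windows
```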

 

An example using the conditional_change_event function

So let's say we have a table CCE_Demo with columns browser_id, browser_visit_start, and uid.

This table contains chronological visits by users using a single browser: 2 visits from uid=10137196, then 2 visits by uid=15479000, and then 2 more visits by uid=10137196.

We are interested in identifying each of these 3 groups of visits. A simple "GROUP BY uid" would result in 2 groups.

  SELECT
      browser_id
     , browser_visit_start
     , uid
     , (conditional_change_event(uid) over (partition by browser_id order by browser_visit_start asc) + 1) as cluster_id
  FROM CCE_Demo

The above SQL in Vertica assigns cluster_id = 1 to the first two visits, 2 to the two visits by uid=15479000, and 3 to the final two visits (the +1 shifts the 0-based event numbers so the clusters start at 1).


How we can do this without the conditional_change_event function

1) We can flag the first row in each group of visits with cluster_start=1 (using the LAG analytic function):

-- identify the first record in each group
  SELECT
      browser_id
     , browser_visit_start
     , uid
      , CASE
          WHEN
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) IS NULL THEN 1
          WHEN
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) != uid THEN 1
      ELSE 0
      END  as cluster_start
FROM CCE_Demo
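The LAG logic above boils down to: flag a row when there is no previous row or the previous uid differs. A plain-Python sketch of that rule (not warehouse SQL, just an illustration):

```python
def cluster_start_flags(uids):
    # LAG(uid): a row starts a new cluster when there is no previous
    # row (LAG returns NULL) or the previous row's uid differs.
    flags, prev = [], None
    for uid in uids:
        flags.append(1 if prev is None or prev != uid else 0)
        prev = uid
    return flags

print(cluster_start_flags(
    [10137196, 10137196, 15479000, 15479000, 10137196, 10137196]))
# [1, 0, 1, 0, 1, 0]
```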

 

2) Identify the unique groups:

-- identify the first record in each group
WITH a as (
 SELECT
      browser_id
     , browser_visit_start
     , uid
      , CASE
          WHEN
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) IS NULL THEN 1
          WHEN
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) != uid THEN 1
      ELSE 0
      END  as cluster_start
 FROM CCE_DEMO
)
-- get the unique groups
  SELECT
        browser_id
       , browser_visit_start
       , uid
        , cluster_start
       , ROW_NUMBER() over (partition by browser_id order by browser_visit_start) as cluster_id
   FROM a
   WHERE  cluster_start=1
   ORDER BY browser_id, browser_visit_start, uid 
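In plain-Python terms, filtering to cluster_start=1 and applying ROW_NUMBER just numbers the surviving start rows in visit order (a sketch of the step, not the SQL itself):

```python
def number_start_rows(flags):
    # Keep only the cluster_start=1 rows and number them 1, 2, 3, ...
    # in order, mirroring ROW_NUMBER() applied after the WHERE filter.
    cluster_ids = []
    for flag in flags:
        if flag == 1:
            cluster_ids.append(len(cluster_ids) + 1)
    return cluster_ids

print(number_start_rows([1, 0, 1, 0, 1, 0]))  # [1, 2, 3]
```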

 

3) We can create date windows for each cluster since we know when each one starts:

-- identify the first record in each group
WITH a as (
 SELECT 
      browser_id
     , browser_visit_start
     , uid 
      , CASE 
          WHEN 
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) IS NULL THEN 1 
          WHEN
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) != uid THEN 1 
      ELSE 0 
      END  as cluster_start
 FROM CCE_Demo
),
-- get the unique groups
b as (
  SELECT
        browser_id
       , browser_visit_start
       , uid 
        , cluster_start
       , ROW_NUMBER() over (partition by browser_id order by browser_visit_start) as cluster_id
   FROM a
   WHERE  cluster_start=1
   ORDER BY browser_id, browser_visit_start, uid  
)
-- assign end dates to the groups, set last group's end date to 2100-01-01 
   SELECT
        browser_id
       , browser_visit_start
       , uid
       , cluster_id
       , COALESCE(
             LEAD(browser_visit_start) over
              (partition by browser_id order by browser_visit_start asc),
             '2100-01-01') as end_date
         FROM b
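The LEAD/COALESCE step amounts to: each cluster ends where the next one starts, and the last cluster gets a far-future sentinel. A quick Python sketch with made-up start dates:

```python
def add_end_dates(start_dates, sentinel="2100-01-01"):
    # LEAD(browser_visit_start) with COALESCE: the end date of each
    # cluster is the next cluster's start date, or the sentinel when
    # there is no next cluster.
    return [start_dates[i + 1] if i + 1 < len(start_dates) else sentinel
            for i in range(len(start_dates))]

print(add_end_dates(["2017-09-01", "2017-09-03", "2017-09-05"]))
# ['2017-09-03', '2017-09-05', '2100-01-01']
```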

 

4) Use the date ranges (between browser_visit_start and end_date) to designate the group:

-- identify the first record in each group
WITH a as (
 SELECT 
      browser_id
     , browser_visit_start
     , uid 
      , CASE 
          WHEN 
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) IS NULL THEN 1 
          WHEN 
             LAG(uid) over (partition by browser_id order by browser_visit_start asc) != uid THEN 1 
      ELSE 0 
      END  as cluster_start
 FROM CCE_Demo
),
-- get the unique groups
b as (
  SELECT
        browser_id
       , browser_visit_start
       , uid 
        , cluster_start
       , ROW_NUMBER() over (partition by browser_id order by browser_visit_start) as cluster_id
   FROM a
   WHERE  cluster_start=1
   ORDER BY browser_id, browser_visit_start, uid  
),
-- assign end dates to the groups, set last group's end date to 2100-01-01 
c as
(
   SELECT 
        browser_id
       , browser_visit_start
       , uid
       , cluster_id
       , COALESCE(
             LEAD(browser_visit_start) over 
              (partition by browser_id order by browser_visit_start asc),
             '2100-01-01') as end_date 
         FROM b
)
-- use the window between browser_visit_start and end_date to assign the correct group to each record
SELECT
    d.browser_id
    , d.browser_visit_start
    , d.uid
    , cluster_id
FROM CCE_Demo d
LEFT OUTER JOIN c
    ON d.browser_id=c.browser_id
    AND d.uid=c.uid
    AND d.browser_visit_start>=c.browser_visit_start
    AND d.browser_visit_start<c.end_date
ORDER BY 
    d.browser_id, d.browser_visit_start 
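As a sanity check, the four steps can be chained end to end in plain Python on made-up rows (the dates and uids are illustrative); the result should match CONDITIONAL_CHANGE_EVENT's numbering shifted to start at 1:

```python
# (browser_visit_start, uid) pairs for a single browser_id, in visit order.
rows = [
    ("2017-09-01", 10137196), ("2017-09-02", 10137196),
    ("2017-09-03", 15479000), ("2017-09-04", 15479000),
    ("2017-09-05", 10137196), ("2017-09-06", 10137196),
]

# Steps 1-2: LAG flags the first row of each cluster; ROW_NUMBER numbers them.
starts = [ts for i, (ts, uid) in enumerate(rows)
          if i == 0 or rows[i - 1][1] != uid]
# Step 3: LEAD/COALESCE turns the starts into [start, end) date windows.
windows = [(ts, starts[i + 1] if i + 1 < len(starts) else "2100-01-01", i + 1)
           for i, ts in enumerate(starts)]
# Step 4: range-join each visit back to its window to pick up cluster_id.
clusters = [next(cid for s, e, cid in windows if s <= ts < e)
            for ts, uid in rows]
print(clusters)  # [1, 1, 2, 2, 3, 3]
```

ISO-formatted date strings sort lexicographically, which is why the range check works here without any date parsing.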

 

  

Other things we could have done

The real data this example was based on lives in a table with 250 million rows, so taking advantage of scalable query processing in an MPP database was desirable. Alternatives we considered instead:

1) Build a custom user-defined function (using JavaScript and a FOR loop).
2) Export the data to a language like Python or Java and use a FOR loop.

 
