Counting Days: Time Intervals in SQL

https://xpil.eu/JyiUi

Some time ago, a colleague from the desk next to mine (virtual desk, since everyone is working from home, but a desk is a desk nevertheless) reached out to me for help. He had been struggling with an issue for three days, and I enjoy such puzzles. The problem turned out to be interesting enough that I decided to devote my full attention to it and make a blog entry about it, which you are now reading.

The topic seems trivial at first, but it quickly becomes more complicated.

The task is to calculate how many days each customer spent on various activities. In the simplest terms, we have a customer ID and the start and (optionally) end dates of the activity. Optionally, because some activities are still ongoing and have the end date set to NULL. In such cases, the current date should be used for calculations.

In reality, there are many more columns, but let’s not complicate things unnecessarily.

The catch lies in an additional condition: if a customer has more than one activity on a given day, that day is counted only once. So, for example, if customer A has three activities:

From January 1, 2024, to January 10, 2024
From January 5, 2024, to January 20, 2024
From February 1, 2024, to February 2, 2024

… the number of days should be calculated as:

The number of days from January 1 to January 20 (20 days)
Plus the number of days from February 1 to February 2 (2 days)
Total: 22 days

Simple, right? However, when we start writing the SQL code to calculate this, it quickly becomes evident that the task is not so simple. Especially since some customers have a lot of activities that can overlap in all possible ways. Also, the solution needs to be efficient because the data involves millions of records.

The simplest approach would be to construct a calendar table containing all dates and then count the unique dates within each customer’s activities. However, this approach, while logically correct and relatively simple, does not perform well on large datasets.

Therefore, we need to buckle down and…

We’ll start with the easy part, replacing all NULLs in the end dates with the current date:

WITH DateAdjusted AS (
    SELECT customer_id,
           start_date,
           CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date
      FROM activity
)
SELECT * FROM DateAdjusted;

What next?

Let’s think: we need to group each customer’s activities into overlapping groups, then find the earliest start date and the latest end date within each group, and count the days between these dates. Finally, sum the results from all groups for each customer.

Let’s take the example from the beginning of this text:

From January 1, 2024, to January 10, 2024 (GROUP 1)
From January 5, 2024, to January 20, 2024 (GROUP 1)
From February 1, 2024, to February 2, 2024 (GROUP 2)

The first two entries belong to the same group because they overlap. The third entry forms a separate group because there is a gap between the previous activity (nothing happened between January 21 and January 31). Therefore, the first two activities are assigned to group 1, and the last one to group 2. Now we just need to count the days between January 1 and January 20, add the two days from February, and voila.

WITH DateAdjusted AS (
    SELECT customer_id,
           start_date,
           CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date
      FROM activity
)
, FlaggedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           CASE
               WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0
               ELSE 1
           END AS NewGroup
      FROM DateAdjusted
)
SELECT * FROM FlaggedIntervals;

The subquery FlaggedIntervals marks activities starting after a break with a one. The other activities are marked with a zero. An interesting element here is the LAG operator in line #12:

LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date)

This operator returns the value of a given column (here: end_date) from the previous record, optionally partitioned by some column. Hence, in the OVER clause, we have both the PARTITION BY and ORDER BY elements.

The whole line looks like this:

WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0

This means that if the end_date of the previous record is greater than the start_date of the current record, we set the NewGroup column to 0 (because the activities overlap), otherwise to 1 (the previous activity ended before the current one started). A beneficial side effect is that the first activity in each group is marked with a one (since the previous record does not exist, it cannot generate a zero, so the ELSE block is triggered).

In the next step, we calculate a running total from these ones:

WITH DateAdjusted AS (
    SELECT customer_id,
           start_date,
           CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date
      FROM activity
)
, FlaggedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           CASE
               WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0
               ELSE 1
           END AS NewGroup
      FROM DateAdjusted
)
, GroupedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID
      FROM FlaggedIntervals
)
SELECT * FROM GroupedIntervals;

Here, the interesting part is line #21:

SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID

The SUM ... OVER ... operator is one of the underrated gems of SQL. It was first implemented in SQL Server 2005. In this particular case, for each record, it calculates the sum of all NewGroup values from the first record in the partition to the current record (note the absence of a GROUP BY operator in this subquery - the summation operation runs continuously for each record within the current customer).

The result of this subquery is very similar to the previous one, but instead of zeros and ones in the NewGroup column, we get consecutively numbered groups in the GroupID column.

Now that we have activities grouped, we can determine the earliest start date and the latest end date in each group:

WITH DateAdjusted AS (
    SELECT customer_id,
           start_date,
           CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date
      FROM activity
)
, FlaggedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           CASE
               WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0
               ELSE 1
           END AS NewGroup
      FROM DateAdjusted
)
, GroupedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID
      FROM FlaggedIntervals
)
, MergedIntervals AS (
    SELECT customer_id,
           MIN(start_date) AS start_date,
           MAX(end_date) AS end_date
      FROM GroupedIntervals
     GROUP BY customer_id, GroupID
)
SELECT * FROM MergedIntervals;

The MergedIntervals subquery returns the list of start and end dates for each group of overlapping activities.

Time for the final step:

WITH DateAdjusted AS (
    SELECT customer_id,
           start_date,
           CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date
      FROM activity
)
, FlaggedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           CASE
               WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0
               ELSE 1
           END AS NewGroup
      FROM DateAdjusted
)
, GroupedIntervals AS (
    SELECT customer_id,
           start_date,
           end_date,
           SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID
      FROM FlaggedIntervals
)
, MergedIntervals AS (
    SELECT customer_id,
           MIN(start_date) AS start_date,
           MAX(end_date) AS end_date
      FROM GroupedIntervals
     GROUP BY customer_id, GroupID
)
SELECT customer_id,
       SUM(DATEDIFF(day, start_date, end_date) + 1) AS total_days
  FROM MergedIntervals
 GROUP BY customer_id;

And why the +1 when summing? Because when subtracting two dates, you need to add one to get the correct count. For example, (February 2) minus (February 1) returns 1, but that’s actually two days.

In this way, we have created a query that does exactly what we need, and it does so quite quickly: on a million records with an average of about ten activities per customer, the above code runs in less than two seconds on my system.

If any readers have encountered a similar problem and managed to solve it in a less complicated way, I would be happy to hear about it.

https://xpil.eu/JyiUi

Dodaj komentarz Anuluj pisanie odpowiedzi