Some time ago, a colleague from the desk next to mine (virtual desk, since everyone is working from home, but a desk is a desk nevertheless) reached out to me for help. He had been struggling with an issue for three days, and I enjoy such puzzles. The problem turned out to be interesting enough that I decided to devote my full attention to it and make a blog entry about it, which you are now reading.
The topic seems trivial at first, but it quickly becomes more complicated.
The task is to calculate how many days each customer spent on various activities. In the simplest terms, we have a customer ID and the start and (optionally) end dates of the activity. Optionally, because some activities are still ongoing and have the end date set to NULL. In such cases, the current date should be used for calculations.
In reality, there are many more columns, but let’s not complicate things unnecessarily.
The catch lies in an additional condition: if a customer has more than one activity on a given day, that day is counted only once. So, for example, if customer A has three activities:
- From January 1, 2024, to January 10, 2024
- From January 5, 2024, to January 20, 2024
- From February 1, 2024, to February 2, 2024
… the number of days should be calculated as:
- The number of days from January 1 to January 20 (20 days)
- Plus the number of days from February 1 to February 2 (2 days)
- Total: 22 days
Simple, right? However, when we start writing the SQL code to calculate this, it quickly becomes evident that the task is not so simple. Especially since some customers have a lot of activities that can overlap in all possible ways. Also, the solution needs to be efficient because the data involves millions of records.
The simplest approach would be to construct a calendar table containing all dates and then count the unique dates within each customer’s activities. However, this approach, while logically correct and relatively simple, does not perform well on large datasets.
Therefore, we need to buckle down and…
We’ll start with the easy part, replacing all NULLs in the end dates with the current date:
WITH DateAdjusted AS ( SELECT customer_id, start_date, CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date FROM activity ) SELECT * FROM DateAdjusted;
What next?
Let’s think: we need to group each customer’s activities into overlapping groups, then find the earliest start date and the latest end date within each group, and count the days between these dates. Finally, sum the results from all groups for each customer.
Let’s take the example from the beginning of this text:
- From January 1, 2024, to January 10, 2024 (GROUP 1)
- From January 5, 2024, to January 20, 2024 (GROUP 1)
- From February 1, 2024, to February 2, 2024 (GROUP 2)
The first two entries belong to the same group because they overlap. The third entry forms a separate group because there is a gap between the previous activity (nothing happened between January 21 and January 31). Therefore, the first two activities are assigned to group 1, and the last one to group 2. Now we just need to count the days between January 1 and January 20, add the two days from February, and voila.
WITH DateAdjusted AS ( SELECT customer_id, start_date, CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date FROM activity ) , FlaggedIntervals AS ( SELECT customer_id, start_date, end_date, CASE WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0 ELSE 1 END AS NewGroup FROM DateAdjusted ) SELECT * FROM FlaggedIntervals;
The subquery FlaggedIntervals
marks activities starting after a break with a one. The other activities are marked with a zero. An interesting element here is the LAG
operator in line #12:
LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date)
This operator returns the value of a given column (here: end_date
) from the previous record, optionally partitioned by some column. Hence, in the OVER
clause, we have both the PARTITION BY
and ORDER BY
elements.
The whole line looks like this:
WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0
This means that if the end_date
of the previous record is greater than the start_date
of the current record, we set the NewGroup
column to 0 (because the activities overlap), otherwise to 1 (the previous activity ended before the current one started). A beneficial side effect is that the first activity in each group is marked with a one (since the previous record does not exist, it cannot generate a zero, so the ELSE
block is triggered).
In the next step, we calculate a running total from these ones:
WITH DateAdjusted AS ( SELECT customer_id, start_date, CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date FROM activity ) , FlaggedIntervals AS ( SELECT customer_id, start_date, end_date, CASE WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0 ELSE 1 END AS NewGroup FROM DateAdjusted ) , GroupedIntervals AS ( SELECT customer_id, start_date, end_date, SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID FROM FlaggedIntervals ) SELECT * FROM GroupedIntervals;
Here, the interesting part is line #21:
SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID
The SUM ... OVER ...
operator is one of the underrated gems of SQL. It was first implemented in SQL Server 2005. In this particular case, for each record, it calculates the sum of all NewGroup
values from the first record in the partition to the current record (note the absence of a GROUP BY
operator in this subquery - the summation operation runs continuously for each record within the current customer).
The result of this subquery is very similar to the previous one, but instead of zeros and ones in the NewGroup
column, we get consecutively numbered groups in the GroupID
column.
Now that we have activities grouped, we can determine the earliest start date and the latest end date in each group:
WITH DateAdjusted AS ( SELECT customer_id, start_date, CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date FROM activity ) , FlaggedIntervals AS ( SELECT customer_id, start_date, end_date, CASE WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0 ELSE 1 END AS NewGroup FROM DateAdjusted ) , GroupedIntervals AS ( SELECT customer_id, start_date, end_date, SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID FROM FlaggedIntervals ) , MergedIntervals AS ( SELECT customer_id, MIN(start_date) AS start_date, MAX(end_date) AS end_date FROM GroupedIntervals GROUP BY customer_id, GroupID ) SELECT * FROM MergedIntervals;
The MergedIntervals
subquery returns the list of start and end dates for each group of overlapping activities.
Time for the final step:
WITH DateAdjusted AS ( SELECT customer_id, start_date, CONVERT(date, COALESCE(end_date, GETDATE())) AS end_date FROM activity ) , FlaggedIntervals AS ( SELECT customer_id, start_date, end_date, CASE WHEN LAG(end_date) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) >= start_date THEN 0 ELSE 1 END AS NewGroup FROM DateAdjusted ) , GroupedIntervals AS ( SELECT customer_id, start_date, end_date, SUM(NewGroup) OVER (PARTITION BY customer_id ORDER BY start_date, end_date) AS GroupID FROM FlaggedIntervals ) , MergedIntervals AS ( SELECT customer_id, MIN(start_date) AS start_date, MAX(end_date) AS end_date FROM GroupedIntervals GROUP BY customer_id, GroupID ) SELECT customer_id, SUM(DATEDIFF(day, start_date, end_date) + 1) AS total_days FROM MergedIntervals GROUP BY customer_id;
And why the +1 when summing? Because when subtracting two dates, you need to add one to get the correct count. For example, (February 2) minus (February 1) returns 1, but that’s actually two days.
In this way, we have created a query that does exactly what we need, and it does so quite quickly: on a million records with an average of about ten activities per customer, the above code runs in less than two seconds on my system.
If any readers have encountered a similar problem and managed to solve it in a less complicated way, I would be happy to hear about it.
Jeżeli chcesz do komentarza wstawić kod, użyj składni:
[code]
tutaj wstaw swój kod
[/code]
Jeżeli zrobisz literówkę lub zmienisz zdanie, możesz edytować komentarz po jego zatwierdzeniu.