Skip to content

From SQL to SPL:Calculate the number of intersections between adjacent subsets after grouping

esProcSPL edited this page Apr 30, 2025 · 1 revision

A certain database table records the execution status of a project, with multiple people participating in the project every day, and one person can participate in multiple tasks of the project in one day.

EMP_ID EMP_DATE EMP_TASK
A1 04-01-2024 345
A2 04-01-2024 546
A3 04-01-2024 232
A4 04-01-2024 8000
A5 04-01-2024 2344
A1 04-02-2024 456
A2 04-02-2024 9280
A3 04-02-2024 324
A2 04-02-2024 754
A8 04-02-2024 75
A2 04-03-2024 400
A3 04-03-2024 234
A3 04-04-2024 100

Now we need to calculate: how many people have also participated in the project the previous day for every day. That is, after grouping by date, calculate the number of intersections between daily and previous day's personnel. The first day is special, assuming that everyone participated in the project the previous day.

EMP_DATE COUNT
04-01-2024 5
04-02-2024 3
04-03-2024 2
04-04-2024 1

We can first group by date, and then perform intersection operations on the grouped subsets, which requires retaining the subsets after grouping. But after SQL grouping, it must aggregate immediately, and subsets cannot be retained, and calculating intersection is also impossible. It needs to do it another way, to group by person first and determine whether each person appeared on a certain date and the previous day, and then group and aggregate these dates, it involves multiple layers of nesting and window functions, which is very troublesome.

After grouping in SPL, subsets can be retained and adjacent subsets can be referenced. Code can be written directly according to the idea.

A
1 =dbConn.query("select distinct EMP_ID,EMP_DATE from tb")
2 =A1.group(EMP_DATE)
3 =A2.new(EMP_DATE,if(#==1,~.len(),(~.(EMP_ID)^~[-1].(EMP_ID)).len()):COUNT)

A1: Load data from the database and deduplicate EMP-ID.

A2: Group by date, but do not aggregate.

A3: Create a new two-dimensional table based on the grouping results. If the current group is Group 1, directly return the number of members in the group; If it is not the first group, then calculate the intersection of the EMP-IDs of the current group and the previous group, and then calculate the number of members. ^ is used for calculating intersection, [-1] represents the previous group.

Clone this wiki locally