Case study: interval manipulation

This tutorial illustrates how staircase can be applied to manipulate intervals. The ideas demonstrated below form the basis for piso, a python package providing set operations for pandas interval classes.

Intervals are not explicitly modelled by staircase, unlike pandas, but we can map intervals to step functions (and vice versa) and staircase can certainly handle step functions! There are several mappings from intervals to step functions which seem reasonable, however the most useful for our purposes will be one where, for a set of intervals, the corresponding step function is f(x) = 1 if the point x belongs to one of the intervals and 0 otherwise. Note that this is only a one-to-one mapping if the intervals, in each set, are disjoint and non-adjacent.

To start, we’ll consider an example working with intervals [3,6) and [5,7), and create corresponding step functions (i.e. staircase.Stairs) a and b respectively.

In [1]: import pandas as pd

In [2]: import staircase as sc

In [3]: a = sc.Stairs(start=3, end=6, closed="left")

In [4]: b = sc.Stairs(start=5, end=7, closed="left")

The following function will be useful to map back to intervals:

In [5]: def stairs_to_intervals(stairs):
   ...:     return (
   ...:         stairs
   ...:         .to_frame()
   ...:         .query("value == 1")
   ...:         .drop(columns="value")
   ...:     )
   ...: 

These intervals clearly overlap between 5 and 6, and we can create the union, with the step functions, using a “logical or” operation:

In [6]: result = a | b  # result will be a Stairs instance

In [7]: stairs_to_intervals(result)
Out[7]: 
  start end
1     3   7

Similarly, the intersection can be calculated with a “logical and”:

In [8]: result = a & b  # result will be a Stairs instance

In [9]: stairs_to_intervals(result)
Out[9]: 
  start end
1     5   6

For set difference, we can literally translate “in a, and not in b” with staircase.Stairs like so:

In [10]: result = a & ~b  # result will be a Stairs instance

In [11]: stairs_to_intervals(result)
Out[11]: 
  start end
1     3   5

The same result for set difference can be achieved by using b as a mask (which creates undefined values where b in non-zero) then filling the undefined values of the step function with 0. It can also be achieved by multiplying instead of using staircase.Stairs.logical_and():

In [12]: result = a.mask(b).fillna(0)

or

In [13]: result = a * ~b

Let’s say we want to create the union of many intervals. There are two approaches, the first of which is to create a step function for every interval and repeatedly apply the staircase.Stairs.logical_or() function:

In [14]: from functools import reduce

In [15]: intervals = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4), (5, 7)])

In [16]: result = reduce(sc.Stairs.logical_or, [sc.Stairs(start=i.left, end=i.right) for i in intervals])

In [17]: stairs_to_intervals(result)
Out[17]: 
  start end
1     0   4
3     5   7

The second approach is to add the step functions together (and values where the intervals overlap will have values of 2 or more), and then set non-zero values of the step function to 1. This approach is favourable as it allows us to pass in vectors of start and end values for the intervals into the constructor of a single staircase.Stairs object.

In [18]: result1 = sc.Stairs(start=intervals.left, end=intervals.right)

In [19]: result1.to_frame()
Out[19]: 
  start  end  value
0  -inf    0      0
1     0    2      1
2     2    3      2
3     3    4      1
4     4    5      0
5     5    7      1
6     7  inf      0

From here we can just use a relational operator to get a boolean valued step function:

In [20]: stairs_to_intervals(result1 > 0)
Out[20]: 
  start end
1     0   4
3     5   7

We could have also used result1 != 0 or result1.make_boolean() to convert to the binary valued step function.

Converting this step function back to an interval array can be done by using the start and end columns of the dataframe as arguments to pandas.arrays.IntervalArray.from_arrays().