Cookbook¶

The following cookbook examples assume the following setup

>>> import staircase as sc
>>> import pandas as pd
>>> import numpy as np

General recipes¶

DataFrame - groupby - apply -> Series¶

Given a pandas dataframe, whose columns include

arguments for staircase.Stairs.layer()
one, or more, categorical columns which group step functions

a pandas Series of Stairs instances, indexed by groupbys can be obtained like so:

>>> df = sc.make_test_data(groups=list("abc"))
>>> df.groupby("group").apply(sc.Stairs, "start", "end", "value")
group
a    <staircase.Stairs, id=1931375056736>
b    <staircase.Stairs, id=1931366000144>
c    <staircase.Stairs, id=1931373976352>
dtype: object

Merging overlapping events¶

Suppose a collection of events is defined by start times, and end times, and we wish to merge overlapping events.

>>> # dummy data
>>> starts = np.sort(np.random.uniform(0, 100, 40))
>>> events = pd.DataFrame(
...     {
...         "start":starts,
...         "end": starts + np.random.uniform(0, 4, 40),
...     }
... )

>>> events.head()
start       end
0  0.174828  1.538377
1  0.636105  2.492274
2  2.251498  5.173393
3  5.596381  8.660455
4  7.900132  9.360358

>>> merged_events = (
...     sc.Stairs(events, "start", "end")
...     .make_boolean()
...     .to_frame()
...     .query("value == 1")
...     .drop(columns="value")
... )

>>> merged_events.head()
start        end
1   0.174828   5.173393
3   5.596381   9.360358
5   16.99004  20.552528
7  22.574905  27.741393
9  30.857052  31.043336

Merging overlapping events with gap below threshold¶

A variant of the above problem, suppose a collection of events is defined by start times, and end times, and we wish to merge overlapping events, or events where the gap between them is less than a certain threshold.

>>> # dummy data
>>> starts = np.sort(np.random.uniform(0, 100, 40))
>>> events = pd.DataFrame(
...     {
...         "start":starts,
...         "end": starts + np.random.uniform(0, 4, 40),
...     }
... )

>>> events.head()
start       end
0  0.174828  1.538377
1  0.636105  2.492274
2  2.251498  5.173393
3  5.596381  8.660455
4  7.900132  9.360358

>>> threshold = 1

>>> merged_events = (
>>>     sc.Stairs(events, "start", "end")
>>>     .make_boolean()
>>>     .to_frame()
>>>     .iloc[1:-1]
>>>     .eval("duration = end - start")
>>>     .query("value == 1 or duration < @threshold")
>>>     .pipe(sc.Stairs, "start", "end")
>>>     .to_frame()
>>>     .query("value == 1")
>>>     .drop(columns="value")
>>> )

>>> merged_events.head()
start        end
1   0.174828   9.360358
3   16.99004  20.552528
5  22.574905  27.741393
7  30.857052  37.126433
9  38.199949   43.50357

Fill undefined intervals of one step function with another¶

staircase.Stairs.fillna() allows undefined intervals in a step function to be redefined (i.e. “filled”) with a number. This recipe shows a simple one-liner which fills the undefined values of step function a with the values of step function b.

>>> # test data
>>> def gen_test_step_function(seed):
...     return (
...         sc.make_test_data(dates=False, seed=seed)
...         .pipe(sc.Stairs, "start", "end")
...     )
...
>>> a = gen_test_step_function(0).mask((20,30)).mask((80,90))
>>> b = gen_test_step_function(1)

>>> # recipe
>>> result = a.fillna(0) + b.where(a.isna()).fillna(0)

>>> # plot
>>> fig, axes = plt.subplots(ncols=3, figsize=(8,3), sharex=True, sharey=True)
>>> a.plot(axes[0])
>>> axes[0].set_title("a")
>>> b.plot(axes[1])
>>> axes[1].set_title("b")
>>> result.plot(axes[2])
>>> axes[2].set_title("result")

Stitch two step functions together at a point¶

>>> # test data
>>> def gen_test_step_function(seed):
...     return (
...         sc.make_test_data(dates=False, seed=seed)
...         .pipe(sc.Stairs, "start", "end")
...     )
...
>>> a = gen_test_step_function(0).mask((20,30))
>>> b = gen_test_step_function(1).mask((80,90))
>>> stitch_point = 50

>>> # recipe

>>> # record undefined intervals
>>> a_isna = a.isna().clip(None, stitch_point).fillna(0)
>>> b_isna = b.isna().clip(stitch_point, None).fillna(0)

>>> # stitch together
>>> stitched = (
...     a.clip(None, stitch_point).fillna(0).mask(a_isna)
...     +
...     b.clip(stitch_point, None).fillna(0).mask(b_isna)
... )

>>> # plot
>>> fig, axes = plt.subplots(ncols=3, figsize=(8,3), sharex=True, sharey=True)
>>> a.plot(axes[0])
>>> axes[0].set_title("a")
>>> b.plot(axes[1])
>>> axes[1].set_title("b")
>>> stitched.plot(axes[2])
>>> axes[2].set_title("stitched")

Datetime recipes¶

Convert step function to time series¶

Suppose we have a step function sf that we want to convert to a pandas.Series representing a timeseries. In this recipe, we calculate a time series from the daily means and set the index of the Series to be the a pandas.DatetimeIndex.

>>> sf = sc.make_test_data().pipe(sc.Stairs, "start", "end")
>>> days = pd.period_range("2021", periods=365, freq="D")
>>> time_series = sf.slice(days).mean()
>>> time_series.index = days

Step function representing weekends¶

In this recipe we’ll create a boolean valued step function which is 1 whenever it is a weekend in 2021, and 0 otherwise. Note, the first Saturday in 2021 was the 2nd of January.

>>> saturdays = pd.date_range("2021-01-02", "2022", freq="7D", closed="left")
>>> mondays = saturdays + pd.Timedelta(2, "day")
>>> weekend_stairs = sc.Stairs(start=saturdays, end=mondays)
>>> weekend_stairs.plot()

Step function representing 9am to 5pm every day¶

In this recipe we’ll create a boolean valued step function which is 1 whenever it is between 9am and 5pm (in 2021), and 0 otherwise.

nine_am = pd.date_range("2021-1-1 09:00", "2022", closed="left")
five_pm = pd.date_range("2021-1-1 17:00", "2022", closed="left")
nine_five_stairs = sc.Stairs(start=nine_am, end=five_pm)

Step function representing business hours¶

In the previous two recipes we created

a step function weekend_stairs which was 1 during weekends, and 0 otherwise
a step function nine_five_stairs which was 1 between 9am to 5pm, and 0 otherwise

If we assume business hours are 9am to 5am, on weekdays then the desired step function is achieved with any of the four calculations:

business_hours_stairs = nine_five_stairs.mask(weekends)

business_hours_stairs = nine_five_stairs.where(~weekends)

business_hours_stairs = nine_five_stairs * ~weekends

business_hours_stairs = nine_five_stairs & ~weekends

Success rates over time¶

Suppose we have a set of events, associated with a time and a boolean (success or not). This recipe creates a step function which represents average success rate over time, calculated over 1000 events occurring during the year 2021.

>>> # test data
... def gen_success_rates():
...     arr = np.array([])
...     for i in range(10):
...         av_success_rate = np.random.uniform()
...         arr = np.append(
...             arr,
...             np.random.choice([False, True], 100, p=[1-av_success_rate, av_success_rate]),
...         )
...     return arr
...
>>> times = (
...     pd.Timestamp("2021") +
...     pd.Series(np.random.randint(0,365*24, 1000)).apply(pd.Timedelta, unit="H")
... )
...
>>> events = pd.DataFrame(
...     {
...         "time": np.sort(times),
...         "success": gen_success_rates(),
...     }
... )

>>> # recipe
>>> count_successful = sc.Stairs(events.query("success == 1"), start="time")
>>> count_all = sc.Stairs(events, start="time")
>>> success_rate = count_successful/count_all
>>> success_rate.plot()

Average over time¶

The following recipe is a generalisation of the above recipe for success rates, and does not introduce anything fundamentally new.

Suppose we have a set of events, associated with a time and a number. This recipe creates a step function which represents the average over time, calculated over 1000 events occurring during the year 2021.

>>> # test data
>>> rng = np.random.default_rng(seed=0)  # seed random number generator
>>> def gen_values():
...    arr = np.array([])
...    for i in range(10):
...        bound = rng.integers(0,100)
...        bounds = (bound, 100) if bound < 50 else (0, bound)
...        arr = np.append(
...            arr,
...            rng.integers(*bounds, 100)
...        )
...    return arr
...
>>> times = (
...     pd.Timestamp("2021") +
...     pd.Series(rng.integers(0,365*24, 1000)).apply(pd.Timedelta, unit="H")
... )
...
>>> events = pd.DataFrame(
...     {
...         "time": np.sort(times),
...         "value": gen_values(),
...     }
... )

>>> # recipe
>>> sum_over_time = sc.Stairs(events, start="time", value="value")
>>> count_over_time = sc.Stairs(events, start="time")
>>> average_over_time = sum_over_time/count_over_time
>>> average_over_time.plot()

Rolling average over time (trailing window n events)¶

Suppose we have a set of events, associated with a time and a number. This recipe creates a step function which represents the rolling average over time, calculated over 1000 events occurring during the year 2021. The rolling average is calculated with a trailing window which averages the latest n events. In the recipe below n = 50.

>>> # test data
>>> rng = np.random.default_rng(seed=0)  # seed random number generator
>>> def gen_values():
...    arr = np.array([])
...    for i in range(10):
...        bound = rng.integers(0,100)
...        bounds = (bound, 100) if bound < 50 else (0, bound)
...        arr = np.append(
...            arr,
...            rng.integers(*bounds, 100)
...        )
...    return arr
...
>>> times = (
...     pd.Timestamp("2021") +
...     pd.Series(rng.integers(0,365*24, 1000)).apply(pd.Timedelta, unit="H")
... )
...
>>> events = pd.DataFrame(
...     {
...         "time": np.sort(times),
...         "value": gen_values(),
...     }
... )

>>> # recipe
>>> n = 50
>>> end = events["time"].shift(-n)
>>> rolling_sum_over_time = sc.Stairs(events, start="time", end=end, value="value")
>>> rolling_count_over_time = sc.Stairs(events, start="time", end=end)
>>> rolling_average_over_time = rolling_sum_over_time/rolling_count_over_time
>>> rolling_average_over_time.plot()

Rolling average over time (trailing window, time based)¶

Suppose we have a set of events, associated with a time and a number. This recipe creates a step function which represents the rolling average over time, calculated over 1000 events occurring during the year 2021. The rolling average is calculated with a trailing time based window. The window in the recipe below is 28 days.

>>> # test data
>>> rng = np.random.default_rng(seed=0)  # seed random number generator
>>> def gen_values():
...    arr = np.array([])
...    for i in range(10):
...        bound = rng.integers(0,100)
...        bounds = (bound, 100) if bound < 50 else (0, bound)
...        arr = np.append(
...            arr,
...            rng.integers(*bounds, 100)
...        )
...    return arr
...
>>> times = (
...     pd.Timestamp("2021") +
...     pd.Series(rng.integers(0,365*24, 1000)).apply(pd.Timedelta, unit="H")
... )
...
>>> events = pd.DataFrame(
...     {
...         "time": np.sort(times),
...         "value": gen_values(),
...     }
... )

>>> # recipe
>>> end = events["time"] + pd.Timedelta(28, "D")
>>> rolling_sum_over_time = sc.Stairs(events, start="time", end=end, value="value")
>>> rolling_count_over_time = sc.Stairs(events, start="time", end=end)
>>> rolling_average_over_time = rolling_sum_over_time/rolling_count_over_time
>>> rolling_average_over_time.plot()

Frequently asked questions

API reference