Case study: queue analysis

This case study illustrates the use of the staircase package for queue analysis. In this example vessels (i.e. ships) arrive offshore and await their turn to enter a harbour where they will be loaded with cargo. We will examine the queue, which is composed of all vessels which is offshore but yet to enter the harbour, for the year 2020.

We begin by importing the queue data into a pandas.DataFrame where each row corresponds to a vessel. The first column gives the time at which the vessel arrives offshore (enters the queue), and the second column gives the time at which the vessel enters the harbour (leaves the queue). A NaT value in either of these columns indicates the vessel entered the queue prior to 2020, or left the queue after 2020, however this approach does not require these values to be NaT. The third column gives the weight, in tonnes, of cargo destined for the vessel. Note, for this approach we require every vessel, that was in the queue at some point in 2020, to appear in the dataframe.

In [1]: import pandas as pd

In [2]: import staircase as sc

In [3]: import matplotlib.pyplot as plt

In [4]: data = pd.read_csv(vessel_data, parse_dates=["enter", "leave"], dayfirst=True)

In [5]: data
Out[5]: 
                   enter               leave  tonnes
0                    NaT 2020-01-01 04:40:00  129000
1                    NaT 2020-01-01 22:18:00   69055
2                    NaT 2020-01-01 11:47:00  138000
3                    NaT 2020-01-02 10:12:00   84600
4                    NaT 2020-01-01 22:39:00  142550
...                  ...                 ...     ...
1224 2020-12-29 05:59:00                 NaT  142500
1225 2020-12-29 17:41:00                 NaT   84600
1226 2020-12-30 12:41:00                 NaT  119200
1227 2020-12-30 16:59:00                 NaT  113200
1228 2020-12-30 17:59:00                 NaT  142500

[1229 rows x 3 columns]

A step function, which quantifies the size of the queue, can be created with vector data corresponding to start and end times for the state being modelled. In this case, where the state is “in the queue”, the required data is the columns “enter” and “leave”. Note that since we want to examine 2020 we clip the step function at the year endpoints, making the functions undefined for any time outside of 2020 (see Gotchas for why this is a good idea).

In [6]: queue = sc.Stairs(frame=data, start="enter", end="leave")

In [7]: queue = queue.clip(pd.Timestamp("2020"), pd.Timestamp("2021"))

In [8]: queue.plot();
../_images/case_study_queue.png

From the chart it would appear the size of the queue at the beginning of 2020 is around 10 vessels. This can be confirmed by calling the step function with this time as an argument - a deliberate choice to emulate the notation for evaluating a function at a particular point:

In [9]: queue(pd.Timestamp("2020-01-01"))
Out[9]: 10.0

Assuming that no vessels arrive precisely at midnight on the 1st of Jan, we can expect the number of vessels in the queue at this time to be equal to the number of NaT values in the “enter” column:

In [10]: data["enter"].isna().sum()
Out[10]: 10

There are a number of ways we can characterise the distribution of the queue size values. The staircase.Stairs.hist() method creates histogram data as a pandas.Series, indexed by a pandas.IntervalIndex. If the bins parameter is not supplied then the method will use unit bins which cover the range of the step function values:

In [11]: queue_hist = queue.hist(stat="probability")

In [12]: queue_hist
Out[12]: 
[0, 1)      0.090418
[1, 2)      0.043205
[2, 3)      0.029688
[3, 4)      0.039870
[4, 5)      0.048759
[5, 6)      0.084548
[6, 7)      0.085024
[7, 8)      0.093790
[8, 9)      0.078144
[9, 10)     0.048905
[10, 11)    0.053679
[11, 12)    0.062805
[12, 13)    0.060775
[13, 14)    0.051981
[14, 15)    0.036320
[15, 16)    0.027256
[16, 17)    0.026677
[17, 18)    0.017746
[18, 19)    0.007252
[19, 20)    0.002284
[20, 21)    0.003338
[21, 22)    0.005897
[22, 23)    0.001510
[23, 24)    0.000127
dtype: float64

Given the queue length is integer variable the pandas.IntervalIndex can be replaced to produce a simpler plot:

In [13]: queue_hist.index = queue_hist.index.left

In [14]: ax = queue_hist.plot.bar()

In [15]: ax.set_xlabel("queue size", fontsize="12")
Out[15]: Text(0.5, 0, 'queue size')

In [16]: ax.set_ylabel("probability", fontsize="12")
Out[16]: Text(0, 0.5, 'probability')
../_images/case_study_queue_hist_bar.png

Another useful queue metric in the context of this case study is “queue tonnes”. This metric is calculated as the sum of the cargo tonnes destined for vessels in the queue. Since the queue is a step function, queue tonnes must also be a step function. The creation of a queue tonnes step function is similar to that used for the queue, but requires the vector of cargo tonnes to be used in construction. The value parameter, which defaults to a vector of ones, indicates how much the step function should increase or decrease whenever the corresponding vessels enter or leave the queue:

In [17]: queue_tonnes = sc.Stairs(frame=data, start="enter", end="leave", value="tonnes")

In [18]: queue_tonnes = queue_tonnes.clip(pd.Timestamp("2020"), pd.Timestamp("2021"))

In [19]: queue_tonnes.plot();
../_images/case_study_queue_tonnes.png

Before diving deeper into distributions we tackle a variety of miscellaneous questions for the purposes of demonstration.

What was the maximum queue tonnes in 2020?

In [20]: queue_tonnes.max()
Out[20]: 2138466.0

What was the average queue tonnes in 2020?

In [21]: queue_tonnes.mean()
Out[21]: 817852.320958561

What fraction of the year was the queue_tonnes larger than 1.5 million tonnes?

In [22]: (queue_tonnes > 1.5e6).mean()
Out[22]: 0.09867751973284761

What was the median queue tonnes for March?

In [23]: queue_tonnes.clip(pd.Timestamp("2020-3"), pd.Timestamp("2020-4")).median()
Out[23]: 1281300.0

What is the 80th percentile for queue tonnes? .. ipython:: python

queue_tonnes.percentile(80)

What is the 40th, 65th, 77th and 90th percentiles for queue tonnes?

In [24]: queue_tonnes.percentile([40,65,77,90])
Out[24]: array([ 659500., 1035400., 1220600., 1496800.])

Aside from being able to evaluate any number of percentiles, we can plot the “percentile function”:

In [25]: queue_tonnes.percentile.plot()
Out[25]: <AxesSubplot:>

The percentile function is essentially the inverse of an empirical cumulative distribution function. We can generate an ecdf for the step function values too:

In [26]: queue_tonnes.ecdf.plot()
Out[26]: <AxesSubplot:>
../_images/case_study_queue_ecdf.png

Earlier we found that the queue tonnes were strictly above 1.5Mt about 9.87% of the time. The ecdf can tell us what fraction of the time the queue tonnes was less than or equal to 1.5Mt. (We’d expect these results to add to 1).

In [27]: queue_tonnes.ecdf(1.5e6)
Out[27]: 0.9013224802671552

We can also then discover the fraction of time that 1Mt < queue tonnes <= 1.5Mt like so:

In [28]: queue_tonnes.ecdf(1.5e6) - queue_tonnes.ecdf(1e6)
Out[28]: 0.2707840012143298

This piece of data is tantamount to a single bin in histogram data. Earlier we used the staircase.Stairs.hist function on the queue length, with default bin sizes. Since the range of the queue tonnes values is much larger it will be useful to specify the bins (defined by the edges) to generate a histogram for the queue tonnes.

In [29]: queue_tonnes_hist = queue_tonnes.hist(bins = range(0, 2100000, 100000), stat="probability")

In [30]: queue_tonnes_hist
Out[30]: 
[0, 100000)           0.104360
[100000, 200000)      0.040817
[200000, 300000)      0.029093
[300000, 400000)      0.040346
[400000, 500000)      0.057899
[500000, 600000)      0.069479
[600000, 700000)      0.094911
[700000, 800000)      0.080645
[800000, 900000)      0.062583
[900000, 1000000)     0.050406
[1000000, 1100000)    0.059751
[1100000, 1200000)    0.064855
[1200000, 1300000)    0.066940
[1300000, 1400000)    0.045719
[1400000, 1500000)    0.033519
[1500000, 1600000)    0.027588
[1600000, 1700000)    0.023567
[1700000, 1800000)    0.014984
[1800000, 1900000)    0.016359
[1900000, 2000000)    0.011215
dtype: float64

As we did earlier for the queue length histogram, we can quickly create a chart with pandas Series plotting functions.

In [31]: ax = queue_tonnes_hist.plot.bar()

In [32]: ax.set_xlabel("queue size in tonnes", fontsize="12")
Out[32]: Text(0.5, 0, 'queue size in tonnes')

In [33]: ax.set_ylabel("probability", fontsize="12")
Out[33]: Text(0, 0.5, 'probability')

In [34]: plt.tight_layout()
../_images/case_study_queue_tonnes_hist_bar.png

Although there is plenty of analysis achieveable through step functions, it may be desirable to aggregate the data to a traditional time series - a daily mean for example. This can be achieved with SLICING

In [35]: day_range = pd.date_range("2020", freq="D", periods=366)

In [36]: queue_tonnes.slice(day_range).mean()
Out[36]: 
[2020-01-01, 2020-01-02)    1.100435e+06
[2020-01-02, 2020-01-03)    1.052939e+06
[2020-01-03, 2020-01-04)    7.316999e+05
[2020-01-04, 2020-01-05)    4.992351e+05
[2020-01-05, 2020-01-06)    5.410560e+05
                                ...     
[2020-12-26, 2020-12-27)    1.807232e+06
[2020-12-27, 2020-12-28)    1.729369e+06
[2020-12-28, 2020-12-29)    1.680614e+06
[2020-12-29, 2020-12-30)    1.870018e+06
[2020-12-30, 2020-12-31)    1.733728e+06
Length: 365, dtype: float64