Crowd detection evaluation#

This notebook aims at showing what kind of graph you can draw thank’s to Lours crowd detection evaluator, as special case of detection evaluation

[1]:
%load_ext autoreload

%autoreload 2
import warnings

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from lours.dataset import from_coco_keypoints
from lours.evaluation.detection import CrowdDetectionEvaluator as cde

warnings.simplefilter(action="ignore", category=FutureWarning)

Loading the dataset and the predictions#

Note that they are both treated as datasets at first, and only when creating the eval object we have a detection evaluator

As a second Note, you can add several prediction datasets at the same time

[2]:
crowd_gt = from_coco_keypoints(
    "../../test_lours/test_data/coco_eval/instances_crowd.json"
)
crowd_preds = from_coco_keypoints(
    "../../test_lours/test_data/coco_eval/instances_crowd_predictions.json"
)
evaluator = cde(groundtruth=crowd_gt, predictions=crowd_preds)
[3]:
evaluator

Computing count error metrics#

Here, we compute count error with absolute error (in number of person detected) and relative error (relative to the actual number of person).

  • Absolute error is computed with Mean Absolute Error (MAE), Root of Mean Square Error (RMSE), standard deviation of absolute error (std) and different quantiles of errors. the quantile at 50% (q0.50) is also known as the median of errors. If a model is well behaved, the minimum MAE should be at the same threshold as when the median of absolute error is 0.

  • Relative error is computed the same way. Mean Relative Error (MRE) and Root of Mean Square Relative Error (RMSRE) replace MAE and RMSE.

More formally, for a dataset \(D\) of \(N\) images, with each image \(I_i\) having a count \(c_{I_i}\), and the corresponding prediction count \(\widehat{c}(I_i, t)\) with a model \(\widehat{c}\) and a detection threshold \(t\). The count is thus the number of detections in a particular image such that their confidence if above the threshold \(t\).

We then get the following formulae:

\[MAE(\widehat{c}, t) = \frac{1}{N}\sum_{i = 0}^N\left|c_{I_i} - \widehat{c}(I_i, t)\right| = \mathbb{E}_{I \sim D}\left|c_I - \widehat{c}(I, t)\right|\]
\[MAE(\widehat{c}) = min_{t} MAE(\widehat{c}, t)\]

Similarly with other metrics :

\[MRE(\widehat{c}, t) = \frac{1}{N}\sum_{i = 0}^N\left|\frac{c_{I_i} - \widehat{c}(I_i, t)}{c_{I_i}}\right|\]
\[RMSE(\widehat{c}, t) = \sqrt{\frac{1}{N}\sum_{i = 0}^N\left(c_{I_i} - \widehat{c}(I_i, t)\right)^2}\]
\[RMSRE(\widehat{c}, t) = \sqrt{\frac{1}{N}\sum_{i = 0}^N\left(\frac{c_{I_i} - \widehat{c}(I_i, t)}{c_{I_i}}\right)^2}\]

For quantiles, we compute the quantiles of algebraic error / relative error distribution : for a share \(\alpha \in [0,1]\), we compute the count error \(e\) below which \(\alpha\) of the count predictions on images used for evaluation were for the pair \(\widehat{c}, t\)

\[q\alpha(\widehat{c}, t) = q(\widehat{c}, t, \alpha) = e \text{ such that } \mathop{\mathbb{E}}_{I \sim D}\delta_{\mathbb{R}^+}\left(c_{I} - \widehat{c}(I, t) - e\right) = \alpha\]
\[qR\alpha(\widehat{c}, t) = qR(\widehat{c}, t, \alpha) = e \text{ such that } \mathop{\mathbb{E}}_{I \sim D}\delta_{\mathbb{R}^+}\left(\frac{c_{I} - \widehat{c}(I, t)}{c_{I}} - e\right) = \alpha\]

where \(\delta_{\mathbb{R}^+}\) is the characteristic function of \(\mathbb{R}^+\)

\[\begin{split}\begin{array}{rccc}\delta_{\mathbb{R}^+}: & \mathbb{R} & \rightarrow & \{0,1\} \\ & x & \mapsto & \left\{\begin{array}{l} 1 \text{ if } x > 0 \\ 0 \text{ else} \end{array}\right. \end{array}\end{split}\]

In addition to these curves, we get the detailed errors, that is the error per image, per confidence threshold. This will help us doing statistics with seaborn.

[4]:
curves, detailed_errors = evaluator.compute_count_error(
    groups=(), quantiles=np.linspace(0.1, 0.9, 7)
)

Let’s display the relative and absolute error tables

[5]:
curves["relative"]
[5]:
MRE RMSRE std q0.10 q0.23 q0.37 q0.50 q0.63 q0.77 q0.90 model
confidence
0.00 0.649522 0.730219 0.335279 0.279273 0.413445 0.513316 0.613915 0.704724 0.806905 1.098196 predictions
0.01 0.649522 0.730219 0.335279 0.279273 0.413445 0.513316 0.613915 0.704724 0.806905 1.098196 predictions
0.02 0.649522 0.730219 0.335279 0.279273 0.413445 0.513316 0.613915 0.704724 0.806905 1.098196 predictions
0.03 0.649522 0.730219 0.335279 0.279273 0.413445 0.513316 0.613915 0.704724 0.806905 1.098196 predictions
0.04 0.649522 0.730219 0.335279 0.279273 0.413445 0.513316 0.613915 0.704724 0.806905 1.098196 predictions
... ... ... ... ... ... ... ... ... ... ... ...
0.96 0.537497 0.556097 0.142859 -0.699791 -0.651725 -0.598796 -0.549580 -0.504594 -0.426347 -0.333964 predictions
0.97 0.593507 0.611315 0.146721 -0.755856 -0.709297 -0.661932 -0.619557 -0.561183 -0.484848 -0.392799 predictions
0.98 0.667382 0.683765 0.149028 -0.836331 -0.781949 -0.749009 -0.697289 -0.647355 -0.561885 -0.459282 predictions
0.99 0.772887 0.786162 0.144101 -0.916777 -0.880620 -0.853753 -0.810717 -0.775103 -0.690856 -0.565698 predictions
1.00 1.000000 1.000000 0.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 predictions

101 rows × 11 columns

[6]:
curves["absolute"]
[6]:
MAE RMSE std q0.10 q0.23 q0.37 q0.50 q0.63 q0.77 q0.90 model
confidence
0.00 364.906667 531.806337 387.576233 60.0 111.000000 172.633333 238.0 337.833333 476.233333 796.5 predictions
0.01 364.906667 531.806337 387.576233 60.0 111.000000 172.633333 238.0 337.833333 476.233333 796.5 predictions
0.02 364.906667 531.806337 387.576233 60.0 111.000000 172.633333 238.0 337.833333 476.233333 796.5 predictions
0.03 364.906667 531.806337 387.576233 60.0 111.000000 172.633333 238.0 337.833333 476.233333 796.5 predictions
0.04 364.906667 531.806337 387.576233 60.0 111.000000 172.633333 238.0 337.833333 476.233333 796.5 predictions
... ... ... ... ... ... ... ... ... ... ... ...
0.96 299.643333 417.398734 291.063514 -629.6 -383.033333 -279.366667 -214.0 -158.266667 -108.766667 -65.9 predictions
0.97 336.140000 472.472765 332.579480 -699.3 -422.933333 -319.366667 -227.5 -174.266667 -119.533333 -72.9 predictions
0.98 382.430000 540.884285 383.134980 -784.3 -487.700000 -354.100000 -253.5 -193.633333 -136.533333 -82.4 predictions
0.99 444.793333 631.669259 449.263549 -933.4 -571.000000 -404.933333 -299.0 -228.266667 -165.000000 -94.9 predictions
1.00 542.356667 740.684888 505.286451 -1142.6 -697.933333 -499.100000 -378.0 -285.900000 -224.300000 -141.0 predictions

101 rows × 11 columns

And now the detailed errors table which is much larger, but can be used to reconstruct the aforementioned error tables

[7]:
detailed_errors
[7]:
count gt_count error rel_error abs_error abs_rel_error sq_error sq_rel_error model
image_id confidence
0 0.00 958.0 419 539.0 1.286396 539.0 1.286396 290521.0 1.654815 predictions
0.01 958.0 419 539.0 1.286396 539.0 1.286396 290521.0 1.654815 predictions
0.02 958.0 419 539.0 1.286396 539.0 1.286396 290521.0 1.654815 predictions
0.03 958.0 419 539.0 1.286396 539.0 1.286396 290521.0 1.654815 predictions
0.04 958.0 419 539.0 1.286396 539.0 1.286396 290521.0 1.654815 predictions
... ... ... ... ... ... ... ... ... ... ...
299 0.96 112.0 261 -149.0 -0.570881 149.0 0.570881 22201.0 0.325905 predictions
0.97 101.0 261 -160.0 -0.613027 160.0 0.613027 25600.0 0.375802 predictions
0.98 80.0 261 -181.0 -0.693487 181.0 0.693487 32761.0 0.480924 predictions
0.99 56.0 261 -205.0 -0.785441 205.0 0.785441 42025.0 0.616917 predictions
1.00 0.0 261 -261.0 -1.000000 261.0 1.000000 68121.0 1.000000 predictions

30300 rows × 9 columns

Using Seaborn with detailed errors#

The following graph can you grasp what quantiles are related to.

The graph shows an histogram of algebraic error counts throughout the whole dataset, for the confidence thresholds 0.5 and 0.3

[8]:
sns.histplot(
    detailed_errors.xs(0.5, level="confidence")["error"],
    cumulative=True,
    label="cumulative histogram",
    fill=False,
    stat="percent",
    element="step",
    color="red",
    alpha=0.5,
)
dist = sns.histplot(
    detailed_errors.xs(0.5, level="confidence")["error"],
    label="histogram",
)
ymin, ymax = dist.get_ylim()
quantiles = curves["absolute"].loc[0.5, ["q0.10", "q0.50", "q0.90"]]
for name, q in quantiles.items():
    plt.plot([q, q], [ymin, ymax], label=name, linestyle="dashed")
plt.ylim(ymin, ymax)
plt.grid(axis="y", color="0.95")
dist.set_axisbelow(True)
plt.yticks(range(0, 110, 10))
plt.legend()
plt.title("Algebraic count error distribution and quantiles for threshold $t = 0.5$")
plt.show()
../_images/notebooks_4_demo_evaluation_crowd_13_0.png

Next Cell will plot the same distribution but as a 2D heatmap, with y axis set to the threshold. Each vertical slice will give the graph above.

You can notice some image outliers where the detector consistently overestimate or underestimate the count by a very large margin, that might be interesting to check out.

[9]:
ax = sns.histplot(detailed_errors, x="confidence", y="error", bins=100, cbar=True)
quantiles = curves["absolute"][["q0.10", "q0.50", "q0.90"]]
quantiles.plot(ax=ax)
[9]:
<Axes: xlabel='confidence', ylabel='error'>
../_images/notebooks_4_demo_evaluation_crowd_15_1.png

Same 2D hist plot can be made with absolute error instead of algebraic error

[10]:
ax = sns.histplot(detailed_errors, x="confidence", y="abs_error", bins=100, cbar=True)
curves["absolute"][["MAE", "RMSE"]].plot(ax=ax)
[10]:
<Axes: xlabel='confidence', ylabel='abs_error'>
../_images/notebooks_4_demo_evaluation_crowd_17_1.png

Finally, we can simply use the lineplot function to plot the error distribution across thresholds. The “pi” error bar is for percentage interval. See more info in seaborn documentation

[11]:
sns.lineplot(
    detailed_errors, x="confidence", y="error", errorbar=("pi", 80), label="percentage"
)
sns.lineplot(detailed_errors, x="confidence", y="error", errorbar="sd", label="std")
plt.legend()
[11]:
<matplotlib.legend.Legend at 0x3198eb410>
../_images/notebooks_4_demo_evaluation_crowd_19_1.png

Manual plotting#

Here, we use directly matplotlib and pandas to plot the different curves.

MAE and RMSE give use info about lowest possible error wrt confidence threshold, std give use info about the expected variation of count quality across the different samples of the validation set.

Ideally, you should aim for low Error AND low STD.

First cell uses Absolute metrics (MAE, RMSE) while second cell uses Relative metrics (MRE, RMSRE)

[12]:
abs_curves = curves["absolute"]
error = abs_curves[["MAE", "RMSE"]]

fig, ax = plt.subplots()
abs_curves[["MAE", "RMSE", "std"]].plot(ax=ax)
plt.scatter(error.idxmin(), error.min(), marker="+", zorder=10)
for x, y in zip(error.idxmin(), error.min()):
    ax.annotate(f"{x:.2f}", [x + 0.01, y + 20])
fig, ax = plt.subplots()
quantiles = [c for c in abs_curves.columns if c.startswith("q")]
abs_curves[quantiles].plot(ax=ax, colormap="coolwarm")
plt.grid()

abs_median = abs_curves["q0.50"].abs()
plt.scatter(abs_median.idxmin(), abs_median.min(), marker="+", zorder=10)
plt.title("Absolute metric and corresponding optimal confidence values")
plt.show()
../_images/notebooks_4_demo_evaluation_crowd_21_0.png
../_images/notebooks_4_demo_evaluation_crowd_21_1.png
[13]:
rel_curves = curves["relative"]
error = rel_curves[["MRE", "RMSRE"]]

fig, ax = plt.subplots()
rel_curves[["MRE", "RMSRE", "std"]].plot(ax=ax)
plt.scatter(error.idxmin(), error.min(), marker="+", zorder=10)
for x, y in zip(error.idxmin(), error.min()):
    ax.annotate(f"{x:.2f}", [x + 0.01, y + 0.01])
fig, ax = plt.subplots()
quantiles = [c for c in rel_curves.columns if c.startswith("q")]
rel_curves[quantiles].plot(ax=ax, colormap="coolwarm")
plt.grid()

abs_median = rel_curves["q0.50"].abs()
plt.scatter(abs_median.idxmin(), abs_median.min(), marker="+", zorder=10)
plt.show()
../_images/notebooks_4_demo_evaluation_crowd_22_0.png
../_images/notebooks_4_demo_evaluation_crowd_22_1.png