Online appendix for Estimating Combinatorial t-way Coverage based on Matrix Complexity Metrics

Luiza Corpaci¹, Michael Wagner², Sebastian Raubitzek³, Ludwig Kampel², Kevin Mallinger¹, Dimitris Simos²

¹Christian Doppler Laboratory for Assurance and Transparency in Software Protection, University of Vienna, Austria

²MATRIS Research Group, SBA Research, Austria

³CORE Research Group, SBA Research, Austria

A. Full Results

This appendix collects the RMSEs from all conducted experiments. It reports the RMSE for each instance together with the corresponding maximal residual, thus extending the results displayed in Figure 3 of the main text. These results, i.e., the RMSEs and the corresponding maximal residuals, are compiled in Table 1. The maximal residuals provide a metric that captures the volatility of both our trained machine learning models and the baseline we compare against, i.e., the probabilistic coverage estimate. The maximal residual answers the question: how wrong can my estimator be?
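Both error measures are straightforward to compute. As a minimal sketch (the coverage values below are made-up toy numbers, not taken from our experiments):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and estimated coverage values."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(r ** 2)))

def max_residual(y_true, y_pred):
    """Largest absolute error: 'how wrong can my estimator be?'"""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.max(np.abs(r)))

# toy example with hypothetical coverage values (in percent)
true_cov = [97.2, 98.5, 99.1]
est_cov = [96.8, 98.9, 99.0]
print(rmse(true_cov, est_cov), max_residual(true_cov, est_cov))
```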

Considering the results displayed in Table 1, we observe that the maximal residual generally exhibits the same behavior as the RMSE: if the RMSE of our models is below that of the baseline, the maximal residual of our models is also below that of the baseline, and vice versa.

The heatmaps in Figures 1 and 2 visualize the performance of the different models for estimating t-way coverage. Each heatmap represents a table whose rows correspond to the alphabet size and whose columns correspond to the number of columns of the analysed randomly generated test set. The heatmaps are organised such that each model has an associated column and each strength t has an associated row.

The probabilistic estimator, Lasso, and LightGBM are assessed at various strengths (t ∈ {2, 3, 4}). The shades of red indicate the magnitude of the error (RMSE, Root Mean Squared Error) across the analysed scenarios (alphabet size v ∈ {2, 3, 4, 5} and number of columns k ∈ {20, 50, 100, 200, 500, 1000}). Darker reds indicate higher errors, whereas lighter shades indicate lower errors. This visual encoding offers a quick comparison of how model performance scales with increasing values of t, v, and k.
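For reference, a standard probabilistic estimate of the expected t-way coverage of a uniformly random test set follows from the fact that a single random test hits a specific t-tuple of values with probability v^(-t); whether this exact formula matches the baseline used here is defined in the main text, so the sketch below is illustrative only:

```python
def probabilistic_coverage(n_tests, v, t):
    """Expected fraction of t-way value combinations covered by n_tests
    uniformly random tests over an alphabet of size v.

    A specific t-tuple of values is hit by one random test with
    probability v**-t, so it is missed by all n_tests tests with
    probability (1 - v**-t)**n_tests; by linearity of expectation this
    also gives the expected covered fraction over all tuples.
    """
    p_single = v ** (-t)
    return 1.0 - (1.0 - p_single) ** n_tests

# e.g. expected pairwise (t=2) coverage of 10 random binary tests, in percent
print(round(100 * probabilistic_coverage(10, v=2, t=2), 2))  # → 94.37
```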

Consistent with the results presented in Section 5, we see that the machine learning models generally outperform the probabilistic baseline for binary test sets, except when the test sets are very large (i.e., contain a high number of tests), where the performance of LightGBM and the probabilistic baseline converges. Notably, the Lasso model without the probabilistic baseline performs significantly worse, while the Lasso model with the probabilistic baseline included improves to a performance level comparable to the other models.

Table 1

Comparison of RMSE and Max Residuals per (strength: t, alphabet size: v, columns: k) for all tested models.

t  v  k | RMSE: prob  Lasso  Lasso+prob  LGBM  LGBM+prob | Max Residual: prob  Lasso  Lasso+prob  LGBM  LGBM+prob
t=2 2 20 3.51 2.36 2.38 2.54 2.54 11.65 7.20 7.23 8.95 9.10
50 2.11 1.71 1.73 1.69 1.68 7.07 4.92 4.95 5.66 5.40
100 1.39 1.21 1.21 1.04 1.05 4.78 4.08 4.12 3.86 4.38
200 1.04 0.94 0.94 0.76 0.77 3.19 2.83 2.84 3.21 3.07
500 0.65 0.61 0.59 0.48 0.48 1.84 1.92 1.70 1.37 1.38
1000 0.45 0.46 0.41 0.32 0.31 1.80 1.81 1.68 1.15 1.10
3 20 1.92 1.83 1.83 1.88 1.88 6.56 6.44 6.37 6.11 6.51
50 1.14 1.16 1.09 1.12 1.13 4.43 4.49 4.31 3.85 3.78
100 0.81 0.90 0.80 0.80 0.80 2.89 3.16 2.55 2.22 2.21
200 0.63 0.79 0.62 0.61 0.60 2.00 2.76 1.94 2.19 1.96
500 0.38 0.45 0.36 0.39 0.40 1.16 1.45 1.13 1.94 2.39
1000 0.26 0.43 0.26 0.26 0.26 0.76 1.45 0.74 0.85 0.80
4 20 1.30 1.54 1.27 1.29 1.29 4.61 6.21 5.00 4.92 4.78
50 0.76 0.83 0.75 0.78 0.77 2.74 2.60 2.64 2.54 2.56
100 0.54 0.61 0.53 0.57 0.56 1.86 1.70 1.74 1.77 1.91
200 0.37 0.50 0.38 0.41 0.41 1.13 1.75 1.17 1.25 1.39
500 0.23 0.36 0.23 0.24 0.24 0.74 1.12 0.69 0.78 0.76
1000 0.17 0.31 0.17 0.19 0.20 0.67 0.85 0.67 1.24 1.34
5 20 0.93 1.23 0.91 0.97 0.95 3.16 3.52 3.27 3.18 3.11
50 0.52 0.58 0.51 0.54 0.54 1.62 2.08 1.73 1.73 1.84
100 0.37 0.52 0.37 0.39 0.39 1.19 1.79 1.20 1.19 1.17
200 0.27 0.46 0.28 0.30 0.29 0.95 1.25 0.98 1.53 0.93
500 0.17 0.38 0.17 0.21 0.20 0.56 1.10 0.56 1.97 1.71
1000 0.11 0.31 0.11 0.14 0.13 0.37 0.85 0.39 0.43 0.43
t=3 2 20 2.30 1.60 1.59 1.46 1.46 6.58 4.64 4.81 4.90 4.94
50 1.38 1.00 1.01 0.90 0.89 5.97 4.84 5.06 4.43 4.45
100 0.99 0.78 0.74 0.60 0.59 3.29 2.94 2.48 1.73 1.79
200 0.70 0.66 0.62 0.44 0.44 2.10 2.29 1.93 1.29 1.27
500 0.42 0.54 0.39 0.30 0.31 1.99 1.92 2.03 1.39 1.46
1000 0.30 0.49 0.29 0.23 0.23 1.09 2.01 0.94 0.80 0.77
3 20 0.86 1.19 0.85 0.85 0.84 3.28 4.62 3.35 2.79 2.67
50 0.56 0.64 0.54 0.57 0.56 1.95 2.22 1.87 1.84 1.79
100 0.37 0.51 0.37 0.40 0.40 1.55 1.92 1.66 1.76 1.71
200 0.26 0.37 0.26 0.30 0.29 1.03 1.20 1.06 1.25 1.32
500 0.17 0.37 0.17 0.21 0.20 0.72 1.14 0.70 1.24 0.79
1000 0.12 0.36 0.17 0.13 0.13 0.44 1.32 0.58 0.45 0.48
4 20 0.46 0.74 0.45 0.47 0.47 1.40 2.33 1.46 1.48 1.57
50 0.28 0.79 0.28 0.29 0.29 0.96 2.08 0.97 0.91 0.90
100 0.19 0.64 0.19 0.22 0.22 0.66 2.62 0.64 0.93 0.91
200 0.13 0.37 0.13 0.17 0.17 0.42 1.90 0.43 0.81 0.92
500 0.08 0.37 0.08 0.11 0.10 0.30 1.33 0.32 0.70 0.48
1000 0.06 0.31 0.06 0.10 0.10 0.22 0.70 0.22 0.41 0.51
5 20 0.31 0.62 0.31 0.33 0.33 0.94 2.03 0.98 1.15 1.17
50 0.16 0.91 0.17 0.19 0.19 0.60 3.94 0.60 0.61 0.69
100 0.11 0.76 0.11 0.15 0.15 0.45 2.41 0.47 0.56 0.43
200 0.08 0.59 0.08 0.12 0.11 0.24 3.17 0.23 0.54 0.36
500 0.05 0.32 0.05 0.07 0.08 0.15 0.73 0.17 0.48 0.47
1000 0.03 0.33 0.04 0.08 0.08 0.11 0.94 0.13 0.42 0.47
t=4 2 20 1.47 1.19 1.19 1.06 1.04 5.87 4.48 4.48 4.47 4.53
50 0.90 0.75 0.76 0.62 0.62 2.74 2.17 2.14 2.27 2.21
100 0.60 0.60 0.55 0.44 0.46 2.26 1.59 1.80 1.33 1.35
200 0.41 0.48 0.40 0.36 0.32 1.60 1.77 1.48 1.49 1.31
3 20 0.45 0.78 0.43 0.44 0.44 1.58 2.74 1.55 1.39 1.46
50 0.24 0.87 0.23 0.26 0.26 1.02 2.88 1.03 0.79 0.87
100 0.17 0.60 0.17 0.20 0.19 0.54 2.46 0.58 0.71 0.63
200 0.12 0.26 0.13 0.14 0.15 0.40 0.73 0.41 0.60 0.63
4 20 0.19 0.57 0.19 0.22 0.21 0.77 2.35 0.79 0.88 0.84
50 0.10 0.88 0.10 0.15 0.12 0.33 2.76 0.33 0.61 0.40
100 0.06 0.65 0.07 0.13 0.11 0.26 2.05 0.28 0.43 0.56
200 0.05 0.86 0.05 0.08 0.08 0.14 3.33 0.16 0.41 0.30
5 20 0.09 0.66 0.10 0.14 0.15 0.33 1.60 0.32 0.47 0.58
50 0.05 0.91 0.05 0.11 0.10 0.14 3.68 0.13 0.49 0.45
100 0.03 0.67 0.03 0.09 0.12 0.10 2.74 0.10 0.44 0.58
200 0.02 0.57 0.03 0.07 0.08 0.09 1.42 0.09 0.33 0.32
[3 × 3 grid of heatmaps; rows: t = 2, 3, 4; columns: probabilistic baseline, Lasso (baseline excluded), LightGBM (baseline excluded)]

Figure 1

Comparison of RMSE for different alphabet sizes (v), numbers of columns (k), and strength values (t): probabilistic baseline (1st column) vs. Lasso without the probabilistic baseline (2nd column) vs. LightGBM without the probabilistic baseline (3rd column).

[3 × 3 grid of heatmaps; rows: t = 2, 3, 4; columns: probabilistic baseline, Lasso (baseline included), LightGBM (baseline included)]

Figure 2

Comparison of RMSE for different alphabet sizes (v), numbers of columns (k), and strength values (t): probabilistic baseline (1st column) vs. Lasso with the probabilistic baseline included (2nd column) vs. LightGBM with the probabilistic baseline included (3rd column).

B. Feature Importances

Figure 3 shows the outcomes of our feature importance analysis, considering all features, for both LightGBM and the Lasso regressor, with the respective importances plotted as bars. We averaged the feature importance scores over all our experiments: we conducted one experiment for each combination of alphabet size v, strength t, and number of columns k, as shown in Table 1, and then averaged over v, t, and k for both regressors, while differentiating between datasets with the baseline included and datasets without it.
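The averaging step can be sketched as follows; the normalize-then-average scheme shown here (and using |coef_| for Lasso or feature_importances_ for LightGBM as raw per-experiment scores) is an illustrative assumption, not necessarily the exact procedure:

```python
import numpy as np

def average_importance(per_experiment_scores):
    """Average normalized feature importances over experiments.

    per_experiment_scores: 2-D array-like, one row per (v, t, k)
    experiment, one column per feature.  Each row is normalized to sum
    to 1 so that experiments with different importance scales contribute
    equally, then the rows are averaged.
    """
    scores = np.abs(np.asarray(per_experiment_scores, dtype=float))
    normalized = scores / scores.sum(axis=1, keepdims=True)
    return normalized.mean(axis=0)

# toy example: 2 experiments, 3 features
print(average_importance([[2.0, 1.0, 1.0], [0.0, 1.0, 1.0]]))
```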

[2 × 2 grid of bar charts of averaged normalized feature importances: Lasso and LightGBM, each without and with the probabilistic baseline included]

Figure 3

Full spectrum of the normalized feature importance values for Lasso (top-left), Lasso with the probabilistic baseline included (bottom-left), LightGBM (top-right), and LightGBM including the probabilistic baseline (bottom-right).

C. Runtime Analysis

To assess our best results, i.e., our best-performing LightGBM model, in terms of computational efficiency, we performed a runtime analysis of our ML approach against CAmetrics, a state-of-the-art tool for computing the t-way coverage of a randomly generated test set. We averaged the time it takes to compute the coverage of 10 different test sets using the CAmetrics tool and the developed ML approach. We further distinguished between the time needed to compute the SVD of the regarded test sets and the prediction time of a trained model. These runtimes are shown in Table 2. One counterintuitive aspect of these runtimes is that predicting coverage for test sets with fewer tests takes longer with our trained models. We cannot offer a definite explanation for this, except that our approach requires more complex models for smaller matrices, so the prediction time decreases as the matrix size grows. This might also be linked to the fact that smaller matrices provide fewer singular values; the metrics are thus less expressive, requiring more complex LightGBM model architectures in terms of tree-based, boosted structures within the model.
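The timing split can be sketched as below, assuming numpy's SVD and a trained regressor with a predict() method; the construction of model features directly from the singular values is a simplified stand-in for the actual complexity metrics defined in the main text:

```python
import time

import numpy as np

def timed_prediction(test_set, model, repeats=10):
    """Average wall-clock times (in ms) for the SVD step and the model
    prediction step, mirroring the split reported in Table 2.

    test_set: 2-D numpy array (one row per test).
    model:    any trained regressor exposing predict().
    """
    svd_ms, pred_ms = [], []
    for _ in range(repeats):
        t0 = time.perf_counter()
        s = np.linalg.svd(test_set, compute_uv=False)  # singular values only
        features = s.reshape(1, -1)  # simplified feature vector
        t1 = time.perf_counter()
        _ = model.predict(features)
        t2 = time.perf_counter()
        svd_ms.append((t1 - t0) * 1e3)
        pred_ms.append((t2 - t1) * 1e3)
    return float(np.mean(svd_ms)), float(np.mean(pred_ms))
```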

Table 2

Comparison between the runtimes (in milliseconds) of the SVD computation, the average prediction time of LightGBM (best-performing model), and the sum of the SVD and LightGBM times, against the CAmetrics runtime.

v  k  t | SVD | LightGBM | SVD + LightGBM | CAmetrics
2 20 2 1.850 10.441 12.291 65.447
2 20 3 1.956 9.827 11.783 74.637
2 20 4 1.994 2.912 4.906 75.703
3 200 2 10.708 4.992 15.700 66.297
3 200 3 12.166 1.511 13.677 216.461
3 200 4 32.036 1.172 33.208 247.448
5 1000 2 50.129 1.127 51.256 151.461
5 1000 3 101.815 0.875 102.690 272.792