Creators and operators shall provide evidence of the effectiveness and fitness for purpose of A/IS.
## Background

The responsible adoption and deployment of A/IS are essential if such systems are to realize their many potential benefits for the well-being of both individuals and societies. A/IS will not be trusted unless they can be shown to be effective in use. Harms caused by A/IS, from harm to an individual through to systemic damage, can undermine the perceived value of A/IS and delay or prevent their adoption.

Operators and other users will therefore benefit from measurement of the effectiveness of the A/IS in question. To be adequate, effectiveness measurements need to be both valid and accurate, as well as meaningful and actionable. Such measurements must be accompanied by practical guidance on how to interpret and respond to them.
## Recommendations

1. Creators engaged in the development of A/IS should seek to define metrics or benchmarks that will serve as valid and meaningful gauges of the effectiveness of the system in meeting its objectives, adhering to standards, and remaining within risk tolerances. Creators building A/IS should ensure that the results of applying the defined metrics are readily obtainable by all interested parties, e.g., users, safety certifiers, and regulators of the system.
2. Creators of A/IS should provide guidance on how to interpret and respond to the metrics generated by the systems.
3. To the extent warranted by specific circumstances, operators of A/IS should follow the guidance on measurement provided with the systems, i.e., which metrics to obtain, how and when to obtain them, how to respond to given results, and so on.
4. To the extent that measurements are sample based, they should account for the scope of sampling error, e.g., by reporting confidence intervals associated with the measurements. Operators should be advised how to interpret the results.
5. Creators of A/IS should design their systems such that metrics on specific deployments of a system can be aggregated to provide information on the effectiveness of the system across multiple deployments. For example, in the case of autonomous vehicles, metrics should be generated both for a specific instance of a vehicle and for a fleet of many instances of the same kind of vehicle.
6. In interpreting and responding to measurements, allowance should be made for variation in the specific objectives and circumstances of a given deployment of A/IS.
7. Creators and operators of A/IS should work toward developing standards for the measurement and reporting of the effectiveness of A/IS.
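The recommendation on sample-based measurement can be illustrated with a small sketch. Assuming a hypothetical success-rate metric (e.g., the fraction of tasks an A/IS completed correctly in a sample of trials), a normal-approximation confidence interval conveys the scope of sampling error alongside the point estimate; the metric, function name, and numbers below are illustrative, not drawn from the source.

```python
import math

def success_rate_ci(successes, trials, z=1.96):
    """Point estimate and normal-approximation confidence interval
    for a sample-based success-rate metric.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    p = successes / trials
    # Standard error of a proportion, scaled to the chosen coverage.
    half_width = z * math.sqrt(p * (1 - p) / trials)
    # Clamp to [0, 1] since a proportion cannot leave that range.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Illustrative report: 930 successes in 1,000 sampled trials.
rate, lo, hi = success_rate_ci(930, 1000)
print(f"effectiveness: {rate:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Reporting the interval, not just the rate, gives operators the guidance the recommendation asks for: a deployment whose tolerance threshold falls inside the interval warrants more data before a pass/fail judgment.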
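The recommendation on aggregating metrics across deployments hides a subtle pitfall worth showing: pooling raw counts is not the same as averaging per-deployment rates. The sketch below uses a hypothetical fleet metric (interventions per 1,000 km for autonomous vehicles, echoing the source's example); all names and figures are invented for illustration.

```python
from statistics import fmean

# Hypothetical per-deployment telemetry: one record per vehicle.
per_vehicle = {
    "vehicle_a": {"km": 12_000, "interventions": 30},
    "vehicle_b": {"km": 8_000, "interventions": 28},
    "vehicle_c": {"km": 20_000, "interventions": 44},
}

def rate_per_1000km(m):
    return 1000 * m["interventions"] / m["km"]

# Deployment-level view: each vehicle reports its own metric.
vehicle_rates = {v: rate_per_1000km(m) for v, m in per_vehicle.items()}

# Fleet-level view: pool the raw counts so that high-mileage
# vehicles carry proportionally more weight.
total_km = sum(m["km"] for m in per_vehicle.values())
total_interventions = sum(m["interventions"] for m in per_vehicle.values())
fleet_rate = 1000 * total_interventions / total_km

# A naive unweighted mean of the per-vehicle rates over-weights
# low-mileage vehicles and generally differs from the pooled rate.
naive_mean = fmean(vehicle_rates.values())
print(f"fleet (pooled): {fleet_rate:.2f}, naive mean: {naive_mean:.2f}")
```

Designing systems to emit the underlying counts, rather than only pre-computed rates, is what makes this kind of cross-deployment aggregation possible.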
## Further Resources
A. Steinfeld, T. W. Fong, D. Kaber, J. Scholtz, A. Schultz, and M. Goodrich, “Common Metrics for Human-Robot Interaction,” 2006 Human-Robot Interaction Conference, March 2006.
R. Madhavan, E. Messina, and E. Tunstel, Eds., [Performance Evaluation and Benchmarking of Intelligent Systems](https://link.springer.com/book/10.1007%2F978-1-4419-0492-8), Boston, MA: Springer.
IEEE Robotics & Automation Magazine, Special Issue on Replicable and Measurable Robotics Research, vol. 22, no. 3, September.
C. Flanagin, “A Survey on Robotics Systems and Performance Analysis.”
Transaction Processing Performance Council (TPC) establishes the Artificial Intelligence Working Group (TPC-AI), tasked with developing industry-standard benchmarks for both hardware and software platforms associated with running Artificial Intelligence (AI) based workloads, 2017.