Strategies

Questions to consider in key stages of AI and machine learning based research, regarding data collection

“Data collection for the model training and the research can be done either by using open data repositories or by directly recruiting participants to make a new dataset. The use of existing datasets raises issues around its intended openness, consent for reuse, and the change of context for which the data is used (Nissenbaum, 2001, 2009). Collecting new data raises issues around meaningful informed consent, whether the subjects are aware of what their data and the resulting research outputs will be used for, how this will affect them and others, and the representation of humans by a necessarily more limited model.

More general questions arise about privacy as a concept to allow data subjects self-determination and control over how data about them is used. Further, respect for autonomy ensures an individual’s ability to make decisions for themselves, and to act upon them. Modern digital data collection (e.g. Application Programming Interfaces) and processing techniques have put the various concepts of privacy and autonomy under significant strain. It is therefore important for researchers to be mindful of ways to minimize the risk to research subjects’ and any violations of privacy and autonomy by third parties. Further, applying technological solutions such as encryption are often mistakenly classed as efforts to improve privacy, while they instead provide more security. Similarly, not disclosing information is called confidentiality, not necessarily privacy.

*General Data Collection:*

- Are the identified data points necessary, relevant and not excessive in relation to the research aim?
- To what extent will data in the database identify individuals directly, or indirectly through inference?
- Do the datasets contain classifiers that are particularly sensitive or even protected classes? If so, what purpose do they serve? Can data points be used as proxies to reconstruct sensitive and protected classes? Is it possible to prevent the re-construction of sensitive and protected classes?
- How does the researcher protect the privacy of its users beyond security measures? For example, is data deleted after a certain amount of time? Is data that is not used for the purpose of the model deleted upon its inadvertent collection?

*Existing Datasets:*

- Is the existing dataset explicitly open for public or research, or was this dataset found without its reuse permissions being specified?
- Is the use of the existing dataset restricted by legal or other means?
- Could the data subjects (whether anonymized or not) in the existing dataset conceivably object to the new use of their data? Does the initial consent (informed or proxy) cover the intended re-use of the dataset?
- What are the limitations in the knowledge derived from the data in modeling individual and collective behaviour in its totality? How does this limit the generalizability of the findings of the study or the applicability of the precision and/or predictors found?

*New Data Collection:*

- Have data subjects consented to the collection of their data with a full understanding of what is being collected, for which purposes, and with an understanding of how the data will be used by the researcher? If not, have the collection processes gone through an ethical review board? And/or how has the research team reflected on how to otherwise gain proxy consent and the potential consequences of the proxy status?
- How could potential risks of harm be communicated to the research participants before entering the study?
- To what extent can researchers confirm whether people understand the consequences of derivative uses of their data in AI and ML, knowing from existing literature that the concept of ‘informed’ consent may not be meaningful for the data subjects?
- Has the organization decided how the privacy of data subjects is safeguarded?
- Does the system collect more information than it needs?
- Are data subjects empowered to decide which data is collected and which inferences are made about them?
- Can the data subject have access to their data? Can they choose to withdraw their data from the system?”

(franzke et al., 2020, pp. 38-40)
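
As an illustration only, two of the questions above (whether the system collects more than it needs, and whether data is deleted after a certain amount of time) could be turned into a routine check along the following lines. This is a minimal sketch and not part of the quoted guidelines: the field names in `ALLOWED_FIELDS`, the one-year `RETENTION` period, and the record layout are all hypothetical assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allow-list: the only fields the (assumed) research aim needs.
ALLOWED_FIELDS = {"participant_id", "collected_at", "response_text"}

# Hypothetical retention period; the quoted guidelines do not prescribe one.
RETENTION = timedelta(days=365)


def minimise(record: dict) -> dict:
    """Keep only allow-listed fields, dropping anything collected inadvertently."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def is_expired(record: dict, now: datetime) -> bool:
    """Return True when the record is older than the retention period."""
    return now - record["collected_at"] > RETENTION


def clean(records: list[dict], now: datetime) -> list[dict]:
    """Drop expired records, then strip non-essential fields from the rest."""
    return [minimise(record) for record in records if not is_expired(record, now)]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    records = [
        {
            "participant_id": "p1",
            "collected_at": datetime(2023, 6, 1, tzinfo=timezone.utc),
            "response_text": "hello",
            "ip_address": "203.0.113.7",  # not needed for the assumed research aim
        },
        {
            "participant_id": "p2",
            "collected_at": datetime(2022, 1, 1, tzinfo=timezone.utc),
            "response_text": "old",
        },
    ]
    print(clean(records, now))
```

A check like this only addresses storage hygiene. It does not answer the questions about consent, proxies for protected classes, or contextual appropriateness, which still require judgment by the research team.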

Challenge Instances: Technologies mediate the interactions and distance between researchers and participants

Overarching Principles: Beneficence; Respect for persons

Principles: Minimise risks of harms; Protection of vulnerable persons

Sources: AoIR report 3

Created At: 2023-05-19T12:24:51.000Z

Title: Questions to consider in key stages of AI and machine learning based research, regarding data collection