Companies and institutions are moving at full speed to implement artificial intelligence (AI) and its many applications, with the potential and promise of making work better, faster and cheaper. However, there are real risks in using AI-based systems mindlessly.
Many companies use AI-based hiring tools to score job candidates. One such tool taught itself that male candidates were preferable to female candidates: it penalized resumes that included the word “women’s,” as in “women’s chess club captain,” and it downgraded graduates of two all-women’s colleges, according to people familiar with the matter. Because the system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way, it was later scrapped.
Several companies, such as HireVue, use AI-based systems to analyze job candidates’ speech and facial expressions in video interviews, reducing reliance on resumes alone. In a number of instances, the facial expressions of candidates of Asian descent were misread as signaling passivity or dullness, and hence unsuitability for dynamic marketing jobs, a clear failure of the AI systems to account for cultural differences in expression.
Another well-known example is the use of AI to inform sentencing decisions for those convicted of crimes. One such tool, COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), uses an AI algorithm to assess the risk of recidivism. In 2016, ProPublica investigated COMPAS and found that black defendants were almost twice as likely as white defendants to be labeled higher risk yet not actually re-offend; COMPAS tended to make the opposite mistake with white defendants, wrongly predicting that they would not commit additional crimes after release. This level of discrimination is anathema to U.S. society. So the question is, why did this AI-based system make such a mistake?
Any AI model that relies so heavily on large amounts of data depends on that data not being compromised. In the case of recidivism, i.e. committing another crime after being released from prison, COMPAS was not trained on conviction data for subsequent crimes but on the arrest records of people who had already been in prison once. In most U.S. states, African Americans are more readily arrested on suspicion of committing a crime. The data used to train the COMPAS model was therefore exactly the wrong data to use: it simply reflected prejudice in society rather than criminal behavior.
Given the deep (and autonomous) learning nature of state-of-the-art AI algorithms, biases can be difficult to detect until the outputs are already in use. A clear lesson from the examples above is that the data used to train an AI model has to be examined and de-risked very carefully. Training data captures past history rather than present reality, and data on past decisions and outcomes replicates human biases, for instance racism and sexism. The algorithms themselves can create new forms of bias if the criteria that users provide for determining “success” are flawed. AI models need to be tested adequately, and at regular intervals, to de-risk them.
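As one illustration of what such regular testing can look like, the minimal sketch below compares a risk model’s false-positive rates across two groups, in the spirit of the ProPublica analysis. The data, column names and group labels are hypothetical, and this is only one of many checks a real audit would include.

```python
# A minimal sketch (hypothetical data and column names) of one routine bias check:
# comparing false-positive rates of a risk model across demographic groups.
import pandas as pd

# Hypothetical audit table: one row per person, with the model's prediction
# and the observed outcome after release.
audit = pd.DataFrame({
    "group":               ["A", "A", "A", "B", "B", "B"],
    "predicted_high_risk": [1,   1,   0,   0,   0,   1],
    "reoffended":          [0,   1,   0,   0,   0,   1],
})

def false_positive_rate(df: pd.DataFrame) -> float:
    """Share of people who did NOT re-offend but were labeled high risk."""
    negatives = df[df["reoffended"] == 0]
    return (negatives["predicted_high_risk"] == 1).mean()

for group, rows in audit.groupby("group"):
    print(f"Group {group}: false-positive rate = {false_positive_rate(rows):.2f}")

# A large gap between groups is exactly the kind of disparity that regular,
# scheduled audits are meant to surface before (and after) deployment.
```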
The question also remains why the AI development teams did not realize that the outputs of their models were biased and did not reflect the values of society, or even of the organizations using them. If the team behind the hiring and recruitment system had included influential women, the problem of recommending only men for hire would have been caught immediately. Similarly, if the COMPAS team had included more minorities, the outputs would have been challenged at the source. Teams that develop AI-based models and algorithms need to be diverse, so that biases that creep in for any number of reasons, especially “corrupt” data, can be handled or mitigated quickly.
Then there is generative AI, often built on large language models (LLMs), which differs from previous forms of AI and analytics in its ability to generate new content efficiently, often in “unstructured” forms such as written text or images. One of the best-known generative AI tools is ChatGPT, produced by OpenAI, which is heavily funded by Microsoft. There has been a large uptick in companies using LLMs in their operations, and one of the biggest dangers of these models is that they are prone to “hallucinations”, i.e. answering questions with plausible but untrue assertions.
For example, when David Smerdon, an economist at the University of Queensland, asked ChatGPT “What is the most cited economics paper of all time?”, it answered “A Theory of Economic History” by Douglass North and Robert Thomas, published in the Journal of Economic History in 1969. Plausible as that answer is, the paper does not exist; it was a hallucination. Smerdon speculates about why ChatGPT hallucinated it: the most cited economics papers often have “theory” and “economic” in their titles, and if a title starts with “a theory of economic…”, then “…history” is a likely continuation. Douglass North, a Nobel laureate, is a heavily cited economic historian, and he did write a book with Robert Thomas. In other words, the citation is magnificently plausible. What ChatGPT deals in is not truth; it is plausibility.
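A toy sketch can make this concrete. The snippet below is not how ChatGPT actually works internally, and the counts are invented purely for illustration; it only shows how a model that ranks continuations by how often they follow a phrase will favor a plausible completion such as “history”, regardless of whether the resulting citation is real.

```python
# Toy illustration (hypothetical counts, not ChatGPT's actual model or data):
# a language model scores continuations by how likely they are to follow the
# prompt, not by whether the completed statement is true.
from collections import Counter

# Invented counts of words seen after the phrase "a theory of economic ..."
continuation_counts = Counter({
    "history": 120,   # plausible: common in heavily cited economic-history titles
    "growth": 95,
    "justice": 10,
})

total = sum(continuation_counts.values())
for word, count in continuation_counts.most_common():
    print(f"P({word!r} | 'a theory of economic') = {count / total:.2f}")

# Picking the highest-probability continuation ("history") yields a
# plausible-sounding, but in this case nonexistent, citation.
```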
As the use of AI becomes more ubiquitous, the risks for those using it and for those affected by its decisions will increase. It is imperative that companies and institutions adopting AI, whether current state-of-the-art AI, generative AI or, certainly, artificial general intelligence (AGI), make sure that human intelligence remains the master of the domains where artificial intelligence is used.
The 2023 edition of the Network Readiness Index, dedicated to the theme of trust in technology and the network society, will launch on November 20th with a hybrid event at Saïd Business School, University of Oxford.
For more information about the Network Readiness Index, visit https://networkreadinessindex.org/
Professor G. ‘Anand’ Anandalingam is the Ralph J. Tyser Professor of Management Science at the Robert H. Smith School of Business at the University of Maryland. Previously, he was Dean of Imperial College Business School at Imperial College London from August 2013 to July 2016 and Dean of the Smith School of Business at Maryland from 2007 to 2013. Before joining Smith in 2001, Anand was at the University of Pennsylvania for nearly 15 years where he was the National Center Chair of Resource and Technology Management in both the Penn Engineering School and the Wharton School of Business. Anand received his Ph.D. from Harvard University, and his B.A. from Cambridge University, U.K.