
Artificial Intelligence (AI) for Business Librarians

Evaluating AI as a source

To evaluate AI results, we need to consider several factors. In this section we will walk through the mechanics of current AI algorithms (as far as we can know them) to understand how AI tools find and use sources to generate results. We will consider these results through the familiar lenses of the ACRL Framework and source evaluation tools such as CRAAP to see how AI-generated material might be judged on its credibility and accuracy.

Due to issues explained below, it is possible that some current AI tools may not pass credible-source evaluation tests such as CRAAP. However, by establishing an understanding of how these AI tools work, and by learning how to adapt existing evaluation tools to AI results, you can continue to follow and evaluate AI as it grows and evolves.

AI as a source

AI Chatbots and AI Technology:

The current batch of AI chatbots are LLMs (large language models): “a machine-learning system that autonomously learns from data and can produce sophisticated and seemingly intelligent writing after training on a massive data set of text” (van Dis et al., 2023). The exact language training data differ from product to product, and the sources are considered proprietary information that is mostly kept secret. However, OpenAI (ChatGPT’s parent company) has released a preprint paper on its LLM methods, which explained that the dataset sources included Wikipedia, publicly available books, some academic articles, general websites, and publicly readable social media networks such as Reddit (Radford et al., 2018). The current iteration of the generator also notes that it does not have access to academic databases. Additionally, the language set used for the previous version of ChatGPT was only current up to 2021. Each individual AI product is expected to update and expand on its own timeline, according to its parent company. AI products pull their responses from their language training sets and libraries, not from the entire existing internet. This means that the current set of AI apps are restricted in their output to whatever subset of sources was available to them as input.
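To make that restriction concrete, here is a deliberately simplified sketch: a toy bigram "language model" in Python that learns only from a three-sentence training corpus. Production LLMs are vastly more sophisticated, but the core limitation illustrated here carries over: the model can only recombine patterns present in its training data.

```python
import random
from collections import defaultdict

# Toy training corpus standing in for an LLM's (much larger) training set.
corpus = (
    "the library licenses academic databases . "
    "the model trains on public web text . "
    "public web text omits most academic databases ."
)

# Build a bigram table: for each word, every word observed to follow it.
follows = defaultdict(list)
tokens = corpus.split()
for current_word, next_word in zip(tokens, tokens[1:]):
    follows[current_word].append(next_word)

def generate(start_word, length=8):
    """Emit words by repeatedly sampling an observed successor."""
    word, output = start_word, [start_word]
    for _ in range(length):
        if word not in follows:  # dead end: nothing ever followed this word
            break
        word = random.choice(follows[word])
        output.append(word)
    return " ".join(output)

print(generate("the"))
# Every possible output recombines the training corpus; a word that never
# appeared in the corpus (e.g., "Scopus") can never appear in the output.
```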

Natural language generator (NLG):

It’s important to note that ChatGPT and many other LLM AI models are trained as much on producing natural, human-sounding language as they are on retrieving information. In terms of developing an AI algorithm, this can lead to gaps in what is called alignment: the degree to which an AI performs tasks as humans expect or need it to (Strickland, 2023). While ChatGPT has well-developed NLG capabilities, it still does not perform search tasks in alignment with what humans expect from a search function; its failures include misquoting sources and hallucinating (making things up). In fact, the NLG performs so much better than people expect that they attribute far higher levels of accuracy to the responses than verification studies actually measure.

AI Accuracy Issues

AI and accuracy or "hallucinations":

Because ChatGPT and other LLMs are newer technology, there has not been time to build a substantive body of research literature on accuracy or hallucination rates. However, here are a few recent example studies:

  • Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus. 2023 May 19;15(5):e39238. doi: 10.7759/cureus.39238.
    • From the Results section of the above paper: “Overall, 115 references were generated by ChatGPT, with a mean of 3.8±1.1 per paper. Among these references, 47% were fabricated, 46% were authentic but inaccurate, and only 7% were authentic and accurate. The likelihood of fabricated references significantly differed based on prompt variations; yet the frequency of authentic and accurate references remained low in all cases. Among the seven components evaluated for each reference, an incorrect PMID number was most common, listed in 93% of papers. Incorrect volume (64%), page numbers (64%), and year of publication (60%) were the next most frequent errors. The mean number of inaccurate components was 4.3±2.8 out of seven per reference.”
  • Walters, W.H., Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5
    • From the abstract of the above article: “Although chatbots such as ChatGPT can facilitate cost-effective text generation and editing, factually incorrect responses (hallucinations) limit their utility. This study evaluates one particular type of hallucination: fabricated bibliographic citations that do not represent actual scholarly works. We used ChatGPT-3.5 and ChatGPT-4 to produce short literature reviews on 42 multidisciplinary topics, compiling data on the 636 bibliographic citations (references) found in the 84 papers. We then searched multiple databases and websites to determine the prevalence of fabricated citations, to identify errors in the citations to non-fabricated papers, and to evaluate adherence to APA citation format. Within this set of documents, 55% of the GPT-3.5 citations but just 18% of the GPT-4 citations are fabricated. Likewise, 43% of the real (non-fabricated) GPT-3.5 citations but just 24% of the real GPT-4 citations include substantive citation errors. Although GPT-4 is a major improvement over GPT-3.5, problems remain.”
  • Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. The American Economist, 0(0). https://doi.org/10.1177/05694345231218454
    • From the Abstract:  “In this study, we generate prompts derived from every topic within the Journal of Economic Literature to assess the abilities of both GPT-3.5 and GPT-4 versions of the ChatGPT large language model (LLM) to write about economic concepts. ChatGPT demonstrates considerable competency in offering general summaries but also cites non-existent references. More than 30% of the citations provided by the GPT-3.5 version do not exist and this rate is only slightly reduced for the GPT-4 version. Additionally, our findings suggest that the reliability of the model decreases as the prompts become more specific. We provide quantitative evidence for errors in ChatGPT output to demonstrate the importance of LLM verification.”

Studies on accuracy and hallucination rates will evolve along with AI products.  Currently, accuracy of cited information remains a major issue in AI-generated information.  

Black Box - Unknown Sources Issues

Black Box issues

As mentioned above, all current AI LLMs are proprietary software owned by private companies. This means that their generative algorithms are trade secrets. Neither OpenAI nor any of the other major AI LLM developers has open-sourced its specific algorithms. All information on the datasets used for LLMs has been voluntarily released by the corporate owners, and only to the degree they choose. For any particular AI tool, the actual sources used to train or provide datasets may be partially or wholly unknown.

Additionally, current AI LLMs rely on specific prompts to generate information. If the user does not prompt the AI to provide a specific source or citation, generally none will be provided (Walters & Wilder, 2023). These two barriers represent a significant obstacle for users trying to identify and find the source of an idea, claim, or statement in text generated by an AI LLM.
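As a minimal sketch of this prompt dependence, the hypothetical Python example below sends the same question to a chat model twice, once without and once with an explicit request for citations. It assumes the `openai` package and an `OPENAI_API_KEY` environment variable, and the model name is an assumption; as the studies above show, any citations returned still need independent verification.

```python
# Assumes: `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

question = "What drives seasonal demand in the U.S. retail sector?"

def ask(prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; substitute your own
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Without an explicit request, the model typically returns unattributed prose.
print(ask(question))

# With an explicit request, citations usually appear -- but some may be
# fabricated, so each one must still be verified before it is trusted.
print(ask(question + " Cite specific published sources for each claim."))
```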

AI as an academic source

Current AI chatbot and text-generator models do not pull materials from the entire internet or from most paid-subscription academic databases. Therefore, the information available to AI products is a far smaller, and less academic, set of sources than most academic libraries provide access to. Some AI chatbots may incorporate academic database sources in the future; however, there is no way to know whether the AI parent companies will disclose the extent or the specific sources.

Some AI products currently scrape or pull publicly available sources for academic materials. These sources could be anything: an article on an individual's professional website, an open-access academic repository, openly published articles, predatory publishing outlets, or non-academic sources publishing articles, papers, and opinion pieces. Without access to the AI's datasets, it is not possible to know which generated text may come from which source. Complicating this lack of transparency is the possibility that the AI might attach a hallucinated citation to text or an excerpt from a non-academic source.

An additional consideration is the interaction between information literacy standards and academic integrity policies, which may vary from institution to institution. Some institutions even leave the determination of allowable AI usage to individual instructors, as part of that institution's expression of academic freedom. This can mean that librarians may need to establish guidelines and guides that take into account the evolving status of AI as an acceptable or unacceptable source under academic standards.

AI as a credible source

Most of the common credible-source evaluation methods used by academic libraries, such as CRAAP, TRAAP, PROVEN, or the 5 W's, focus on the specific source of a text or claim. Use of these methods requires the researcher to analyze the source by its characteristic details (date of publication, academic affiliation, author credentials, etc.). Without access to a specific and accurate citation for a source, these methods have no way of establishing that a source is credible.

Even where citations are provided, the accuracy rate of a particular AI product may not be high enough to establish credibility without the secondary step of verifying and then evaluating the sources individually. Researchers Hall and McKee (2024) caution that the accuracy issues of AI outputs are analogous to social media misinformation concerns, and they offer the following advice about current AI models:

"It’s important that we never to entrust ChatGPT with too much responsibility or credibility. Fact-checking and proofreading every generative AI output is crucial–even for data outputs. Users should always take responsibility for the accuracy and reliability of their work." (Hall & McKee, 2024).

Other authors cited here (Walters & Wilder, 2023; Gravel et al., 2023) have suggested similar caution when using AI tools to provide sources or analytical results.
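One way to begin that verification step is to check whether a cited DOI resolves to a real record. The sketch below queries the public Crossref REST API (no key required). A missing record is a red flag for a fabricated citation, though not proof, since some real works are registered outside Crossref; a returned title must still be compared against the citation by a human.

```python
import requests

def check_doi(doi):
    """Look up a DOI in the public Crossref index.

    Returns the registered title if the DOI exists, or None if it does not.
    """
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if response.status_code == 404:
        return None  # no Crossref record: possible fabricated citation
    response.raise_for_status()
    titles = response.json()["message"].get("title", [])
    return titles[0] if titles else ""

# Example: a real DOI cited in this guide, and an obviously invented one.
print(check_doi("10.1038/s41598-023-41032-5"))  # Walters & Wilder (2023) title
print(check_doi("10.9999/made.up.citation"))    # None
```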

Evaluating AI as a tool

The LibrAIry has created the ROBOT test to use when considering AI technology. It provides a framework for evaluating specific AI tools to help you determine which ones will best meet your needs.

Reliability

Objective

Bias

Ownership

Type

Reliability

  • How reliable is the information available about the AI technology?
  • If it’s not produced by the party responsible for the AI, what are the author’s credentials? Bias?
  • If it is produced by the party responsible for the AI, how much information are they making available? 
    • Is information only partially available due to trade secrets?
    • How biased is the information that they produce?

Objective

  • What is the goal or objective of the use of AI?
  • What is the goal of sharing information about it?
    • To inform?
    • To convince?
    • To find financial support?

Bias

  • What could create bias in the AI technology?
  • Are there ethical issues associated with this?
  • Are bias or ethical issues acknowledged?
    • By the source of information?
    • By the party responsible for the AI?
    • By its users?

Ownership

  • Who is the owner or developer of the AI technology?
  • Who is responsible for it?
    • Is it a private company?
    • The government?
    • A think tank or research group?
  • Who has access to it?
  • Who can use it?

Type

  • Which subtype of AI is it?
  • Is the technology theoretical or applied?
  • What kind of information system does it rely on?
  • Does it rely on human intervention? 

 

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Evaluating AI Output

Many of us are familiar with information literacy tests used to assess the accuracy and credibility of resources, such as the CRAAP Test. Analogously, it is important for users of generative AI technologies to develop the skills to effectively assess both their own inputs (a skill also known as 'prompt engineering') and the model's corresponding outputs.

Below, we highlight two frameworks to consider: the ROBOT Test and the CLEAR Framework.


In March 2020, Sandy Hervieux and Amanda Wheatley published a blog post titled "The ROBOT Test" which contains a tool to assess the legitimacy of AI technologies.

There are five factors, which are detailed in-depth within their post: Reliability; Objective; Bias; Ownership; Type. Holistically, these help users think about the inputs, outputs, environmental influences, and authority of an AI application.


In July 2023, Leo Lo published a journal article titled "The CLEAR path: A framework for enhancing information literacy through prompt engineering" which details a framework to optimize interactions with AI language models.

There are five factors, detailed in-depth within the article: Concise; Logical; Explicit; Adaptive; Reflective. Holistically, these help users develop critical thinking skills surrounding the usage of Generative AI, and help instructors enhance their practices around information & digital literacy instruction.
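As an illustration of how the CLEAR factors might shape a prompt in practice, the annotated sketch below contrasts a vague request with one revised along Lo's five factors. The wording of the revised prompt is our own invented example, not taken from the article.

```python
# A vague prompt leaves the model to guess scope, format, and audience.
vague_prompt = "Tell me about market research."

# The same request revised along the CLEAR factors (Lo, 2023).
clear_prompt = (
    # Concise: one focused task, no filler.
    "List five primary sources for U.S. consumer market research. "
    # Logical: steps in the order you want them performed.
    "For each source, first name it, then describe its coverage in one sentence. "
    # Explicit: state format, audience, and constraints outright.
    "Format the answer as a numbered list for undergraduate business students. "
    # Adaptive: flag anything that may need follow-up refinement.
    "Flag any source that requires a paid subscription. "
    # Reflective: build a self-check into the request.
    "Note any source you are uncertain actually exists."
)
```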

References

Bhattacharyya, M., Miller, V. M., Bhattacharyya, D., & Miller, L. E. (2023). High rates of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus, 15(5), e39238. https://doi.org/10.7759/cureus.39238

Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT hallucinates non-existent citations: Evidence from economics. The American Economist, 0(0). https://doi.org/10.1177/05694345231218454

Gravel, J., D'Amours-Gravel, M., & Osmanlliu, E. (2023). Learning to fake it: Limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clinic Proceedings: Digital Health, 1(3), 226-234. https://doi.org/10.1016/j.mcpdig.2023.05.004

Hall, B., & McKee, J. (2024). An early or somewhat late ChatGPT guide for librarians. Journal of Business & Finance Librarianship, 29(1), 58-69. https://doi.org/10.1080/08963568.2024.2303944

James, A. B., & Filgo, E. H. (2023). Where does ChatGPT fit into the Framework for Information Literacy? The possibilities and problems of AI in library instruction. College & Research Libraries News, 84(9), 334.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

Strickland, E. (2023, August 31). OpenAI's moonshot: Solving the AI alignment problem. IEEE Spectrum. https://spectrum.ieee.org/the-alignment-problem-openai

van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R., & Bockting, C. L. (2023). ChatGPT: Five priorities for research. Nature, 614(7947), 224-226. https://doi.org/10.1038/d41586-023-00288-7

Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13, 14045. https://doi.org/10.1038/s41598-023-41032-5
