To evaluate AI results, we need to consider several factors. In this section we will walk through the mechanics of current AI algorithms (as far as we can know them) to understand how AI tools find and use sources to generate results. We will consider these results through the familiar lenses of the ACRL Framework for Information Literacy and source evaluation tools such as CRAAP to see how AI-generated material might be judged on its credibility and accuracy.
Due to issues explained below, some current AI tools may not pass credible-source evaluation tests such as CRAAP. However, by establishing an understanding of how these AI tools work, and by learning how to adapt existing evaluation tools to AI results, you can continue to evaluate AI tools as they grow and evolve.
AI Chatbots and AI Technology:
The current batch of AI chatbots are LLMs (large language models): "a machine-learning system that autonomously learns from data and can produce sophisticated and seemingly intelligent writing after training on a massive data set of text" (van Dis et al., 2023). The exact training data differ from product to product, and the sources are considered proprietary information and are mostly kept secret. However, OpenAI (ChatGPT's parent company) has released a preprint paper on its LLM methods, which explained that the dataset sources included Wikipedia, publicly available books, some academic articles, general websites, and publicly readable social media networks such as Reddit (Radford et al., 2023). The current iteration of the generator also notes that it does not have access to academic databases. Additionally, the language set used for the previous version of ChatGPT was only current up to 2021 (Radford et al., 2023). Each individual AI product is expected to update and expand on its own timeline, according to its parent company. AI products pull their responses from their language training sets and libraries, not from the entire existing internet. This means that the current set of AI apps is restricted in its output to whatever subset of sources was available as input.
Natural language generation (NLG):
It's important to note that ChatGPT and many other LLM AI models are trained as much to produce natural, human-sounding language as they are to retrieve information. In terms of developing an AI algorithm, this can lead to gaps in what is called alignment: the degree to which an AI performs tasks as humans expect or need it to (Strickland, 2023). While ChatGPT has well-developed NLG capabilities, it still does not perform search tasks in alignment with what humans expect from a search function; its failures include misquoting sources and hallucinating (making things up). In fact, the NLG performs so much better than people expect that they attribute far higher levels of accuracy to the responses than verification studies have measured.
AI and accuracy or "hallucinations":
Because ChatGPT and other LLMs are newer technology, there has not been time to build a substantial body of research literature on accuracy or hallucination rates. However, here are a few recent example studies:
Bhattacharyya et al. (2023) found high rates of fabricated and inaccurate references in ChatGPT-generated medical content.
Buchanan, Hill, and Shapoval (2023) documented ChatGPT hallucinating non-existent citations in response to economics questions.
Gravel et al. (2023) reported limited responses and fabricated references from ChatGPT for medical questions.
Walters and Wilder (2023) measured fabrication and errors in the bibliographic citations generated by ChatGPT.
Studies on accuracy and hallucination rates will evolve along with AI products. Currently, the accuracy of cited information remains a major issue in AI-generated information.
Black Box issues:
As mentioned above, all current AI LLMs are proprietary software owned by private companies. This means that their generative algorithms are trade secrets. Neither OpenAI nor any of the other major AI LLM developers have open-sourced their specific algorithms. All information on the datasets used for LLMs has been voluntarily released by the corporate owners, and only to the degree they choose. For any particular AI tool, the actual sources used to train or provide datasets may be partially or wholly unknown.
Additionally, current AI LLMs rely on specific prompts to generate information. If the user does not prompt the AI to provide a specific source or citation, generally none will be provided (Walters & Wilder, 2023). These two barriers represent a significant obstacle for the user in identifying and finding the source of an idea, claim, or statement in an AI-generated text.
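As a minimal sketch of this prompting behavior (assuming the OpenAI Python SDK and an API key in the environment; the model name and prompt wording here are illustrative, not taken from any study cited above), a request can explicitly demand citations, though anything returned must still be verified:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": (
            "List two peer-reviewed studies on ChatGPT citation accuracy. "
            "Give a full citation with DOI for each, and reply "
            "'no source available' rather than inventing one."
        ),
    }],
)

# The model's answer, including any citations it supplies. Per the
# studies cited above, these citations may be fabricated and must be
# independently verified before use.
print(response.choices[0].message.content)
```

Without the explicit instruction in the prompt, the same request would typically return uncited prose; even with it, the citations are generated text, not database lookups.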
Current AI chatbot and text generator models do not pull materials from the entire internet or from most paid-subscription academic databases. Therefore, the information available to AI products is a far smaller, and less academic, set of sources than most academic libraries provide access to. Some AI chatbots may incorporate academic database sources in the future; however, there is no way to know whether the AI parent companies will disclose the extent or the specific sources.
Some AI products currently scrape or pull publicly available sources for academic materials. These sources could be anything from an article on an individual's professional website, to an open-source academic repository, to openly published articles, to predatory publishing sources, to non-academic outlets publishing articles, papers, and opinion pieces. Without access to an AI's datasets, it is not possible to know which generated text came from which source. Complicating this lack of transparency is the possibility that the AI might attach a hallucinated citation to a non-academic text, excerpt, or source.
An additional consideration is the interaction of information literacy standards and academic integrity policies, which may vary from institution to institution. Some institutions even leave the determination of allowable AI usage to individual instructors as an expression of academic freedom. This can mean that librarians may need to establish guidelines and guides that take into account the evolving status of AI as an acceptable or unacceptable source under academic standards.
Most of the common credible-source evaluation methods used by academic libraries, such as CRAAP, TRAAP, PROVEN, or the 5 W's, focus on the specific source of a text or claim. Use of these models requires the researcher to analyze the source by its characteristic details (date of publication, academic affiliation, author credentials, etc.). Without a specific and accurate citation to a source, these models have no way to establish a source's credibility.
Even where citations are provided, the accuracy rate of a particular AI product may not be high enough to provide reliable credibility without the secondary step of verifying and then evaluating the sources individually. Researchers Hall and McKee (2024) caution that the accuracy issues of AI outputs are analogous to misinformation concerns on social media, and they provide the following advice about current AI models:
"It’s important that we never to entrust ChatGPT with too much responsibility or credibility. Fact-checking and proofreading every generative AI output is crucial–even for data outputs. Users should always take responsibility for the accuracy and reliability of their work." (Hall & McKee, 2024).
Other authors cited here (Walters & Wilder, 2023; Gravel et al., 2023) have suggested similar caution when using AI tools to provide sources or analytical results.
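As one hedged illustration of that verification step (assuming Python with the `requests` package and the public Crossref REST API; the function name is our own), a researcher can at least check whether a cited DOI resolves to a real record before evaluating the source itself:

```python
import requests

def verify_doi(doi: str) -> bool:
    """Check whether a DOI resolves to a record in Crossref.

    A 404 often signals a fabricated citation, although a missing
    record alone does not prove fabrication (not every work is
    registered with Crossref).
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return False
    record = resp.json()["message"]
    # Show the registered title so it can be compared against the
    # title the chatbot supplied.
    titles = record.get("title") or ["<no title>"]
    print(titles[0])
    return True

# Example: the DOI from the Buchanan et al. (2023) entry in the references
print(verify_doi("10.1177/05694345231218454"))
```

Resolving a DOI is only the first step; the verified source must still be evaluated on its own merits with a tool such as CRAAP.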
The LibrAIry has created the ROBOT test, a framework for evaluating specific AI tools that can help you determine which ones best meet your needs.
Reliability
Objective
Bias
Ownership
Type
The ROBOT test is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Many of us are familiar with information literacy tests used to assess the accuracy and credibility of resources, such as the CRAAP Test. Analogously, it is important for users of generative AI technologies to develop the skills to effectively assess both their own inputs (a practice known as "prompt engineering") and the model's corresponding outputs.
Below, we highlight two frameworks to consider: the ROBOT Test and the CLEAR Framework.
In March 2020, Sandy Hervieux and Amanda Wheatley published a blog post titled "The ROBOT Test" which contains a tool to assess the legitimacy of AI technologies.
There are five factors, which are detailed in-depth within their post: Reliability; Objective; Bias; Ownership; Type. Holistically, these help users think about the inputs, outputs, environmental influences, and authority of an AI application.
In July 2023, Leo Lo published a journal article titled "The CLEAR path: A framework for enhancing information literacy through prompt engineering" which details a framework to optimize interactions with AI language models.
There are five factors, detailed in-depth within the article: Concise; Logical; Explicit; Adaptive; Reflective. Holistically, these help users develop critical thinking skills surrounding the usage of generative AI, and they help instructors enhance their practices around information and digital literacy instruction.
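As an illustration only (the prompt wording below is our own and is not taken from Lo's article), a vague prompt can be revised along CLEAR lines:

```python
# A vague prompt, likely to produce an unfocused, uncited answer:
vague_prompt = "Tell me about AI and libraries."

# The same request revised along CLEAR lines (wording is illustrative):
#   Concise    - one task, no filler
#   Logical    - steps in the order they should be performed
#   Explicit   - output length, format, and sourcing stated up front
#   Adaptive   - invites refinement in a follow-up turn
#   Reflective - the user reviews the output and re-prompts as needed
clear_prompt = (
    "In no more than 200 words, describe two ways academic libraries "
    "use generative AI. For each, first define the use case, then give "
    "one documented example with a citation. If you cannot cite a real "
    "source, say so rather than inventing one. I will ask follow-up "
    "questions about anything unclear."
)
```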
References:
Bhattacharyya, M., Miller, V. M., Bhattacharyya, D., & Miller, L. E. (2023). High rates of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus, 15(5), e39238. https://doi.org/10.7759/cureus.39238
Buchanan, J., Hill, S., & Shapoval, O. (2023). ChatGPT hallucinates non-existent citations: Evidence from economics. The American Economist, 0(0). https://doi.org/10.1177/05694345231218454
Gravel, J., D'Amours-Gravel, M., & Osmanlliu, E. (2023). Learning to fake it: Limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clinic Proceedings: Digital Health, 1(3), 226-234. https://doi.org/10.1016/j.mcpdig.2023.05.004
Hall, B., & McKee, J. (2024). An early or somewhat late ChatGPT guide for librarians. Journal of Business & Finance Librarianship, 29(1), 58-69. https://doi.org/10.1080/08963568.2024.2303944
James, A. B., & Filgo, E. H. (2023). Where does ChatGPT fit into the Framework for Information Literacy? The possibilities and problems of AI in library instruction. College & Research Libraries News, 84(9), 334.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2023). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Strickland, E. (2023, August 31). OpenAI's moonshot: Solving the AI alignment problem. IEEE Spectrum. https://spectrum.ieee.org/the-alignment-problem-openai
van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R., & Bockting, C. L. (2023). ChatGPT: Five priorities for research. Nature, 614(7947), 224-226. https://doi.org/10.1038/d41586-023-00288-7
Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13, 14045.