from mistralai import Mistral
import sys
import os
from pathlib import Path
sys.path.insert(1, str(Path.cwd().parent.parent))
from src.fs import write_text_file
from src.mistral_ocr import MyMistralOCRpdf
9 Mistral OCR (Usage)
9.1 Single PDF
9.1.1 Mistral Example
my_mistral_ocr = MyMistralOCRpdf("docs/mistral7b.pdf", image_folder_path="img/mistral7b")
my_mistral_ocr
MyMistralOCR(file_path='docs/mistral7b.pdf', image_folder_path='img/mistral7b', ocr_status='not processed', ocr_response=None)
my_mistral_ocr.ocr_pdf()
OCRResponse(pages=[OCRPageObject(index=0, markdown='# Mistral 7B \n\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed\n\n\n\n\n#### Abstract\n\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/\n\n\n## 1 Introduction\n\nIn the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, thereby raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models delivering both high-level performance and efficiency becomes critically essential. Our model, Mistral 7B, demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference. 
Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks.\n\nMistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3]. GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.', images=[OCRImageObject(id='img-0.jpeg', top_left_x=425, top_left_y=600, bottom_right_x=1283, bottom_right_y=893, image_base64='')], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=1, markdown='Mistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation ${ }^{1}$ facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot ${ }^{2}$. Integration with Hugging Face ${ }^{3}$ is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B - Chat model.\n\nMistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. 
Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications.\n\n# 2 Architectural details \n\n\n\nFigure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention: each token can attend to at most $W$ tokens from the previous layer (here, $W=3$ ). Note that tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \\times W$ tokens.\n\nMistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.\nSliding Window Attention. SWA exploits the stacked layers of a transformer to attend information beyond the window size $W$. The hidden state in position $i$ of the layer $k, h_{i}$, attends to all hidden states from the previous layer with positions between $i-W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $W \\times k$ tokens, as illustrated in Figure 1. At the last layer, using a window size of $W=4096$, we have a theoretical attention span of approximately $131 K$ tokens. 
In practice, for a sequence length of 16 K and $W=4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.\n\n| Parameter | Value |\n| :-- | --: |\n| dim | 4096 |\n| n_layers | 32 |\n| head_dim | 128 |\n| hidden_dim | 14336 |\n| n_heads | 32 |\n| n_kv_heads | 8 |\n| window_size | 4096 |\n| context_len | 8192 |\n| vocab_size | 32000 |\n\nTable 1: Model architecture.\n\nRolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for the timestep $i$ are stored in position $i \\bmod W$ of the cache. As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing. We provide an illustration in Figure 2 for $W=3$. On a sequence length of 32 k tokens, this reduces the cache memory usage by 8 x , without impacting the model quality.\n\n[^0]\n[^0]: ${ }^{1}$ https://github.com/mistralai/mistral-src\n ${ }^{2}$ https://github.com/skypilot-org/skypilot\n ${ }^{3}$ https://huggingface.co/mistralai', images=[OCRImageObject(id='img-1.jpeg', top_left_x=294, top_left_y=638, bottom_right_x=1405, bottom_right_y=1064, image_base64='')], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=2, markdown='\n\nFigure 2: Rolling buffer cache. The cache has a fixed size of $W=4$. Keys and values for position $i$ are stored in position $i \\bmod W$ of the cache. When the position $i$ is larger than $W$, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange.\n\nPre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the $(k, v)$ cache with the prompt. 
If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk. Figure 3 shows how the attention mask works over both the cache and the chunk.\n\n| the | The cat sat on the mat and saw the dog go to | | | | | | | | | | | |\n| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |\n| dog | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |\n| go | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |\n| to | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |\n| | Past | | | | | | | | | | | |\n\nFigure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, "The cat sat on", "the mat and saw", "the dog go to". The figure shows what happens for the third chunk ("the dog go to"): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of the sliding window (left block).\n\n# 3 Results \n\nWe compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. 
We measure performance on a wide variety of tasks categorized as follow:\n\n- Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]\n- World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]\n- Reading Comprehension (0-shot): BoolQ [8], QuAC [7]\n- Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4\n- Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot)\n- Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)\n\nDetailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama $134 B^{4}$ in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.\n\n[^0]\n[^0]: ${ }^{4}$ Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.', images=[OCRImageObject(id='img-2.jpeg', top_left_x=294, top_left_y=191, bottom_right_x=1405, bottom_right_y=380, image_base64='')], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=3, markdown='\n\nFigure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 7B and Llama 2 13B on all benchmarks. 
It is also vastly superior to Llama 1 34B in mathematics, code generation, and reasoning benchmarks.\n\n| Model | Modality | MMLU | HellaSwag | WinoG | PIQA | Arc-e | Arc-c | NQ | TriviaQA | HumanEval | MBPP | MATH | GSM8K |\n| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| LLaMA 2 7B | Pretrained | $44.4 \\%$ | $77.1 \\%$ | $69.5 \\%$ | $77.9 \\%$ | $68.7 \\%$ | $43.2 \\%$ | $24.7 \\%$ | $63.8 \\%$ | $11.6 \\%$ | $26.1 \\%$ | $3.9 \\%$ | $16.0 \\%$ |\n| LLaMA 2 13B | Pretrained | $55.6 \\%$ | $\\mathbf{8 0 . 7 \\%}$ | $72.9 \\%$ | $80.8 \\%$ | $75.2 \\%$ | $48.8 \\%$ | $\\mathbf{2 9 . 0 \\%}$ | $\\mathbf{6 9 . 6 \\%}$ | $18.9 \\%$ | $35.4 \\%$ | $6.0 \\%$ | $34.3 \\%$ |\n| Code-Llama 7B | Finetuned | $36.9 \\%$ | $62.9 \\%$ | $62.3 \\%$ | $72.8 \\%$ | $59.4 \\%$ | $34.5 \\%$ | $11.0 \\%$ | $34.9 \\%$ | $\\mathbf{3 1 . 1 \\%}$ | $\\mathbf{5 2 . 5 \\%}$ | $5.2 \\%$ | $20.8 \\%$ |\n| Mistral 7B | Pretrained | $\\mathbf{6 0 . 1 \\%}$ | $\\mathbf{8 1 . 3 \\%}$ | $\\mathbf{7 5 . 3 \\%}$ | $\\mathbf{8 3 . 0 \\%}$ | $\\mathbf{8 0 . 0 \\%}$ | $\\mathbf{5 5 . 5 \\%}$ | $\\mathbf{2 8 . 8 \\%}$ | $\\mathbf{6 9 . 9 \\%}$ | $\\mathbf{3 0 . 5 \\%}$ | $47.5 \\%$ | $\\mathbf{1 3 . 1 \\%}$ | $\\mathbf{5 2 . 2 \\%}$ |\n\nTable 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.\n\nSize and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models\' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. 
On the Knowledge benchmarks, Mistral 7B\'s performance achieves a lower compression rate of 1.9 x , which is likely due to its limited parameter count that restricts the amount of knowledge it can store.\n\nEvaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.\n\n## 4 Instruction Finetuning\n\nTo evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B - Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance. In Table 3, we observe that the resulting model, Mistral 7B - Instruct, exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B - Chat models. An independent human evaluation was conducted on https://limboxing.com/leaderboard.\n\n| Model | Chatbot Arena <br> ELO Rating | MT Bench |\n| :-- | :--: | :--: |\n| WizardLM 13B v1.2 | 1047 | 7.2 |\n| Mistral 7B Instruct | $\\mathbf{1 0 3 1}$ | $\\mathbf{6 . 8 4}$ +/- $\\mathbf{0 . 0 7}$ |\n| Llama 2 13B Chat | 1012 | 6.65 |\n| Vicuna 13B | 1041 | 6.57 |\n| Llama 2 7B Chat | 985 | 6.27 |\n| Vicuna 7B | 997 | 6.17 |\n| Alpaca 13B | 914 | 4.53 |\n\nTable 3: Comparison of Chat models. Mistral 7B Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B - Chat models.\n\nIn this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. 
As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.', images=[OCRImageObject(id='img-3.jpeg', top_left_x=292, top_left_y=204, bottom_right_x=1390, bottom_right_y=552, image_base64='')], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=4, markdown='\n\nFigure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B). Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which limits the amount of knowledge it can compress).\n\n# 5 Adding guardrails for front-facing applications \n\nThe ability to enforce guardrails when it comes to AI generation is important for front-facing applications. In this section, we highlight how to leverage system prompting to optionally enforce output constraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform fine-grained content moderation, which can be useful to enforce quality content in applications.\n\n### 5.1 System prompt to enforce guardrails\n\nWe introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the Pareto front of model utility / guardrails enforcement, as indicated in Table 4.\n\nAlways assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\n\nWe use a set of 175 unsafe prompts for evaluating safety. 
With the recommended system prompt, the model properly declines to answer $100 \\%$ of the harmful questions.\nAs an illustration, we provide in Table 5 the answers of both Mistral 7B - Instruct and Llama 2 Chat 13B to the question How to kill a linux process with system prompts activated. We observe that Mistral 7B provides a correct response while Llama 2 declines to answer. Note that on this specific question, both models answer correctly when system prompts are deactivated.\n\n| Guardrails | MT Bench |\n| :-- | :--: |\n| No system prompt | $6.84 \\pm 0.07$ |\n| Llama 2 system prompt | $6.38 \\pm 0.07$ |\n| Mistral system prompt | $6.58 \\pm 0.05$ |\n\nTable 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation for Mistral 7B - Instruct. For reference, Llama 2 13B - Chat reports official results of 6.65 .\n\n### 5.2 Content moderation with self-reflection\n\nMistral 7B - Instruct can be used as a content moderator: the model itself is able to accurately classify a user prompt or its generated answer as being either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content such as discrimination, self-harm or bullying; Unqualified advice for instance in legal, medical or financial domains.', images=[OCRImageObject(id='img-4.jpeg', top_left_x=464, top_left_y=202, bottom_right_x=1232, bottom_right_y=734, image_base64='')], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=5, markdown="| Model | Answer |\n| :--: | :--: |\n| Mistral 7B - Instruct with Mistral system prompt | To kill a Linux process, you can use the `kill' command followed by the process ID (PID) of the process you want to terminate. For example, to kill process with PID 1234, you would run the command `kill 1234`. 
It's important to note that killing a process can have unintended consequences, so it's generally a good idea to only kill processes that you are certain you want to terminate. Additionally, it's generally a good idea to use the 'kill' command with caution, as it can be used to terminate critical system processes that are necessary for the operation of the system. |\n| Llama 2 13B - Chat with Llama 2 system prompt | As a helpful and respectful assistant, I cannot provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally harm or terminate processes, as it can cause damage to the system and potentially result in data loss. Additionally, it is important to respect the autonomy and security of other users' processes, as they may be critical to their work or personal activities. Instead, I suggest that you explore other options to address any issues or concerns you may have. |\n\nTable 5: Comparison between Mistral and Llama system prompts. Mistral provides the right answer whereas Llama 2 declines to answer to the question How to kill a linux process.\n\nTo do so, we designed a self-reflection prompt that makes Mistral 7B classify a prompt or a generated answer. We evaluated self-reflection on our manually curated and balanced dataset of adversarial and standard prompts and got a precision of $99.4 \\%$ for a recall of $95.6 \\%$ (considering acceptable prompts as positives).\n\nThe use cases are vast, from moderating comments on social media or forums to brand monitoring on the internet. In particular, the end user is able to select afterwards which categories to effectively filter based on their particular use-case.\n\n# 6 Conclusion \n\nOur work on Mistral 7B demonstrates that language models may compress knowledge more than what was previously thought. 
This opens up interesting perspectives: the field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost, as in [14]); the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.\n\n## Acknowledgements\n\nWe are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help. We thank the maintainers of FlashAttention, vLLM, xFormers, Skypilot for their precious assistance in implementing new features and integrating their solutions into ours. A huge thanks to Tri Dao and Daniel Haziza for helping include Mistral related changes to FlashAttention and xFormers on a tight schedule. We thank the teams of Hugging Face, AWS, GCP, Azure ML for their intense help in making our model compatible everywhere.", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=6, markdown='# Can you recommend a foundational and accessible book about quantum physics? \n\nCapy Link to Question Read Round (Enter) $\\rightarrow$\n\nRound Winner: mistral-7b-instruct-v0.1\n\n\n## mistral-7b-instruct-v0.1 $\\gg$ Round Winner $\\gg$\n\nA great book for a foundational understanding of quantum physics is "The Quantum Universe: Everything that Can Happen Does Happen" by Brian Cox and Jeff Forshaw. This book is accessible and easy to understand, making it a great choice for beginners. It covers all the basics of quantum mechanics, including superposition, entanglement, and the uncertainty principle. The authors also explore the implications of quantum physics for our understanding of the universe, including the nature of time and space. 
Overall, "The Quantum Universe" is an excellent choice for anyone looking to gain a foundational understanding of this fascinating field\n\n## llama-2-13b-chat\n\nCertainly! For a foundational and accessible introduction to quantum physics, I highly recommend "The Feynman Lectures on Physics" by Richard Feynman. This classic text provides a clear and engaging overview of the principles and concepts of quantum mechanics, with minimal mathematical formalism and plenty of illustrations and examples. It\'s a great starting point for anyone looking to gain a deeper understanding of quantum physics without getting bogged down in too much technical detail.\n\nFigure 6: Human evaluation of Mistral 7B - Instruct vs Llama 2 13B - Chat Example. An example of human evaluation from libboxing.com. The question asks for recommendations of books in quantum physics. Llama 2 13B - Chat recommends a general physics book, while Mistral 7B - Instruct recommends a more relevant book on quantum physics and describes in the contents in more detail.', images=[OCRImageObject(id='img-5.jpeg', top_left_x=727, top_left_y=794, bottom_right_x=975, bottom_right_y=1047, image_base64='')], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=7, markdown='# References \n\n[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.\n[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.\n[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.\n[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 
Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.\n[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.\n[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.\n[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.\n[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.\n[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.\n[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.\n[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.\n[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.\n[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 
Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.\n[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, volume 35, 2022.\n[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.\n[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019.', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=8, markdown='[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.\n[18] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/ facebookresearch/xformers, 2022.\n[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? 
a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.\n[20] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.\n[21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.\n[22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.\n[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.\n[24] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.\n[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.\n[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.\n[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.\n[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 
Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.\n[29] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700))], model='mistral-ocr-2503-completion', usage_info=OCRUsageInfo(pages_processed=9, doc_size_bytes=3749788))
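The response printed above holds one entry per page, each carrying `index`, `markdown`, `images`, and `dimensions` fields. As a minimal sketch of how such per-page markdown could be stitched into a single document string, the snippet below uses a hypothetical `Page` stand-in and a hypothetical `join_pages` helper; neither is part of the `mistralai` package or of `MyMistralOCRpdf`, and the blank-line separator is an assumption rather than a documented behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for one page entry of the OCR response."""
    index: int
    markdown: str
    images: list = field(default_factory=list)


def join_pages(pages, separator="\n\n"):
    """Concatenate per-page markdown in ascending page order.

    The separator is an assumption: a blank line between pages.
    """
    ordered = sorted(pages, key=lambda page: page.index)
    return separator.join(page.markdown for page in ordered)


# Pages given out of order on purpose, to show that sorting by index matters.
pages = [
    Page(index=1, markdown="Second page text."),
    Page(index=0, markdown="# Title\n\nFirst page text."),
]
document = join_pages(pages)
print(document)
```

Sorting by `index` before joining keeps the result stable even if the pages arrive in a different order.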
my_mistral_ocr.to_markdown()
'# Mistral 7B \n\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed\n\n\n\n\n#### Abstract\n\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/\n\n\n## 1 Introduction\n\nIn the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, thereby raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models delivering both high-level performance and efficiency becomes critically essential. Our model, Mistral 7B, demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference. Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. 
Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks.\n\nMistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3]. GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.\n\nMistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation ${ }^{1}$ facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot ${ }^{2}$. Integration with Hugging Face ${ }^{3}$ is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B - Chat model.\n\nMistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications.\n\n# 2 Architectural details \n\n\n\nFigure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. 
To alleviate this issue, we use sliding window attention: each token can attend to at most $W$ tokens from the previous layer (here, $W=3$ ). Note that tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \\times W$ tokens.\n\nMistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.\nSliding Window Attention. SWA exploits the stacked layers of a transformer to attend information beyond the window size $W$. The hidden state in position $i$ of the layer $k, h_{i}$, attends to all hidden states from the previous layer with positions between $i-W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $W \\times k$ tokens, as illustrated in Figure 1. At the last layer, using a window size of $W=4096$, we have a theoretical attention span of approximately $131 K$ tokens. In practice, for a sequence length of 16 K and $W=4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.\n\n| Parameter | Value |\n| :-- | --: |\n| dim | 4096 |\n| n_layers | 32 |\n| head_dim | 128 |\n| hidden_dim | 14336 |\n| n_heads | 32 |\n| n_kv_heads | 8 |\n| window_size | 4096 |\n| context_len | 8192 |\n| vocab_size | 32000 |\n\nTable 1: Model architecture.\n\nRolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for the timestep $i$ are stored in position $i \\bmod W$ of the cache. As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing. We provide an illustration in Figure 2 for $W=3$. 
On a sequence length of 32 k tokens, this reduces the cache memory usage by 8 x , without impacting the model quality.\n\n[^0]\n[^0]: ${ }^{1}$ https://github.com/mistralai/mistral-src\n ${ }^{2}$ https://github.com/skypilot-org/skypilot\n ${ }^{3}$ https://huggingface.co/mistralai\n\n\n\nFigure 2: Rolling buffer cache. The cache has a fixed size of $W=4$. Keys and values for position $i$ are stored in position $i \\bmod W$ of the cache. When the position $i$ is larger than $W$, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange.\n\nPre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the $(k, v)$ cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk. Figure 3 shows how the attention mask works over both the cache and the chunk.\n\n| the | The cat sat on the mat and saw the dog go to | | | | | | | | | | | |\n| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |\n| dog | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |\n| go | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |\n| to | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |\n| | Past | | | | | | | | | | | |\n\nFigure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, "The cat sat on", "the mat and saw", "the dog go to". 
The figure shows what happens for the third chunk ("the dog go to"): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of the sliding window (left block).\n\n# 3 Results \n\nWe compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follow:\n\n- Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]\n- World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]\n- Reading Comprehension (0-shot): BoolQ [8], QuAC [7]\n- Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4\n- Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot)\n- Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)\n\nDetailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama $134 B^{4}$ in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.\n\n[^0]\n[^0]: ${ }^{4}$ Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.\n\n\n\nFigure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 7B and Llama 2 13B on all benchmarks. 
It is also vastly superior to Llama 1 34B in mathematics, code generation, and reasoning benchmarks.\n\n| Model | Modality | MMLU | HellaSwag | WinoG | PIQA | Arc-e | Arc-c | NQ | TriviaQA | HumanEval | MBPP | MATH | GSM8K |\n| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| LLaMA 2 7B | Pretrained | $44.4 \\%$ | $77.1 \\%$ | $69.5 \\%$ | $77.9 \\%$ | $68.7 \\%$ | $43.2 \\%$ | $24.7 \\%$ | $63.8 \\%$ | $11.6 \\%$ | $26.1 \\%$ | $3.9 \\%$ | $16.0 \\%$ |\n| LLaMA 2 13B | Pretrained | $55.6 \\%$ | $\\mathbf{8 0 . 7 \\%}$ | $72.9 \\%$ | $80.8 \\%$ | $75.2 \\%$ | $48.8 \\%$ | $\\mathbf{2 9 . 0 \\%}$ | $\\mathbf{6 9 . 6 \\%}$ | $18.9 \\%$ | $35.4 \\%$ | $6.0 \\%$ | $34.3 \\%$ |\n| Code-Llama 7B | Finetuned | $36.9 \\%$ | $62.9 \\%$ | $62.3 \\%$ | $72.8 \\%$ | $59.4 \\%$ | $34.5 \\%$ | $11.0 \\%$ | $34.9 \\%$ | $\\mathbf{3 1 . 1 \\%}$ | $\\mathbf{5 2 . 5 \\%}$ | $5.2 \\%$ | $20.8 \\%$ |\n| Mistral 7B | Pretrained | $\\mathbf{6 0 . 1 \\%}$ | $\\mathbf{8 1 . 3 \\%}$ | $\\mathbf{7 5 . 3 \\%}$ | $\\mathbf{8 3 . 0 \\%}$ | $\\mathbf{8 0 . 0 \\%}$ | $\\mathbf{5 5 . 5 \\%}$ | $\\mathbf{2 8 . 8 \\%}$ | $\\mathbf{6 9 . 9 \\%}$ | $\\mathbf{3 0 . 5 \\%}$ | $47.5 \\%$ | $\\mathbf{1 3 . 1 \\%}$ | $\\mathbf{5 2 . 2 \\%}$ |\n\nTable 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.\n\nSize and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models\' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. 
On the Knowledge benchmarks, Mistral 7B\'s performance achieves a lower compression rate of 1.9 x , which is likely due to its limited parameter count that restricts the amount of knowledge it can store.\n\nEvaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.\n\n## 4 Instruction Finetuning\n\nTo evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B - Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance. In Table 3, we observe that the resulting model, Mistral 7B - Instruct, exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B - Chat models. An independent human evaluation was conducted on https://limboxing.com/leaderboard.\n\n| Model | Chatbot Arena <br> ELO Rating | MT Bench |\n| :-- | :--: | :--: |\n| WizardLM 13B v1.2 | 1047 | 7.2 |\n| Mistral 7B Instruct | $\\mathbf{1 0 3 1}$ | $\\mathbf{6 . 8 4}$ +/- $\\mathbf{0 . 0 7}$ |\n| Llama 2 13B Chat | 1012 | 6.65 |\n| Vicuna 13B | 1041 | 6.57 |\n| Llama 2 7B Chat | 985 | 6.27 |\n| Vicuna 7B | 997 | 6.17 |\n| Alpaca 13B | 914 | 4.53 |\n\nTable 3: Comparison of Chat models. Mistral 7B Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B - Chat models.\n\nIn this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. 
As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.\n\n\n\nFigure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B). Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which limits the amount of knowledge it can compress).\n\n# 5 Adding guardrails for front-facing applications \n\nThe ability to enforce guardrails when it comes to AI generation is important for front-facing applications. In this section, we highlight how to leverage system prompting to optionally enforce output constraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform fine-grained content moderation, which can be useful to enforce quality content in applications.\n\n### 5.1 System prompt to enforce guardrails\n\nWe introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the Pareto front of model utility / guardrails enforcement, as indicated in Table 4.\n\nAlways assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\n\nWe use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt, the model properly declines to answer $100 \\%$ of the harmful questions.\nAs an illustration, we provide in Table 5 the answers of both Mistral 7B - Instruct and Llama 2 Chat 13B to the question How to kill a linux process with system prompts activated. We observe that Mistral 7B provides a correct response while Llama 2 declines to answer. 
Note that on this specific question, both models answer correctly when system prompts are deactivated.\n\n| Guardrails | MT Bench |\n| :-- | :--: |\n| No system prompt | $6.84 \\pm 0.07$ |\n| Llama 2 system prompt | $6.38 \\pm 0.07$ |\n| Mistral system prompt | $6.58 \\pm 0.05$ |\n\nTable 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation for Mistral 7B - Instruct. For reference, Llama 2 13B - Chat reports official results of 6.65 .\n\n### 5.2 Content moderation with self-reflection\n\nMistral 7B - Instruct can be used as a content moderator: the model itself is able to accurately classify a user prompt or its generated answer as being either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content such as discrimination, self-harm or bullying; Unqualified advice for instance in legal, medical or financial domains.\n\n| Model | Answer |\n| :--: | :--: |\n| Mistral 7B - Instruct with Mistral system prompt | To kill a Linux process, you can use the `kill\' command followed by the process ID (PID) of the process you want to terminate. For example, to kill process with PID 1234, you would run the command `kill 1234`. It\'s important to note that killing a process can have unintended consequences, so it\'s generally a good idea to only kill processes that you are certain you want to terminate. Additionally, it\'s generally a good idea to use the \'kill\' command with caution, as it can be used to terminate critical system processes that are necessary for the operation of the system. |\n| Llama 2 13B - Chat with Llama 2 system prompt | As a helpful and respectful assistant, I cannot provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally harm or terminate processes, as it can cause damage to the system and potentially result in data loss. 
Additionally, it is important to respect the autonomy and security of other users\' processes, as they may be critical to their work or personal activities. Instead, I suggest that you explore other options to address any issues or concerns you may have. |\n\nTable 5: Comparison between Mistral and Llama system prompts. Mistral provides the right answer whereas Llama 2 declines to answer to the question How to kill a linux process.\n\nTo do so, we designed a self-reflection prompt that makes Mistral 7B classify a prompt or a generated answer. We evaluated self-reflection on our manually curated and balanced dataset of adversarial and standard prompts and got a precision of $99.4 \\%$ for a recall of $95.6 \\%$ (considering acceptable prompts as positives).\n\nThe use cases are vast, from moderating comments on social media or forums to brand monitoring on the internet. In particular, the end user is able to select afterwards which categories to effectively filter based on their particular use-case.\n\n# 6 Conclusion \n\nOur work on Mistral 7B demonstrates that language models may compress knowledge more than what was previously thought. This opens up interesting perspectives: the field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost, as in [14]); the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.\n\n## Acknowledgements\n\nWe are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help. We thank the maintainers of FlashAttention, vLLM, xFormers, Skypilot for their precious assistance in implementing new features and integrating their solutions into ours. 
A huge thanks to Tri Dao and Daniel Haziza for helping include Mistral related changes to FlashAttention and xFormers on a tight schedule. We thank the teams of Hugging Face, AWS, GCP, Azure ML for their intense help in making our model compatible everywhere.\n\n# Can you recommend a foundational and accessible book about quantum physics? \n\nCapy Link to Question Read Round (Enter) $\\rightarrow$\n\nRound Winner: mistral-7b-instruct-v0.1\n\n\n## mistral-7b-instruct-v0.1 $\\gg$ Round Winner $\\gg$\n\nA great book for a foundational understanding of quantum physics is "The Quantum Universe: Everything that Can Happen Does Happen" by Brian Cox and Jeff Forshaw. This book is accessible and easy to understand, making it a great choice for beginners. It covers all the basics of quantum mechanics, including superposition, entanglement, and the uncertainty principle. The authors also explore the implications of quantum physics for our understanding of the universe, including the nature of time and space. Overall, "The Quantum Universe" is an excellent choice for anyone looking to gain a foundational understanding of this fascinating field\n\n## llama-2-13b-chat\n\nCertainly! For a foundational and accessible introduction to quantum physics, I highly recommend "The Feynman Lectures on Physics" by Richard Feynman. This classic text provides a clear and engaging overview of the principles and concepts of quantum mechanics, with minimal mathematical formalism and plenty of illustrations and examples. It\'s a great starting point for anyone looking to gain a deeper understanding of quantum physics without getting bogged down in too much technical detail.\n\nFigure 6: Human evaluation of Mistral 7B - Instruct vs Llama 2 13B - Chat Example. An example of human evaluation from libboxing.com. The question asks for recommendations of books in quantum physics. 
Llama 2 13B - Chat recommends a general physics book, while Mistral 7B - Instruct recommends a more relevant book on quantum physics and describes in the contents in more detail.\n\n# References \n\n[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.\n[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.\n[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.\n[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.\n[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.\n[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.\n[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.\n[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.\n[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? 
try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.\n[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.\n[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.\n[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.\n[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.\n[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, volume 35, 2022.\n[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.\n[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. 
Transactions of the Association for Computational Linguistics, 7:453-466, 2019.\n\n[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.\n[18] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/ facebookresearch/xformers, 2022.\n[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.\n[20] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.\n[21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.\n[22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.\n[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.\n[24] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. 
arXiv preprint arXiv:1811.00937, 2018.\n[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.\n[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.\n[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.\n[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.\n[29] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.'
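The value returned by `to_markdown()` is shown above as a Python string repr, which is why line breaks appear as `\n`. Writing the string to a file gives a Markdown document that renders normally. The following is a minimal sketch using only the standard library; `save_markdown` and the output path are hypothetical names chosen for illustration and are not part of `MyMistralOCRpdf` or `src.fs`.

```python
from pathlib import Path


def save_markdown(markdown_text: str, output_path: str) -> Path:
    """Write a Markdown string to output_path as UTF-8 and return the path.

    Hypothetical helper for illustration; not part of the OCR wrapper.
    """
    path = Path(output_path)
    # Create the parent folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # UTF-8 keeps accented author names (e.g. "Lélio") intact.
    path.write_text(markdown_text, encoding="utf-8")
    return path
```

Assuming `to_markdown()` returns a plain `str` as the output above suggests, a call such as `save_markdown(my_mistral_ocr.to_markdown(), "out/mistral7b.md")` would produce a file that any Markdown viewer can open.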
my_mistral_ocr.save_images()
my_mistral_ocr.to_markdown_with_inline_img()
'# Mistral 7B \n\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed\n\n\n\n\n#### Abstract\n\nWe introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license. Code: https://github.com/mistralai/mistral-src Webpage: https://mistral.ai/news/announcing-mistral-7b/\n\n\n## 1 Introduction\n\nIn the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model performance often necessitates an escalation in model size. However, this scaling tends to increase computational costs and inference latency, thereby raising barriers to deployment in practical, real-world scenarios. In this context, the search for balanced models delivering both high-level performance and efficiency becomes critically essential. Our model, Mistral 7B, demonstrates that a carefully designed language model can deliver high performance while maintaining an efficient inference. Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. 
Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks.\n\nMistral 7B leverages grouped-query attention (GQA) [1], and sliding window attention (SWA) [6, 3]. GQA significantly accelerates the inference speed, and also reduces the memory requirement during decoding, allowing for higher batch sizes hence higher throughput, a crucial factor for real-time applications. In addition, SWA is designed to handle longer sequences more effectively at a reduced computational cost, thereby alleviating a common limitation in LLMs. These attention mechanisms collectively contribute to the enhanced performance and efficiency of Mistral 7B.\n\nMistral 7B is released under the Apache 2.0 license. This release is accompanied by a reference implementation ${ }^{1}$ facilitating easy deployment either locally or on cloud platforms such as AWS, GCP, or Azure using the vLLM [17] inference server and SkyPilot ${ }^{2}$. Integration with Hugging Face ${ }^{3}$ is also streamlined for easier integration. Moreover, Mistral 7B is crafted for ease of fine-tuning across a myriad of tasks. As a demonstration of its adaptability and superior performance, we present a chat model fine-tuned from Mistral 7B that significantly outperforms the Llama 2 13B - Chat model.\n\nMistral 7B takes a significant step in balancing the goals of getting high performance while keeping large language models efficient. Through our work, our aim is to help the community create more affordable, efficient, and high-performing language models that can be used in a wide range of real-world applications.\n\n# 2 Architectural details \n\n\n\nFigure 1: Sliding Window Attention. The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability. 
To alleviate this issue, we use sliding window attention: each token can attend to at most $W$ tokens from the previous layer (here, $W=3$ ). Note that tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by $W$ tokens. Hence, after $k$ attention layers, information can move forward by up to $k \\times W$ tokens.\n\nMistral 7B is based on a transformer architecture [27]. The main parameters of the architecture are summarized in Table 1. Compared to Llama, it introduces a few changes that we summarize below.\nSliding Window Attention. SWA exploits the stacked layers of a transformer to attend information beyond the window size $W$. The hidden state in position $i$ of the layer $k, h_{i}$, attends to all hidden states from the previous layer with positions between $i-W$ and $i$. Recursively, $h_{i}$ can access tokens from the input layer at a distance of up to $W \\times k$ tokens, as illustrated in Figure 1. At the last layer, using a window size of $W=4096$, we have a theoretical attention span of approximately $131 K$ tokens. In practice, for a sequence length of 16 K and $W=4096$, changes made to FlashAttention [11] and xFormers [18] yield a 2x speed improvement over a vanilla attention baseline.\n\n| Parameter | Value |\n| :-- | --: |\n| dim | 4096 |\n| n_layers | 32 |\n| head_dim | 128 |\n| hidden_dim | 14336 |\n| n_heads | 32 |\n| n_kv_heads | 8 |\n| window_size | 4096 |\n| context_len | 8192 |\n| vocab_size | 32000 |\n\nTable 1: Model architecture.\n\nRolling Buffer Cache. A fixed attention span means that we can limit our cache size using a rolling buffer cache. The cache has a fixed size of $W$, and the keys and values for the timestep $i$ are stored in position $i \\bmod W$ of the cache. As a result, when the position $i$ is larger than $W$, past values in the cache are overwritten, and the size of the cache stops increasing. We provide an illustration in Figure 2 for $W=3$. 
On a sequence length of 32 k tokens, this reduces the cache memory usage by 8 x , without impacting the model quality.\n\n[^0]\n[^0]: ${ }^{1}$ https://github.com/mistralai/mistral-src\n ${ }^{2}$ https://github.com/skypilot-org/skypilot\n ${ }^{3}$ https://huggingface.co/mistralai\n\n\n\nFigure 2: Rolling buffer cache. The cache has a fixed size of $W=4$. Keys and values for position $i$ are stored in position $i \\bmod W$ of the cache. When the position $i$ is larger than $W$, past values in the cache are overwritten. The hidden state corresponding to the latest generated tokens are colored in orange.\n\nPre-fill and Chunking. When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the $(k, v)$ cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this purpose, we can select the window size as our chunk size. For each chunk, we thus need to compute the attention over the cache and over the chunk. Figure 3 shows how the attention mask works over both the cache and the chunk.\n\n| the | The cat sat on the mat and saw the dog go to | | | | | | | | | | | |\n| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |\n| dog | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |\n| go | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 |\n| to | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |\n| | Past | | | | | | | | | | | |\n\nFigure 3: Pre-fill and chunking. During pre-fill of the cache, long sequences are chunked to limit memory usage. We process a sequence in three chunks, "The cat sat on", "the mat and saw", "the dog go to". 
The figure shows what happens for the third chunk ("the dog go to"): it attends itself using a causal mask (rightmost block), attends the cache using a sliding window (center block), and does not attend to past tokens as they are outside of the sliding window (left block).\n\n# 3 Results \n\nWe compare Mistral 7B to Llama, and re-run all benchmarks with our own evaluation pipeline for fair comparison. We measure performance on a wide variety of tasks categorized as follow:\n\n- Commonsense Reasoning (0-shot): Hellaswag [28], Winogrande [21], PIQA [4], SIQA [22], OpenbookQA [19], ARC-Easy, ARC-Challenge [9], CommonsenseQA [24]\n- World Knowledge (5-shot): NaturalQuestions [16], TriviaQA [15]\n- Reading Comprehension (0-shot): BoolQ [8], QuAC [7]\n- Math: GSM8K [10] (8-shot) with maj@8 and MATH [13] (4-shot) with maj@4\n- Code: Humaneval [5] (0-shot) and MBPP [2] (3-shot)\n- Popular aggregated results: MMLU [12] (5-shot), BBH [23] (3-shot), and AGI Eval [29] (3-5-shot, English multiple-choice questions only)\n\nDetailed results for Mistral 7B, Llama 2 7B/13B, and Code-Llama 7B are reported in Table 2. Figure 4 compares the performance of Mistral 7B with Llama 2 7B/13B, and Llama $134 B^{4}$ in different categories. Mistral 7B surpasses Llama 2 13B across all metrics, and outperforms Llama 1 34B on most benchmarks. In particular, Mistral 7B displays a superior performance in code, mathematics, and reasoning benchmarks.\n\n[^0]\n[^0]: ${ }^{4}$ Since Llama 2 34B was not open-sourced, we report results for Llama 1 34B.\n\n\n\nFigure 4: Performance of Mistral 7B and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 7B and Llama 2 13B on all benchmarks. 
It is also vastly superior to Llama 1 34B in mathematics, code generation, and reasoning benchmarks.\n\n| Model | Modality | MMLU | HellaSwag | WinoG | PIQA | Arc-e | Arc-c | NQ | TriviaQA | HumanEval | MBPP | MATH | GSM8K |\n| :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |\n| LLaMA 2 7B | Pretrained | $44.4 \\%$ | $77.1 \\%$ | $69.5 \\%$ | $77.9 \\%$ | $68.7 \\%$ | $43.2 \\%$ | $24.7 \\%$ | $63.8 \\%$ | $11.6 \\%$ | $26.1 \\%$ | $3.9 \\%$ | $16.0 \\%$ |\n| LLaMA 2 13B | Pretrained | $55.6 \\%$ | $\\mathbf{8 0 . 7 \\%}$ | $72.9 \\%$ | $80.8 \\%$ | $75.2 \\%$ | $48.8 \\%$ | $\\mathbf{2 9 . 0 \\%}$ | $\\mathbf{6 9 . 6 \\%}$ | $18.9 \\%$ | $35.4 \\%$ | $6.0 \\%$ | $34.3 \\%$ |\n| Code-Llama 7B | Finetuned | $36.9 \\%$ | $62.9 \\%$ | $62.3 \\%$ | $72.8 \\%$ | $59.4 \\%$ | $34.5 \\%$ | $11.0 \\%$ | $34.9 \\%$ | $\\mathbf{3 1 . 1 \\%}$ | $\\mathbf{5 2 . 5 \\%}$ | $5.2 \\%$ | $20.8 \\%$ |\n| Mistral 7B | Pretrained | $\\mathbf{6 0 . 1 \\%}$ | $\\mathbf{8 1 . 3 \\%}$ | $\\mathbf{7 5 . 3 \\%}$ | $\\mathbf{8 3 . 0 \\%}$ | $\\mathbf{8 0 . 0 \\%}$ | $\\mathbf{5 5 . 5 \\%}$ | $\\mathbf{2 8 . 8 \\%}$ | $\\mathbf{6 9 . 9 \\%}$ | $\\mathbf{3 0 . 5 \\%}$ | $47.5 \\%$ | $\\mathbf{1 3 . 1 \\%}$ | $\\mathbf{5 2 . 2 \\%}$ |\n\nTable 2: Comparison of Mistral 7B with Llama. Mistral 7B outperforms Llama 2 13B on all metrics, and approaches the code performance of Code-Llama 7B without sacrificing performance on non-code benchmarks.\n\nSize and Efficiency. We computed "equivalent model sizes" of the Llama 2 family, aiming to understand Mistral 7B models\' efficiency in the cost-performance spectrum (see Figure 5). When evaluated on reasoning, comprehension, and STEM reasoning (specifically MMLU), Mistral 7B mirrored performance that one might expect from a Llama 2 model with more than 3x its size. 
On the Knowledge benchmarks, Mistral 7B\'s performance achieves a lower compression rate of 1.9 x , which is likely due to its limited parameter count that restricts the amount of knowledge it can store.\n\nEvaluation Differences. On some benchmarks, there are some differences between our evaluation protocol and the one reported in the Llama 2 paper: 1) on MBPP, we use the hand-verified subset 2) on TriviaQA, we do not provide Wikipedia contexts.\n\n## 4 Instruction Finetuning\n\nTo evaluate the generalization capabilities of Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized: Mistral 7B - Instruct model is a simple and preliminary demonstration that the base model can easily be fine-tuned to achieve good performance. In Table 3, we observe that the resulting model, Mistral 7B - Instruct, exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B - Chat models. An independent human evaluation was conducted on https://limboxing.com/leaderboard.\n\n| Model | Chatbot Arena <br> ELO Rating | MT Bench |\n| :-- | :--: | :--: |\n| WizardLM 13B v1.2 | 1047 | 7.2 |\n| Mistral 7B Instruct | $\\mathbf{1 0 3 1}$ | $\\mathbf{6 . 8 4}$ +/- $\\mathbf{0 . 0 7}$ |\n| Llama 2 13B Chat | 1012 | 6.65 |\n| Vicuna 13B | 1041 | 6.57 |\n| Llama 2 7B Chat | 985 | 6.27 |\n| Vicuna 7B | 997 | 6.17 |\n| Alpaca 13B | 914 | 4.53 |\n\nTable 3: Comparison of Chat models. Mistral 7B Instruct outperforms all 7B models on MT-Bench, and is comparable to 13B - Chat models.\n\nIn this evaluation, participants were provided with a set of questions along with anonymous responses from two models and were asked to select their preferred response, as illustrated in Figure 6. 
As of October 6, 2023, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.\n\n\n\nFigure 5: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension for Mistral 7B and Llama 2 (7B/13B/70B). Mistral 7B largely outperforms Llama 2 13B on all evaluations, except on knowledge benchmarks, where it is on par (this is likely due to its limited parameter count, which limits the amount of knowledge it can compress).\n\n# 5 Adding guardrails for front-facing applications \n\nThe ability to enforce guardrails when it comes to AI generation is important for front-facing applications. In this section, we highlight how to leverage system prompting to optionally enforce output constraints on top of our models. Additionally, we showcase the ability of Mistral 7B to perform fine-grained content moderation, which can be useful to enforce quality content in applications.\n\n### 5.1 System prompt to enforce guardrails\n\nWe introduce a system prompt (see below) to guide the model to generate answers within specified guardrails, similar to the work done with Llama 2. Using this prompt allows the user to move on the Pareto front of model utility / guardrails enforcement, as indicated in Table 4.\n\nAlways assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.\n\nWe use a set of 175 unsafe prompts for evaluating safety. With the recommended system prompt, the model properly declines to answer $100 \\%$ of the harmful questions.\nAs an illustration, we provide in Table 5 the answers of both Mistral 7B - Instruct and Llama 2 Chat 13B to the question How to kill a linux process with system prompts activated. We observe that Mistral 7B provides a correct response while Llama 2 declines to answer. 
Note that on this specific question, both models answer correctly when system prompts are deactivated.\n\n| Guardrails | MT Bench |\n| :-- | :--: |\n| No system prompt | $6.84 \\pm 0.07$ |\n| Llama 2 system prompt | $6.38 \\pm 0.07$ |\n| Mistral system prompt | $6.58 \\pm 0.05$ |\n\nTable 4: System prompts. Mean official MT Bench score over 10 iterations with standard deviation for Mistral 7B - Instruct. For reference, Llama 2 13B - Chat reports official results of 6.65 .\n\n### 5.2 Content moderation with self-reflection\n\nMistral 7B - Instruct can be used as a content moderator: the model itself is able to accurately classify a user prompt or its generated answer as being either acceptable or falling into one of the following categories: Illegal activities such as terrorism, child abuse or fraud; Hateful, harassing or violent content such as discrimination, self-harm or bullying; Unqualified advice for instance in legal, medical or financial domains.\n\n| Model | Answer |\n| :--: | :--: |\n| Mistral 7B - Instruct with Mistral system prompt | To kill a Linux process, you can use the `kill\' command followed by the process ID (PID) of the process you want to terminate. For example, to kill process with PID 1234, you would run the command `kill 1234`. It\'s important to note that killing a process can have unintended consequences, so it\'s generally a good idea to only kill processes that you are certain you want to terminate. Additionally, it\'s generally a good idea to use the \'kill\' command with caution, as it can be used to terminate critical system processes that are necessary for the operation of the system. |\n| Llama 2 13B - Chat with Llama 2 system prompt | As a helpful and respectful assistant, I cannot provide information on how to kill a Linux process. It is not appropriate or ethical to intentionally harm or terminate processes, as it can cause damage to the system and potentially result in data loss. 
Additionally, it is important to respect the autonomy and security of other users\' processes, as they may be critical to their work or personal activities. Instead, I suggest that you explore other options to address any issues or concerns you may have. |\n\nTable 5: Comparison between Mistral and Llama system prompts. Mistral provides the right answer whereas Llama 2 declines to answer to the question How to kill a linux process.\n\nTo do so, we designed a self-reflection prompt that makes Mistral 7B classify a prompt or a generated answer. We evaluated self-reflection on our manually curated and balanced dataset of adversarial and standard prompts and got a precision of $99.4 \\%$ for a recall of $95.6 \\%$ (considering acceptable prompts as positives).\n\nThe use cases are vast, from moderating comments on social media or forums to brand monitoring on the internet. In particular, the end user is able to select afterwards which categories to effectively filter based on their particular use-case.\n\n# 6 Conclusion \n\nOur work on Mistral 7B demonstrates that language models may compress knowledge more than what was previously thought. This opens up interesting perspectives: the field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost, as in [14]); the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.\n\n## Acknowledgements\n\nWe are grateful to CoreWeave for their 24/7 help in marshalling our cluster. We thank the CINECA/EuroHPC team, and in particular the operators of Leonardo, for their resources and help. We thank the maintainers of FlashAttention, vLLM, xFormers, Skypilot for their precious assistance in implementing new features and integrating their solutions into ours. 
A huge thanks to Tri Dao and Daniel Haziza for helping include Mistral related changes to FlashAttention and xFormers on a tight schedule. We thank the teams of Hugging Face, AWS, GCP, Azure ML for their intense help in making our model compatible everywhere.\n\n# Can you recommend a foundational and accessible book about quantum physics? \n\nCapy Link to Question Read Round (Enter) $\\rightarrow$\n\nRound Winner: mistral-7b-instruct-v0.1\n\n\n## mistral-7b-instruct-v0.1 $\\gg$ Round Winner $\\gg$\n\nA great book for a foundational understanding of quantum physics is "The Quantum Universe: Everything that Can Happen Does Happen" by Brian Cox and Jeff Forshaw. This book is accessible and easy to understand, making it a great choice for beginners. It covers all the basics of quantum mechanics, including superposition, entanglement, and the uncertainty principle. The authors also explore the implications of quantum physics for our understanding of the universe, including the nature of time and space. Overall, "The Quantum Universe" is an excellent choice for anyone looking to gain a foundational understanding of this fascinating field\n\n## llama-2-13b-chat\n\nCertainly! For a foundational and accessible introduction to quantum physics, I highly recommend "The Feynman Lectures on Physics" by Richard Feynman. This classic text provides a clear and engaging overview of the principles and concepts of quantum mechanics, with minimal mathematical formalism and plenty of illustrations and examples. It\'s a great starting point for anyone looking to gain a deeper understanding of quantum physics without getting bogged down in too much technical detail.\n\nFigure 6: Human evaluation of Mistral 7B - Instruct vs Llama 2 13B - Chat Example. An example of human evaluation from libboxing.com. The question asks for recommendations of books in quantum physics. 
Llama 2 13B - Chat recommends a general physics book, while Mistral 7B - Instruct recommends a more relevant book on quantum physics and describes in the contents in more detail.\n\n# References \n\n[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.\n[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.\n[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.\n[4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020.\n[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.\n[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.\n[7] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018.\n[8] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.\n[9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? 
try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.\n[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.\n[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.\n[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.\n[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.\n[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, volume 35, 2022.\n[15] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.\n[16] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. 
Transactions of the Association for Computational Linguistics, 7:453-466, 2019.\n\n[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.\n[18] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/ facebookresearch/xformers, 2022.\n[19] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.\n[20] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.\n[21] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021.\n[22] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.\n[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.\n[24] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. 
arXiv preprint arXiv:1811.00937, 2018.\n[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.\n[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.\n[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.\n[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.\n[29] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.'
"out/mistral7b-text-only.md") write_text_file(my_mistral_ocr.to_markdown(),
Text successfully written to out/mistral7b-text-only.md.
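The response printed above is a list of pages, each with an `index` and a `markdown` string, so a text-only export amounts to joining those strings in page order. The sketch below illustrates that idea under stated assumptions: `Page`, `Response`, and `pages_to_markdown` are hypothetical stand-ins written for this example, not the actual `MyMistralOCRpdf.to_markdown` implementation or the Mistral client types, and dropping image references is only a guess at what "text-only" means here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for one OCR page (index + markdown)."""
    index: int
    markdown: str


@dataclass
class Response:
    """Hypothetical stand-in for an OCR response holding a list of pages."""
    pages: list = field(default_factory=list)


# Matches markdown image references such as ![img-0.jpeg](img-0.jpeg).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def pages_to_markdown(response, keep_images=False):
    """Join page markdown in page order; optionally drop image references."""
    ordered = sorted(response.pages, key=lambda page: page.index)
    parts = []
    for page in ordered:
        text = page.markdown
        if not keep_images:
            # Assumption: a text-only export removes image links entirely.
            text = IMAGE_REF.sub("", text)
        parts.append(text.strip())
    # Separate pages with a blank line so headings stay on their own lines.
    return "\n\n".join(parts)


# Pages are given out of order on purpose to show the sort by index.
demo = Response(pages=[
    Page(index=1, markdown="## 2 Architecture\n\n![img-1.jpeg](img-1.jpeg)\n\nSome text."),
    Page(index=0, markdown="# Mistral 7B\n\n#### Abstract"),
])
print(pages_to_markdown(demo))
```

Passing `keep_images=True` leaves the image references in place, which would suit an export that also saves the extracted images to a folder.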
9.1.2 PDF with Table
rlq_pain = MyMistralOCRpdf("docs/Right Lower Quadrant Pain.pdf")
rlq_pain.ocr_pdf()
OCRResponse(pages=[OCRPageObject(index=0, markdown='# American College of Radiology ACR Appropriateness Criteria ${ }^{\\circledR}$ Right Lower Quadrant Pain \n\nVariant 1:\nRight lower quadrant pain. Initial imaging.\n\n| Procedure | Appropriateness Category | Relative Radiation Level |\n| :-- | :--: | :--: |\n| CT abdomen and pelvis with IV contrast | Usually Appropriate | $\\mathbf{0 \\% \\%}$ |\n| US abdomen | May Be Appropriate | O |\n| US pelvis | May Be Appropriate | O |\n| MRI abdomen and pelvis without and with IV <br> contrast | May Be Appropriate | O |\n| MRI abdomen and pelvis without IV contrast | May Be Appropriate | O |\n| CT abdomen and pelvis without IV contrast | May Be Appropriate | $\\mathbf{0 \\% \\%}$ |\n| Radiography abdomen | Usually Not Appropriate | $\\mathbf{0 \\%}$ |\n| Fluoroscopy contrast enema | Usually Not Appropriate | $\\mathbf{0 \\% \\%}$ |\n| CT abdomen and pelvis without and with IV <br> contrast | Usually Not Appropriate | $\\mathbf{0 \\% \\% \\%}$ |\n| WBC scan abdomen and pelvis | Usually Not Appropriate | $\\mathbf{0 \\% \\% \\%}$ |\n\nVariant 2:\nRight lower quadrant pain, fever, leukocytosis. Suspected appendicitis. 
Initial imaging.\n\n| Procedure | Appropriateness Category | Relative Radiation Level |\n| :-- | :--: | :--: |\n| CT abdomen and pelvis with IV contrast | Usually Appropriate | $\\mathbf{0 \\% \\%}$ |\n| CT abdomen and pelvis without IV <br> contrast | May Be Appropriate | $\\mathbf{0 \\% \\%}$ |\n| US abdomen | May Be Appropriate | O |\n| US pelvis | May Be Appropriate | O |\n| MRI abdomen and pelvis without and with IV <br> contrast | May Be Appropriate | O |\n| MRI abdomen and pelvis without IV contrast | May Be Appropriate | O |\n| CT abdomen and pelvis without and with IV <br> contrast | Usually Not Appropriate | $\\mathbf{0 \\% \\% \\%}$ |\n| Radiography abdomen | Usually Not Appropriate | $\\mathbf{0 \\%}$ |\n| WBC scan abdomen and pelvis | Usually Not Appropriate | $\\mathbf{0 \\% \\% \\%}$ |\n| Fluoroscopy contrast enema | Usually Not Appropriate | $\\mathbf{0 \\% \\%}$ |', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=1, markdown='| Procedure | Appropriateness Category | Relative Radiation Level |\n| :-- | :--: | :--: |\n| US abdomen | Usually Appropriate | O |\n| MRI abdomen and pelvis without IV contrast | Usually Appropriate | O |\n| US pelvis | May Be Appropriate | O |\n| CT abdomen and pelvis with IV contrast | May Be Appropriate | |\n| CT abdomen and pelvis without IV contrast | May Be Appropriate | |\n| CT abdomen and pelvis without and with IV <br> contrast | Usually Not Appropriate | |\n| MRI abdomen and pelvis without and with IV <br> contrast | Usually Not Appropriate | O |\n| WBC scan abdomen and pelvis | Usually Not Appropriate | |\n| Radiography abdomen | Usually Not Appropriate | |\n| Fluoroscopy contrast enema | Usually Not Appropriate | |', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=2, markdown='# RIGHT LOWER QUADRANT PAIN \n\nExpert Panel on Gastrointestinal Imaging: Avinash R. Kambadakone, MD ${ }^{a}$; Cynthia S. 
Santillan, MD ${ }^{b}$; David H. Kim, MD ${ }^{c}$; Kathryn J. Fowler, MD ${ }^{d}$; James H. Birkholz, MD ${ }^{e}$; Marc A. Camacho, MD, MS ${ }^{f}$; Brooks D. Cash, MD ${ }^{g}$; Bari Dane, MD ${ }^{h}$; Robin A. Felker, MD ${ }^{i}$; Eric J. Grossman, MD ${ }^{j}$; Elena K. Korngold, MD ${ }^{k}$; Peter S. Liu, MD ${ }^{l}$; Daniele Marin, MD ${ }^{m}$; Marion McCrary, MD ${ }^{n}$; Jason A. Pietryga, MD ${ }^{n}$; Stefanie Weinstein, MD ${ }^{p}$; Katherine Zukotynski, MD, PhD ${ }^{q}$; Laura R. Carucci, MD. ${ }^{r}$\n\n## Summary of Literature Review\n\n## Introduction/Background\n\nRight lower quadrant (RLQ) abdominal pain accounts for nearly $50 \\%$ of patients presenting to the emergency department with abdominal pain [1]. Appendicitis is the most common surgical pathology responsible for RLQ abdominal pain in the United States [1,2]. Other less frequent causes of RLQ pain include right colonic diverticulitis, ureteral stone, colitis, and intestinal obstruction [1,3,4]. Imaging remains the diagnostic mainstay in the workup of patients presenting with RLQ abdominal pain for evaluation of suspected appendicitis and diagnosis of other conditions. Buckius et al [2] reported an annual increase in the rate of acute appendicitis in the United States; however, Ferris et al [5], in a recent systematic review of population-based studies, showed that although the incidence of both perforated and nonperforated appendicitis is stable in North America, the incidence is rising in newly industrialized countries. Historically, the clinical determination of appendicitis has been poor, particularly in special patient populations, such as those at the extremes of age and pregnant women. The negative appendectomy rate (NAR) based on clinical determination alone without imaging is unacceptably high, as high as $25 \\%$ [6]. 
Clinical decisions tools, such as the Alvarado score (AS), have not improved the outright diagnostic accuracy of the clinical examination [7] and demonstrate mixed results as an adjunct to help guide CT use [8,9]. The decrease in NAR with increased imaging utilization is not accompanied by an increase in perforations from any introduced delays [10,11].\n\nThe choice of imaging modality should be tailored for diagnosis of acute appendicitis in patients with a high degree of suspicion but should also allow diagnosis of other causes of RLQ pain to triage appropriate patient management. In patients with suspected appendicitis, modalities should demonstrate high accuracy, which allows for 1) the confident (and presumed early) diagnosis in positive cases, reducing delays in diagnosis and perforation with attendant morbidity and mortality; 2) the confident exclusion of the diagnosis in negative cases with a decrease in the NAR and the attendant potential surgical complications; and 3) the confident diagnosis of alternative diagnoses, in many cases.\n\nThis document refers to imaging appropriateness in diagnosis of adult patients who are $>18$ years of age. References including pediatric patient populations are identified where included. Suspected appendicitis in pediatric patients is covered in the ACR Appropriateness Criteria ${ }^{\\circledR}$ topic on "Suspected Appendicitis-Child" [12].\n\n## Special Imaging Considerations\n\nTo increase the sensitivity or specificity of imaging modalities in diagnosing the cause of RLQ abdominal pain, investigators have sought alternative techniques, made possible by advances in technology and the expansion of known advanced imaging techniques to new applications and disease conditions. 
Research articles are particularly focused on enhancing diagnosis of appendicitis, and investigators are using sonographic elastography, diffusionweighted imaging (DWI) via MRI to increase diagnostic performance and decrease the dependence on CT, modified low-dose CT (LDCT) protocols, and dual-energy CT. Abbreviated MR protocols are also being investigated to\n\n[^0]\n[^0]: ${ }^{a}$ Massachusetts General Hospital, Boston, Massachusetts. ${ }^{\\text {b }}$ University of California San Diego, San Diego, California. ${ }^{c}$ Panel Chair, University of Wisconsin Hospital \\& Clinics, Madison, Wisconsin. ${ }^{d}$ Panel Vice-Chair, University of California San Diego, San Diego, California. ${ }^{e}$ Penn State Milton S. Hershey Medical Center, Hershey, Pennsylvania. ${ }^{f}$ The University of South Florida Morsani College of Medicine, Tampa, Florida; Committee on Emergency Radiology-GSER. ${ }^{g}$ University of Texas Health Science Center at Houston and McGovern Medical School, Houston, Texas; American Gastroenterological Association. ${ }^{h}$ NYU Grossman School of Medicine, New York, New York. ${ }^{i}$ Medstar Georgetown University Hospital, Washington, District of Columbia, Primary care physician. ${ }^{j}$ Santa Barbara Cottage Hospital, Santa Barbara, California; American College of Surgeons. ${ }^{k}$ Oregon Health and Science University, Portland, Oregon. ${ }^{l}$ Cleveland Clinic, Cleveland, Ohio. ${ }^{m}$ Duke University Medical Center, Durham, North Carolina. ${ }^{n}$ Duke Signature Care, Durham, North Carolina; American College of Physicians. ${ }^{o}$ University of Alabama at Birmingham, Birmingham, Alabama. ${ }^{p}$ University of California San Francisco, San Francisco, California. ${ }^{q}$ McMaster University, Hamilton, Ontario, Canada; Commission on Nuclear Medicine and Molecular Imaging. 
${ }^{s}$ Specialty Chair, Virginia Commonwealth University Medical Center, Richmond, Virginia.\n\n The American College of Radiology seeks and encourages collaboration with other organizations on the development of the ACR Appropriateness Criteria through representation of such organizations on expert panels. Participation on the expert panel does not necessarily imply endorsement of the final document by individual contributors or their respective organization.', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=3, markdown='expedite patient turnaround times and reduce imaging costs in emergency department patients with RLQ pain and suspected appendicitis [13].\nWith increasing rates of diagnostic imaging, primarily CT, in patients presenting to emergency departments, the phenomenon of multiple imaging episodes has become of concern. This has led to attempts to develop LDCT techniques $[4,14,15]$ including limited coverage CT alternatives [16-18].\nDual-energy CT is a CT technology that enables superior tissue characterization because of material decomposition achieved by using 2 photon spectra, through either source-based or detector-based technology. Elbanna et al [19], in a retrospective study of 209 patients with appendicitis including 44 patients with gangrenous appendicitis, evaluated the role of dual-energy CT performed with oral and intravenous (IV) contrast. They found that use of 40 keV monoenergetic and iodine overlay images had a high sensitivity ( $100 \\%$ ) and specificity ( $80 \\%-81 \\%$ ) for diagnosing gangrenous appendicitis compared with 120 kVp simulated imaging.\nAbbreviated MR protocols comprising T2 half-Fourier acquisition single-shot turbo spin echo (HASTE) and DWI images have been shown to reduce imaging and interpretation times in diagnosis of appendicitis in emergency department patients with comparable accuracy to full protocol [13]. 
DWI sequences are well established in stroke and tumor imaging, but observers are finding ever-increasing applications in abdominal conditions, in part because of echo planar imaging, which increases the speed of acquisition and reduces motion artifacts [20,21]. In adult patients with appendicitis, adding DWI sequences has been shown to have specificities and positive predictive values (PPVs) of $100 \\%$ each and sensitivities and negative predictive values (NPVs) between $97 \\%$ to $99 \\%$ for qualitative findings made by 2 experienced observers in high agreement [22]. Avcu et al [21] found similar results for DWI, with a specificity and PPV of $100 \\%$, a sensitivity of $98 \\%$, and an NPV of $94 \\%$. Inoue et al [23] reported that a combination of DWI sequence and T2-weighted images provided higher accuracy for diagnosing appendicitis and that inflamed appendix had lower apparent diffusion coefficient (ADC) value than the normal appendix. Avcu et al [21] also found a cutoff ADC value that showed a sensitivity of $78 \\%$ and specificity of $92 \\%$ on receiver operator characteristics curve analysis for discriminating perforated from nonperforated appendicitis. DWI may increase the conspicuity of the appendix, increasing the reader\'s confidence of visualization [20,22].\n\n# Initial Imaging Definition \n\nInitial imaging is defined as imaging at the beginning of the care episode for the medical condition defined by the variant. 
More than one procedure can be considered usually appropriate in the initial imaging evaluation when:\n\n- There are procedures that are equivalent alternatives (ie, only one procedure will be ordered to provide the clinical information to effectively manage the patient\'s care)\n\nOR\n\n- There are complementary procedures (ie, more than one procedure is ordered as a set or simultaneously where each procedure provides unique clinical information to effectively manage the patient\'s care).\n\n\n## Discussion of Procedures by Variant\n\n## Variant 1: Right lower quadrant pain. Initial imaging.\n\nIn this clinical scenario, the patient presents with RLQ pain and may have associated signs and symptoms. Although appendicitis is in the differential in this patient, it is not the leading consideration from the clinical presentation in which other etiologies such as nonappendiceal gastrointestinal, genitourinary, hepato-pancreatic, and gynecologic conditions remain equally possible diagnostic considerations. Imaging methods for initial evaluation in patients in this clinical variant should be able to detect or exclude acute appendicitis and these other alternate diagnoses. If appendicitis is a primary concern, Variant 2 or 3 may be more applicable. If gynecologic conditions are a primary concern, please refer to ACR Appropriateness Criteria ${ }^{\\circledR}$ topic on "Acute Pelvic Pain in the Reproductive Age Group" [24]. If acute flank pain related to urinary stone disease (urolithiasis) is a primary concern, please refer to ACR Appropriateness Criteria ${ }^{\\circledR}$ topic on "Acute Onset Flank Pain-Suspicion of Stone Disease (Urolithiasis)" [25].\n\n## CT Abdomen and Pelvis\n\nCT of the abdomen and pelvis is an excellent diagnostic imaging modality for the evaluation of patients with nonspecific RLQ pain because of its high diagnostic yield for detection of appendicitis as well as suggesting alternative diagnosis $[1,26,27]$. 
CT frequently identifies the cause of RLQ pain, and these conditions, including appendicitis and other etiologies, often require hospitalization and invasive treatment. Regarding appendicitis, Rud', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=4, markdown="et al [28] reported the results of a meta-analysis of 71 study populations, which included studies with noncontrast CT and contrast-enhanced CT with rectal or oral contrast. They reported a summary sensitivity of $95 \\%$ ( $95 \\%$ confidence interval [CI]: 0.93-0.96) and summary specificity of $94 \\%$ ( $95 \\%$ CI: 0.92-0.95) for CT in the diagnosis of appendicitis. In contrast, Pooler et al [27] reported that in patients undergoing contrast-enhanced CT with oral contrast for RLQ pain with diagnosis other than appendicitis at CT, the CT diagnosis was concordant with clinical diagnosis in $94.3 \\%$ of cases (383/406). In patients ( $\\mathrm{n}=90 / 496$ ) without a final clinical diagnosis, the most common CT diagnosis included benign adnexal mass, gastroenteritis, colitis, constipation, inflammatory bowel disease, and pelvic congestion syndrome [27]. In patients receiving nonappendiceal CT diagnosis, $41 \\%$ were hospitalized, with $22 \\%$ undergoing surgical or image-guided intervention [27]. In contrast, in patients without CT diagnosis, only $14 \\%$ were hospitalized and $4 \\%$ underwent surgical or image-guided intervention. Barksdale et al [29] prospectively evaluated the impact of CT on emergency department physician diagnosis and disposition plans in 547 adult patients ( $\\geq 18$ years of age). In the subgroup analysis of those suspected to have appendicitis ( 67 patients), the diagnosis was altered in 43 patients, decreasing the number to 24 patients ( $4.4 \\%$ ) of the population. Morley et al [1] reported that in patients with RLQ pain, right colonic diverticulitis and obstruction were seen in $8 \\%$ and $3 \\%$ of patients, respectively. 
Patients with infectious enterocolitis such as typhlitis, inflammatory terminal ileitis, and ureteral stone can also present with RLQ pain.\n\nIn a recent study using the National Hospital Ambulatory Medical Care Survey, Wang et al [30] reported that CT use significantly increased from $3.9 \\%$ in 1997 ( $95 \\%$ CI: $3.1 \\%-4.8 \\%$ ) to $37.8 \\%$ ( $95 \\%$ CI: $35.5 \\%-41 \\%$ ) in 2016 for adults presenting to the emergency department for abdominal pain and appendicitis. A large number of studies investigating the role of imaging in patients with RLQ pain are tailored to the diagnosis of suspected appendicitis. Studies specific to patients with RLQ pain not specific to suspected appendicitis or with atypical presentation are limited, despite the fact that this represents approximately $50 \\%$ of this patient population [27,31]. Three singleinstitution studies were identified [29,32,33], each evaluating patients presenting with nonspecific atraumatic abdominal pain. In a study of 257 adult patients who underwent appendectomy, including 10 pregnant patients, the patients were divided into 4 groups based on the AS (AS $<5$ : low clinical suspicion for acute appendicitis and AS $\\geq 5$ : high clinical suspicion for acute appendicitis) and the presence or absence of preoperative CT [33]. Negative appendectomy rate was determined for each group. The total negative appendectomy rate was $5.8 \\%$, ranging from $2.6 \\%$ to $18.7 \\%$ in the 4 groups. The highest negative appendectomy rate was observed in the low probability nonCT group with AS $<5$. The odds ratio (OR) of negative appendectomy rate for patients without a CT scan was 5.2 ( $95 \\%$ CI: 1.2-27.7) for low clinical probability and 1.6 ( $95 \\%$ CI: 0.2-14.2) for high clinical probability, respectively.\n\nThere are no studies comparing the utility of noncontrast CT versus contrast-enhanced CT and CT with and without IV contrast in the diagnosis of patients with RLQ pain due to any cause. 
In patients with suspected appendicitis, noncontrast CT has high diagnostic accuracy in detecting acute appendicitis. In a meta-analysis of 7 studies with 1,060 patients, Hlibczuk et al [34] reported that noncontrast CT had pooled sensitivity and specificity of $92.7 \\%$ and $96.1 \\%$, respectively, for diagnosis of acute appendicitis. In a single-center, unblinded, parallel randomized controlled trial of noncontrast CT and ultrasound (US) in patients with atypical right iliac fossa pain, Jones et al [32] found that noncontrast CT had a diagnostic accuracy of $73 \\%$, a PPV of $100 \\%$, and an NPV of $100 \\%$ for acute appendicitis. In addition to acute appendicitis, CT findings included diverticulitis and ileocecitis.\n\n# Fluoroscopy Contrast Enema \n\nThere is no relevant literature supporting the use of contrast enema in the evaluation of RLQ pain.\n\n## MRI Abdomen and Pelvis\n\nRelevant articles from the literature search included 1 retrospective study, 5 prospective studies, and 1 meta-analysis for the topic of MRI in the evaluation of acute abdominal or RLQ pain particularly for the diagnosis of appendicitis and alternate diagnoses in adult patients, not limited to pregnant patients. Studies reporting utility of MRI with IV contrast should be interpreted as MRI performed without and with IV contrast because noncontrast MRI sequences such as T2-weighted images, DWIs, and precontrast T1-weighted images are integral to all contrast-enhanced MRI examinations.\n\nIn patients with RLQ pain, MRI allows accurate diagnosis of appendicitis as well as suggesting alternative diagnosis [3]. In a prospective study of 52 patients, gadolinium-enhanced MRI provided an alternative diagnosis in $52 \\%$ of patients, which included diverticulitis, ileitis, colitis, ischemia, small bowel adhesions, pancreatitis, inguinal hernia, Crohn's disease, choledocholithiasis, and ileus [3]. 
A prospective multicenter diagnostic accuracy study performed to determine the accuracy and interobserver agreement between MR-expert and MR-nonexpert radiologists identified alternative urgent diagnoses including diverticulitis, urgent gynecological disorders, urgent urinary tract", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=5, markdown="disorders, bowel obstruction, and pneumonia. The sensitivity for detecting all urgent diagnoses for nonexpert radiologists was $84 \\%$ ( $95 \\%$ CI: $78 \\%-88 \\%$ ) compared with expert radiologists with $95 \\%$ ( $95 \\%$ CI: $90 \\%-98 \\%$ ). The specificity for detecting all urgent diagnoses for nonexpert radiologists was $71 \\%$ ( $95 \\%$ CI: $62 \\%-79 \\%$ ) compared with expert radiologists at $100 \\%$ ( $95 \\%$ CI: $76 \\%-100 \\%$ ). Interobserver agreement expressed as Cohen's $\\kappa$ was 0.63 ( $95 \\%$ CI: $0.55-0.70$ ), consistent with good (but not excellent) agreement [35]. For this document, it is assumed that the procedure is performed and interpreted by an expert. A single-institution retrospective study of 403 patients ( 3 to 49 years of age) undergoing noncontrast MRI without oral or IV contrast identified both urgent and nonurgent alternative diagnoses in 336 patients. These conditions included gastrointestinal, gynecologic, urinary tract, musculoskeletal, inflammatory, neoplastic, and congenital conditions [36].\nRegarding simple acute appendicitis, MRI for experienced readers had a sensitivity of $85 \\%$ to $98 \\%$ [3,37], a specificity of $93 \\%$ to $99.4 \\%$ ( $95 \\%$ CI: $97.9 \\%-99.9 \\%$ ) [35,36], a PPV of $94 \\%$ ( $95 \\%$ CI: $88 \\%-97 \\%$ ), an NPV of $100 \\%$ [21,35], and an accuracy of $93.75 \\%$ to $96 \\%$ [21,35]. 
Values for less-experienced readers had a sensitivity of $77 \\%$ to $89 \\%$ ( $95 \\%$ CI: $77 \\%-88 \\%$ ) [3,35], a specificity of $79 \\%$ to $83 \\%$ ( $95 \\%$ CI: $77 \\%-88 \\%$ ) [3,35], a PPV of $86 \\%$ ( $95 \\%$ CI: $81 \\%-90 \\%$ ), and an NPV of $88 \\%$ ( $95 \\%$ CI: $82 \\%-91 \\%$ ) [35]. For this document, it is assumed that the procedure is performed and interpreted by an expert.\nMRI performance for diagnosis of perforated appendicitis was published in 2 studies [21,38] and was demonstrated to be less robust with sensitivities and specificities of $57 \\%$ ( $95 \\%$ CI: $39 \\%-73 \\%$ ) and $86 \\%$ ( $95 \\%$ CI: $77 \\%-91 \\%$ ) and $77.8 \\%$ and $91.7 \\%$, respectively. Subgroup analyses of pediatric, pregnant, male, and female patients were performed in 2 studies without reaching statistical significance in either [36,39]. This finding was also not significantly different when compared with US with conditional CT diagnostic strategy [38]. Specific variations in technique were evaluated prospectively in 2 studies. Diagnostic performance to assess T2 HASTE imaging in 468 patients ( 7 59 years of age) yielded a sensitivity and specificity of $98 \\%$ ( $95 \\%$ CI) and $92 \\%$ ( $95 \\%$ CI), respectively [37]. Evaluation of DWI and ADC demonstrated mean ADC value for patients with appendicitis were significantly lower compared with controls. The sensitivity, specificity, NPV, and PPV for detecting appendicitis were reported as $97.5 \\%, 100 \\%, 93.75 \\%$, and $100 \\%$, respectively [21]. The protocol [39] also included DWI with procedure sensitivity and specificity of $97 \\%$ and $93 \\%$, respectively. 
Combined diagnostic performance of 1.5 T and 3.0 T systems demonstrated a sensitivity and specificity of $97.0 \\%$ ( $95 \\%$ CI: $89.6 \\%-99.6 \\%$ ) and $99.4 \\%$ ( $95 \\%$ CI: $97.9 \\%-$ $99.9 \\%$ ), respectively, and an absence of statistically significant differences between the 2 field strengths [36].\nA meta-analysis performed from 30 studies from 1997 through 2015 contained a total of 2,665 patients that included pediatric, adult, and pregnant patients. The sensitivity and specificity for MRI detection of appendicitis were $96 \\%$ ( $95 \\%$ CI: $95 \\%-97 \\%$ ) and $96 \\%$ ( $95 \\%$ CI: $95 \\%-97 \\%$ ), respectively. This study did not find a statistically significant difference for the diagnostic accuracy of appendicitis between studies that were performed without IV contrast and those performed with IV contrast [40]. One study reported sensitivity and specificity for MRI detection of perforated appendicitis as $57 \\%$ and $86 \\%$, respectively. This finding was not significantly different when compared with US with conditional CT [38].\n\n# Radiography Abdomen \n\nWith the shift to cross-sectional imaging modalities for evaluation of patients with RLQ pain, there is little current literature on radiographic signs. A prospective single-institution study [41] of the fecal loading sign, cecum distended with stool containing innumerable punctate lucencies, evaluated 470 adult and pediatric patients with acute abdominal pain. Patients were divided into 4 groups, with the appendicitis group subdivided into patients with preoperative only and both preoperative and postoperative abdominal radiographs. The fecal loading sign had a sensitivity, specificity, PPV, and NPV of $97.05 \\%, 85.33 \\%, 78.94 \\%$, and $98 \\%$, respectively. Fecal loading in the cecum was associated with all stages of appendicitis and disappeared after appendectomy. 
This sign was uncommon in other acute inflammatory diseases of the right side of the abdomen evaluated, which includes right nephrolithiasis (19\\%), right pelvic inflammatory disease (12\\%), and acute cholecystitis (13\\%).\n\n## US Abdomen\n\nUS research articles did not consistently differentiate abdominal from pelvic US protocols. The methods sections were reviewed, and, where specified, articles were separated into abdomen or pelvis. Studies of the right iliac fossa were designated pelvis. Articles referring to graded compression US technique [42], those that specified abdomen, and unspecified studies are included in this section. Graded compression grayscale US is a modification to abdominal US, taking advantage of patient respiratory motion by deepening abdominal compression using the transducer and both of the operator's hands upon exhalation to displace intervening organs and simulate clinical deep abdominal palpation [42]. This technique has moderate performance characteristics for diagnosing", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=6, markdown='appendicitis, exacerbated in North America by the decreased visualization rate of the appendix [43-47] in comparison with Europe and Asia because of due to perceived limitation related to patient body habitus.\n\nTwo studies, one retrospective and the second prospective, of US evaluation specifically of patients with atypical presentation of appendicitis, atypical lab results [48], or nonspecific abdominal pain [49] were identified in the current literature. The first study demonstrated US sensitivity, specificity, PPV, NPV, and accuracy of $71.4 \\%$, $78.5 \\%, 94.8 \\%, 33.3 \\%$, and $72.5 \\%$, respectively. 
Subgroup analysis of performance of emergency physicians with FAST experience plus training, 1 day didactic and 1 day practical course for abdominal US examination performance, and radiologists in US diagnosis yielded statistically significant differences: emergency physicians identified $33.3 \\%$ ( 9 of 27 ) of patients with appendicitis, and radiologists identified $59.2 \\%$ (16 of 27) ( $P=.001$ ). For this document, it is assumed that the procedure is performed and interpreted by an expert.\n\n# US Pelvis \n\nUS research articles did not consistently differentiate abdominal from pelvic US protocols. The methods sections were reviewed, and, where specified, articles were separated into abdomen or pelvis. Studies of the right iliac fossa were designated pelvis. Two multi-institution retrospective studies were identified, which included all patients who had undergone US before appendectomy [50,51]. D\'Souza et al [51], in a review of 573 adult and pediatric patients ( $>6$ years of age), yielded mean sensitivity and specificity in patients with visualization of the appendix of $81.7 \\%$ and $53.9 \\%$ and total patient population mean values of $51.8 \\%$ and $81.4 \\%$. NAR in all patients evaluated with US was $38.4 \\%$. The rate for patients with appendix visualization and positive results was $18.3 \\%$. The appendix was not visualized in $45 \\%$ of the patients. A review of 620 patients with US performed [50] yielded a nonvisualization rate of $27.7 \\%$. Evaluation of indirect signs of appendicitis in the nonvisualization subgroup yielded a sensitivity of $31.8 \\%$ to $83.9 \\%$, a specificity of $56.7 \\%$ to $96.7 \\%$, a PPV of $25 \\%$ to $95.8 \\%$, and an NPV of $57.2 \\%$ to $83.3 \\%$, depending on presence and combination of the evaluated indirect signs, pain, hypertrophic periappendiceal fat, and diminished periappendiceal peristalsis. 
In patients where gynecological conditions are a primary concern, please refer to the ACR Appropriateness Criteria ${ }^{\\circledR}$ topic on "Acute Pelvic Pain in the Reproductive Age Group" [24].\n\n## WBC Scan Abdomen and Pelvis\n\nThere is no recent literature regarding the use of Tc-99m white blood cell (WBC) scan abdomen and pelvis in the evaluation of RLQ pain. However, in a blinded prospective study of 30 patients with suspected appendicitis, Foley et al [52] showed that the Tc-99m WBC scan achieved a sensitivity of $81 \\%$, a specificity of $100 \\%$, and an accuracy of $89 \\%$. Because delayed imaging, up to 4 hours post injection, may be required for diagnosis with this procedure, utility may be in the identification of alternate diagnoses of abdominal pain other than appendicitis, especially given the diagnostic performance and rapidity of CT.\n\n## Variant 2: Right lower quadrant pain, fever, leukocytosis. Suspected appendicitis. Initial imaging.\n\nIn this clinical scenario, the patient presents with RLQ pain in which the leading clinical diagnostic consideration is appendicitis. Alternative etiologies such as nonappendiceal gastrointestinal, genitourinary, hepato-pancreatic, and gynecologic conditions remain less likely diagnostic possibilities.\nThe "classic" clinical presentation of patients with appendicitis consisted of periumbilical abdominal pain migrating to the RLQ, loss of appetite, nausea, or vomiting, with fever, and leukocytosis is present in approximately $50 \\%$ of patients. This explains the historical NAR of $14.7 \\%$ and incidental appendectomy rate of $47 \\%$, in which incidental appendectomy refers to the practice of removing a normal appendix in the course of a nonrelated surgical procedure to prevent future development of appendicitis [31]. 
These statistics and growing recognition of the long-term morbidity associated with negative laparotomy have led to the incorporation of preoperative imaging of patients with suspected appendicitis into clinical management algorithms. The diagnostic performance of imaging modalities varies from each other and in different patient populations.\n\n## CT Abdomen and Pelvis\n\nCT has become the most useful diagnostic imaging modality for the evaluation of patients with suspected appendicitis because of its high diagnostic yield. The use of CT for adult emergency department visits for diagnosis of appendicitis increased from $7.2 \\%$ ( $95 \\%$ CI: $2.7 \\%-17.6 \\%$ ) to $83.3 \\%$ ( $95 \\%$ CI: $64.1 \\%-93.3 \\%$ ) between 1997 to 2016 [30]. In the current literature, the NAR range with preoperative CT is $1.7 \\%$ to $7.7 \\%$ [8,53]. In a meta-analysis, Krajewski et al [10] reported that using preoperative CT resulted in a NAR of $8.7 \\%$ versus a rate of $16.7 \\%$ with clinical evaluation alone. The sensitivities range from $85.7 \\%$ to $100 \\%$, and the specificities range from $94.8 \\%$ to $100 \\%[54,55]$. Sensitivity was lowest in nonenhanced CT without enteral contrast [55]. However, a meta-analysis of 7 prospective studies of nonenhanced CT that included patient populations of 49 to 296 resulted in a sensitivity', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=7, markdown="of $0.90(95 \\%$ CI: $0.86-0.92)$ and a specificity of $0.94(95 \\%$ CI: $0.92-0.97$ ) [56]. Concerns raised regarding delay in diagnosis and treatment that are due to oral contrast regimens with potential impact on patients of increased risk of perforation and associated morbidity have fueled evaluation of contrast-enhanced CT with versus without enteral contrast. 
Contrast-enhanced CT without enteral contrast sensitivities range from $90 \\%$ to $100 \\%$, and specificities range from $94.8 \\%$ to $100 \\%$ [54,57], compared with contrast-enhanced CT with enteral contrast (oral or rectal), for which sensitivities range from $90.4 \\%$ to $100 \\%$ and specificities range from $97.67 \\%$ to $100 \\%$ [55,57]. In addition, a single-institution retrospective study of contrast-enhanced CT without enteral contrast in 1,922 patients (16-99 years of age) with a body mass index $>25$ and nontraumatic abdominal pain yielded 799 ( $40.1 \\%$ ) positive CT scans for acute abdominal pathology, explaining the patient's symptomatology. Subgroup analysis of 113 patients with appendicitis yielded a sensitivity of $100 \\%$ and a specificity of $99.5 \\%$ with only 4 patients ( $0.2 \\%$ ), none of whom were in the appendicitis subgroup, returning for repeat CT because of a lack of oral contrast [58].\nIn a recent meta-analysis of 71 study populations, Rud et al [28] reported a summary sensitivity of 0.95 ( $95 \\%$ CI: $0.93-0.96$ ) and a summary specificity of $0.94(95 \\%$ CI $0.92-0.95)$ for CT in the diagnosis of appendicitis. For standard dose unenhanced CT, the summary sensitivity and specificity from 19 studies were 0.91 ( $95 \\%$ CI: 0.87 0.93 ) and 0.94 ( $95 \\%$ CI: $0.90-0.96$ ), respectively [28]. The summary sensitivity for contrast-enhanced CT (18 study populations in 17 studies) was higher ( $0.96,95 \\%$ CI: $0.92-0.98$ ) compared with unenhanced CT ( $0.90,95 \\%$ CI: $0.87-0.93$ ), whereas the summary specificity was comparable ( $0.93,95 \\%$ CI: $0.90-0.95$ versus $0.94,95 \\%$ CI: $0.90-$ 0.96) [28]. In 9 studies reporting CT with rectal contrast enhancement, the summary sensitivity was 0.97 ( $95 \\%$ CI: $0.93-0.99$ ), and the summary specificity was $0.95(95 \\%$ CI: $0.90-0.98$ ), higher than unenhanced CT [28]. 
There was no significant difference between CT with oral contrast enhancement versus unenhanced CT [28]. The summary sensitivity for CT with IV and oral contrast enhancement ( 15 studies) was higher than unenhanced CT ( $0.96,95 \\%$ C:I 0.93-0.98) [28]. Low-dose CT regardless of contrast enhancement had similar summary sensitivity ( $0.94,95 \\%$ CI: $0.90-0.97$ ) and specificity ( $0.94,95 \\%$ CI: $0.91-0.96$ ) [28]. There are no studies comparing the utility of CT without and with IV contrast in the diagnosis of patients with suspected appendicitis.\n\nCT signs of appendicitis have variable accuracy. In 1 retrospective study [59] of CT signs of appendicitis in 224 patients with negative or equivocal contrast-enhanced CT without enteral contrast, maximal outer diameter $>6 \\mathrm{~mm}$, fat stranding, and absence of intraluminal gas were present in patients with appendicitis versus without: $66.3 \\%$ versus $37.0 \\%(P<.001), 34.1 \\%$ versus $8.9 \\%(P=.001)$, and $67.6 \\%$ versus $48.9 \\%(P=.024)$, respectively. With 2 or more signs present, the OR of appendicitis being present was 6.8 ( $95 \\%$ CI: $3.013-15.454 ; P<.001$ ). In a second retrospective study of 100 patients with inconclusive nonenhanced CT followed by contrast-enhanced CT, signs of appendicitis with statistical significance and cutoff values with best sensitivity and specificity were calculated. These were maximal cross-sectional diameter of $8.5 \\mathrm{~mm}, 90.2 \\%$ and $91.5 \\%$; presence of periappendiceal infiltrates $1.5,53.7 \\%$ and $94.9 \\%$; and periappendiceal fluid (graded $0-3$ for absent to severe) $2.5,22 \\%$ and $100 \\%$ [60]. An additional retrospective study reviewed contrast-enhanced CT without enteral contrast scans of 216 patients, 80 with pathologically proven appendicitis and 136 clinically negative for appendicitis, to evaluate the diagnostic performance and identify optimal cutoff of CT signs [61]. 
The maximum outer diameter (MOD) had an area under the curve (AUC) of 0.967 with an optimal cutoff of 8.2 mm yielding a sensitivity, specificity, and accuracy of $88.8 \\%, 93.4 \\%$, and $91.7 \\%$, respectively. Diameter with compression (MOD minus compressible contents) had an AUC of 0.973 with an optimal cutoff value of 6.6 mm and a sensitivity, specificity, and accuracy of $93.8 \\%, 94.9 \\%$, and $94.4 \\%$, respectively. Frequently referenced cutoff value of 6 mm for MOD yielded a sensitivity of $97.5 \\%$, a specificity of $59.6 \\%$, and an accuracy of $73.6 \\%$.\n\nHistorical perforation rates for men and women are $19.2 \\%$ and $17.8 \\%$, respectively [31]. An association with increased morbidity, mortality, and length of stay drives the desire to identify early signs of appendiceal necrosis and occult perforation, before the development of phlegmon, abscess, or gross free peritoneal gas. A retrospective study of 102 patients, 49 with perforation, demonstrated that only $19(37 \\%)$ were diagnosed prospectively, yielding CT sensitivity, specificity, and PPV of $38 \\%, 96 \\%$, and $90 \\%$, respectively [62]. Statistically, significantly associated findings were extraluminal gas (OR, 28.9; $P=.02$ ); intraluminal fecalith (OR, 5.7; $P=.03$ ); and wall thickness $>3$ $\\mathrm{mm}(\\mathrm{OR}, 3.2 ; P=.02$ ). Two retrospective studies $[63,64]$ identified patients with pathologically proven appendicitis and excluded those with gross CT evidence of perforation resulting in patient cohorts of 374 and 339 , respectively. Occult appendiceal perforation/necrosis rates were 65/374 (17.4\\%) and 75/339 (22.1\\%), respectively. Intraluminal gas and appendicoliths were predictive of the presence of perforation with an OR of 2.64 ( $95 \\%$ CI: 1.48-4.73) and 2.67 ( $95 \\%$ CI: 1.55-4.61), respectively [63]. 
Sensitivity and specificity for these 2 signs were $36.9 \\%$ and $81.9 \\%$ (intraluminal air) and $55.4 \\%$ and $68.3 \\%$ (intraluminal appendicolith), respectively. Kim et al [64] also found", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=8, markdown='appendicoliths predictive (OR 2.47; $P=.015$ ) and the additional signs of focal wall defect (OR 23.40; $P<.001$ ), circumferential periappendiceal inflammatory changes (OR, 5.63; $P<.001$ ), and transverse diameter of the appendix (OR, 1.22; $P=.003$ ). Transverse diameter of $\\geq 11 \\mathrm{~mm}$ had the greatest sensitivity, $62.7 \\%$ (range $29.3 \\%$ $62.7 \\%$ ), and focal wall defect had the greatest specificity, $98.8 \\%$ (range $66.3 \\%-98.8 \\%$ ).\n\nCT as Second-Line Imaging Test after Initial US: In a recent meta-analysis of second-line imaging modalities in the diagnosis of acute appendicitis after initial US, the pooled sensitivities and specificities for second-line CT in 11 studies that included 1,027 patients were $89.9 \\%$ ( $95 \\%$ CI: $85.4 \\%-93.2 \\%$ ) and $93.6 \\%$ ( $95 \\%$ CI: $91.2 \\%-95.3 \\%$ ), respectively [65]. Two single-institution studies reviewed the performance of CT following nondiagnostic US. One was a retrospective review of 119 patients [66] with suspected appendicitis and nonvisualized appendix on otherwise normally graded compression US, pelvic US in women with transvaginal US of childbearing age, if not declined, and body mass index $<30$. Contrast-enhanced CT was performed within 48 hours in all patients. Patients were additionally divided into groups based on an AS of 3 or less ( 49 patients) and of 4 or more ( 70 patients). Diagnostic rate for appendicitis in the low AS group was 0 of 49 patients; the high AS group was 12 of 70 patients, with 11 true positive, 1 false-negative, and 2 false-positives ( $17.1 \\%$ ). 
Alternate diagnoses were absent in 42 of 49 patients ( $85.7 \\%$ ) of the low AS group and 41 of 70 patients ( $58.6 \\%$ ) of the high AS group with 2 of 70 patients $(2.9 \\%)$ requiring surgery [66]. The second retrospective review evaluated 318 ( 150 adult and 168 pediatric) patients with suspected appendicitis, graded compression US as initial imaging study, nonvisualization of the appendix, and absence of other pathology on US who underwent contrast-enhanced CT without enteral contrast within 48 hours of US examination. Alternate diagnoses on CT included appendicitis in 52 ( $16.4 \\%$; $95 \\%$ CI: $12.5 \\%-20.9 \\%$ ), 7 perforated ( $13.5 \\%$; $95 \\%$ CI: $5.6 \\%-25.8 \\%$ ); other diagnoses in $16(5.0 \\% ; 95 \\%$ CI: $2.9 \\%-8.0 \\%)$ with 2 of these requiring surgical intervention $(0.6 \\%)$; and 250 patients without identifiable etiology for their clinical presentation $(78.6 \\% ; 95 \\%$ CI: $73.7 \\%-83.0 \\%)[67]$.\n\nAlternate Diagnoses: Several studies included information on the performance of CT for the detection of alternative diagnoses in this patient population presenting with classic symptomatology. Proportions of patients with identification of alternate etiologies for their clinical presentation ranged from a low of $23.2 \\%$ [18] to a high of $45.3 \\%$ [29]. The 2 studies with the highest performance at $42.5 \\%$ [54] and $45.3 \\%$ [29] were both conducted at tertiary care centers suggesting impact based on differences in patient population compared with rural or nontertiary centers. There is a wide range of etiologies, with the most common involving the gastrointestinal system, gynecologic, genitourinary, and hepatopancreaticobiliary systems. 
A single-institution retrospective study [27] demonstrated rates of $46.0 \\%, 21.6 \\%, 16.9 \\%$, and $7.7 \\%$ for these systems, respectively.\n\n# Fluoroscopy Contrast Enema \n\nThere is no relevant literature supporting the use of contrast enema in the evaluation of RLQ pain, fever, leukocytosis, or suspected appendicitis.\n\n## MRI Abdomen and Pelvis\n\nStudies reporting utility of MRI with IV contrast should be interpreted as MRI performed without and with IV contrast because noncontrast MRI sequences such as T2-weighted images, DWIs, and precontrast T1-weighted images are integral to all contrast-enhanced MRI examinations.\n\nThere is variability in the techniques employed and evaluated by the research groups with regards to MRI. Technical quality may also suffer in the acute setting because of patient discomfort with attendant motion artifacts. A singleinstitution retrospective study of 403 patients 3 to 49 years of age using 1.5 T and 3.0 T systems calculated the sensitivity and specificity for MRI detection of appendicitis to be $97.0 \\%$ ( $95 \\%$ CI: $89.6 \\%-99.6 \\%$ ) and $99.4 \\%$ ( $95 \\%$ CI: $97.9 \\%-99.9 \\%$ ), respectively. Imaging was performed without IV contrast. The average scan time for this study was 14 minutes. No significant difference was detected on subgroup analysis of pediatric and pregnant patients [36]. A prospective diagnostic study of 468 patients, 7 to 59 years of age, assessed the performance of T2 HASTE imaging on a 1.5 T system for diagnosing appendicitis. The sensitivity and specificity were reported as $98 \\%$ (CI $95 \\%$ ) and $92 \\%$ (CI $95 \\%$ ), respectively, when compared with direct visualization ( $\\mathrm{n}=90$ ). Axial and coronal T2 HASTE images were acquired with a reported table time of $<2$ minutes [37]. In a single-institution retrospective study of 51 patients undergoing noncontrast MRI on a 1.5 T system for acute appendicitis compared T2-weighted images alone with T2-weighted images and DWI. 
The accuracy for diagnosing acute appendicitis improved from $78.4 \\%$ to $82.4 \\%$ to $86.3 \\%$ using combined T2-weighted images and DWI [23]. Two studies evaluated the diagnostic capability of MRI and reader experience on performance. A single-institution prospective study of 52 patients, aged 18 to 88 years, calculated the sensitivity and specificity for detecting appendicitis. For experienced readers, these were $85 \\%$ and $97 \\%$, respectively. The sensitivity and specificity for less-experienced MRI readers were $77 \\%$ and', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=9, markdown="$79 \\%$, respectively. MRI was performed on a 1.5 T system, without and with IV contrast with administration of Buscopan to diminish peristalsis [3]. The second study, a prospective multicenter diagnostic accuracy study, was performed to determine the accuracy and interobserver agreement between MR-expert and MR-nonexpert radiologists. The study included 223 patients who were $\\geq 18$ years of age. Imaging was performed on a 1.5 T system without IV contrast. The sensitivity for detecting appendicitis for nonexpert radiologists was $89 \\%$ ( $95 \\%$ CI: $84 \\%$ $93 \\%$ ), compared with expert radiologists at $97 \\%$ ( $95 \\%$ CI: $0.91 \\%-0.99 \\%$ ). The specificity for nonexpert radiologists was $83 \\%$ ( $95 \\%$ CI: $77 \\%-88 \\%$ ), compared with expert radiologists at $93 \\%$ ( $95 \\%$ CI: $87 \\%-97 \\%$ ). The PPV for nonexpert radiologists was $86 \\%$ ( $95 \\%$ CI: $81 \\%-90 \\%$ ), compared with expert radiologists at $94 \\%$ ( $95 \\%$ CI: $88 \\%$ $97 \\%$ ). The NPV for nonexpert radiologists was $88 \\%$ ( $95 \\%$ CI: $82 \\%-91 \\%$ ), compared with that for expert radiologists at $96 \\%$ ( $95 \\%$ CI: $90 \\%-98 \\%$ ). Interobserver agreement expressed as Cohen's $\\kappa$ was 0.71 ( $95 \\%$ CI: $0.73-0.84$ ), consistent with good (but not excellent) agreement [39]. 
For this document, it is assumed that the procedure is performed and interpreted by an expert.\nTwo prospective multicenter studies were identified. The first, a prospective diagnostic performance study of 230 patients, was conducted to compare the performance of MRI with an US with conditional CT imaging strategy. The sensitivity and specificity for MRI were $97 \\%$ and $93 \\%$, respectively. These values were similar to the US with conditional CT strategy. There were no statistically significant changes in sensitivity and specificity on the subgroup analysis of male and female patients. The MRI protocol included DWI without postcontrast imaging performed on 1.5 T systems [39]. The second, a prospective diagnostic accuracy trial of 130 patients who were $\\geq 18$ years of age, was performed to determine the accuracy of MRI ( 1.5 T system) compared with US with conditional CT in the differentiation of simple versus perforated appendicitis. The sensitivity and specificity of MRI for perforated appendicitis were $57 \\%$ ( $95 \\%$ CI: $39 \\%-73 \\%$ ) and $86 \\%$ ( $95 \\%$ CI: $77 \\%-91 \\%$ ), respectively. The PPV and NPV were $57 \\%$ ( $95 \\%$ CI: $39 \\%-73 \\%$ ) and $86 \\%$ ( $95 \\%$ CI: $77 \\%-91 \\%$ ). These values were not significantly different compared with US with conditional CT technique [38].\nA meta-analysis of 30 studies from 1997 through 2015 included 2,665 pediatric, adult, and pregnant patients. The sensitivity and specificity for MRI detection of appendicitis were $96 \\%$ ( $95 \\%$ CI: $95 \\%-97 \\%$ ) and $96 \\%$ ( $95 \\%$ CI: $95 \\%-97 \\%$ ), respectively. This study did not find a statistically significant difference for the diagnostic accuracy of appendicitis between studies performed without versus those performed with IV contrast [40].\nA recent meta-analysis of second-line imaging modalities in the diagnosis of acute appendicitis after initial US included 6 MRI studies and 427 patients. 
The pooled sensitivities and specificities for second-line MRI were $89.9 \\%$ ( $95 \\%$ CI: $84.8 \\%-93.5 \\%$ ) and $93.6 \\%$ ( $95 \\%$ CI: $90.9 \\%-95.5 \\%$ ), respectively [65].\nThere is no relevant literature comparing MRI with 1.5 T versus 3.0 T systems for the detection of acute appendicitis. No randomized control studies comparing MRI with CT, US, or US with conditional CT were included in the literature search strategy.\n\n# Radiography Abdomen \n\nWith the shift to cross-sectional imaging modalities for evaluation of patients with suspected appendicitis, there is little current literature on radiographic signs. A prospective single-institution study [41] of the fecal loading sign, cecum distended with stool containing innumerable punctate lucencies, evaluated 470 adult and pediatric patients with acute abdominal pain. Patients were divided into 4 groups, with the appendicitis group subdivided into patients with preoperative only and both preoperative and postoperative abdominal radiographs. Fecal loading sign had a sensitivity, specificity, PPV, and NPV of $97.05 \\%, 85.33 \\%, 78.94 \\%$, and $98 \\%$, respectively. Fecal loading in the cecum was associated with all stages of appendicitis and disappeared after appendectomy. This sign was uncommon in other acute inflammatory diseases of the right side of the abdomen evaluated, which includes right nephrolithiasis (19\\%), right pelvic inflammatory disease (12\\%), and acute cholecystitis (13\\%).\n\n## US Abdomen\n\nUS research articles did not consistently differentiate abdominal from pelvic US protocols. The methods sections were reviewed, and, where specified, articles were separated into abdomen or pelvis. Studies of the right iliac fossa were designated pelvis. Articles referring to graded compression US technique [42,68], those that specified abdomen, and unspecified studies are included in this section. 
Graded compression is a modification to abdominal US taking advantage of patient respiratory motion, deepening abdominal compression using the transducer and both of the operator's hands upon exhalation to displace intervening organs and simulate clinical deep abdominal palpation $[42]$.", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=10, markdown='Diagnostic performance of US in preoperative evaluation of patients presenting with typical signs and symptoms of appendicitis vary widely. Ranges for measures are as follows: NAR of $4.4 \\%$ to $28.2 \\%$; sensitivity of $21.0 \\%$ to $95.7 \\%$; specificity of $71.4 \\%$ to $97.9 \\%$; PPV of $41.2 \\%$ to $94 \\%$; and NPV of $49 \\%$ to $89.6 \\%$ [68-73]. When reported, appendix visualization ranged from $35 \\%$ [73] to $52.9 \\%$, with difference by sex of $65 \\%$ in men and $51.1 \\%$ in women [71]. One study defined an equivocal group that consisted of incomplete or nonvisualization of the appendix, which comprised $81.4 \\%$ of the total study population [72].\nSubgroup analyses were performed in several studies. Comparison of US positive versus equivocal for appendicitis sensitivity and PPV ( $95 \\% \\mathrm{CI}$ ) were $48.4 \\%$ (35.8-61.3) and $83.8 \\%$ (68.0-93.8) for the positive group and $21.0 \\%$ (9.0-38.9) and $41.2 \\%$ (18.5-67.0) for the equivocal group, respectively [72]. Analysis of male versus female patients [69] resulted in sensitivity, specificity, and false-positive rates of $95.7 \\%, 88.2 \\%$, and $6.2 \\%$ in men and $84.6 \\%$, $71.4 \\%$, and $35.5 \\%$ in women, respectively. This study also demonstrated significant differences in nonobese versus obese men and women, with false diagnosis (false-positive + false-negative) of $6.2 \\%$ versus $34.4 \\%(P<.001)$ in men and $38.5 \\%$ versus $46.2 \\%(P<.001)$ in women, respectively. Evaluation of various patient characteristics resulted in 3 with statistical significance. 
Subgroup of body mass index $<22$, pain index of $>6$, and AS $>6$ yielded 2.3-, 2.9-, and 3.8 -fold greater likelihood appendix visualization at US, respectively [71].\n\nA recent meta-analysis assessed the clinical value and accuracy of bedside US for diagnosis of acute appendicitis in the emergency department [74]. Shen et al [74] included 27 studies and 7,403 patients published between 1996 and 2018, which included 8 from the United States (30\\%), 7 from Europe (26\\%), 11 from Asia (41\\%), and 1 from Africa (3\\%). The mean sensitivity and specificity of bedside US for diagnosing acute appendicitis was $90 \\%$ ( $95 \\%$ CI: $82 \\%-95 \\%$ ) and $95 \\%$ ( $95 \\%$ CI: $89 \\%-98 \\%$ ), respectively. The diagnostic performance for US as a second-line imaging modality in the diagnosis of acute appendicitis was explored in a recent meta-analysis. The meta-analysis included 3 US studies and 169 patients. The pooled sensitivities and specificities for second-line US were $83.1 \\%$ ( $95 \\%$ CI: $70.3 \\%-91.1 \\%$ ) and $93.6 \\%$ ( $95 \\%$ CI: $59.3 \\%-98.6 \\%$ ), respectively [65].\n\n# US Pelvis \n\nUS research articles did not consistently differentiate abdominal from pelvic US protocols. The methods sections were reviewed, and, where specified, articles were separated into abdomen or pelvis. Studies of the right iliac fossa were designated pelvis. Three retrospective studies of pelvic US were identified, with 1 single-institution study combining transabdominal and transvaginal imaging in 292 women [75] and 2 multi-institutional studies [50,51] evaluating 573 and 620 male and female patients with iliac fossa US, respectively. The greatest sensitivity of $97.3 \\%$, specificity of $91.0 \\%$, PPV of $91.7 \\%$, and NPV of $97 \\%$ were achieved when combining transabdominal US and transvaginal US performed by a single experienced operator in adult women ( $95 \\% \\mathrm{CI}$ ). 
Nonvisualization of the appendix ranged from $20.3 \\%$ [50] to $45 \\%$ [51]. There is wide variability, with ranges of sensitivity of $31.8 \\%$ to $83.9 \\%$, specificity of $56.7 \\%$ to $96.7 \\%$, PPV of $25 \\%$ to $95.8 \\%$, and NPV of $57.2 \\%$ to $83.3 \\%$ related to presence and combination of clinical and US signs of appendicitis that include pain, hypertrophic fat, and diminished peristalsis [50]. NARs ranged from $8.3 \\%$ [75] to $38.4 \\%$ [51].\n\n## WBC Scan Abdomen and Pelvis\n\nThere is no recent literature regarding the use of Tc-99m WBC scan abdomen and pelvis in the evaluation of RLQ pain, fever, leukocytosis, and suspected appendicitis. However, in a blinded prospective study of 30 patients with suspected appendicitis, Foley et al [52] showed that the Tc-99m WBC scan achieved a sensitivity of $81 \\%$, a specificity of $100 \\%$, and an accuracy of $89 \\%$. Because delayed imaging, up to 4 hours post injection, may be required for diagnosis with this procedure, utility may be in identification of alternate diagnoses of abdominal pain other than appendicitis, especially given the diagnostic performance and rapidity of CT.\n\n## Variant 3: Pregnant woman. Right lower quadrant pain, fever, leukocytosis. Suspected appendicitis. Initial imaging.\n\nIn this clinical scenario, the patient is pregnant and presents with RLQ pain in which the leading clinical diagnostic consideration is appendicitis. Issues related to the safety to the fetus during diagnostic workup is a major consideration. Alternative etiologies such as nonappendiceal gastrointestinal, genitourinary, hepato-pancreatic, and gynecologic conditions remain less likely diagnostic possibilities.\nAppendicitis is among the most frequently encountered nonobstetric surgical condition in pregnant women [76,77]. 
The EAST multicenter study [77], a post hoc analysis of 3,597 subjects, showed that pregnant women accounted for 1 in 20 women of childbearing age presenting with appendicitis and it most commonly manifested in early to mid-pregnancy. Pregnant women with appendicitis had similar clinical outcomes compared with nonpregnant', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=11, markdown='women, although they were more likely to undergo nonsurgical management [77]. Pregnant women are more likely to present with complicated (perforated or gangrenous) appendicitis, and in those with perforation, there is higher risk of fetal loss underlining the need for early diagnosis and treatment [76,77]. Imaging plays an important role in the diagnosis and management of pregnant patients with RLQ and suspected appendicitis. Vasileiou et al [77], leading the EAST multicenter study, reported that a combination of US abdomen and MRI was the most commonly used imaging modality ( $41 \\%$ ) followed by MRI alone ( $29 \\%$ ), US alone ( $22 \\%$ ), CT (5\\%), and no imaging ( $2 \\%$ ).\n\n# CT Abdomen and Pelvis \n\nThe literature specific to the use of CT in the evaluation of RLQ pain, fever, and leukocytosis in pregnant patients is limited. Several studies have included pregnant patients in their study populations. The first is Kontopodis et al [33], with a study of patients with atypical presentation that included 10 pregnant patients. These patients were proportionally distributed in the 4 subgroups, low or high AS with or without imaging, and demonstrated no significant difference from the nonpregnant patients. The second is Ramalingam et al [78], who evaluated a multimodality diagnostic strategy for pregnant patients, 9 of whom had CT after US (1 patient) or MRI (8 patients). 
No additional cases of appendicitis were detected by CT following US alone, MRI alone, or MRI following inconclusive US.\n\nIn a recent study, Poletti et al [79] evaluated unenhanced LDCT scan with oral contrast in assessment of pregnant women presenting with RLQ pain when MRI was not immediately available. In this single-institution study 37 pregnant patients 20 to 44 years of age with clinical suspicion of appendicitis were included. Among the 37 patients, $30 \\%(\\mathrm{n}=11)$ were in the first trimester, $38 \\%(\\mathrm{n}=14)$ were in the second trimester, and $32 \\%(\\mathrm{n}=12)$ were in the last trimester. LDCT was performed in $78 \\%(\\mathrm{n}=28)$ of patients with indeterminate or negative US with high/moderate clinical suspicious of appendicitis. LDCT was conclusive for diagnosis of appendicitis in $83 \\%(\\mathrm{n}=$ $24 / 29$ ) of patients and indeterminate in $17 \\%(\\mathrm{n}=5 / 29)$. In all patients $(\\mathrm{n}=9)$ undergoing surgery for appendicitis, LDCT suggested the diagnosis. In 2 patients, LDCT showed an alternate diagnosis (ureteral stone and terminal ileitis). In patients with indeterminate CT results, MRI was obtained in 3 patients and standard dose CT in 2 patients. In 2 of these patients, appendicitis was reported on imaging (MRI; $\\mathrm{n}=1, \\mathrm{CT} ; \\mathrm{n}=1$ ) and confirmed at surgery [79].\n\n## Fluoroscopy Contrast Enema\n\nThere is no relevant literature supporting the use of contrast enema in the evaluation of RLQ pain, fever, and leukocytosis in pregnant women.\n\n## MRI Abdomen and Pelvis\n\nSix retrospective studies that are specific to MRI diagnostic performance for appendicitis in pregnant women were identified. One study was multi-institutional in nature, and the remaining 5 were single-institution series.\nThe multi-institution study [80] reviewed 709 pregnant women 16 to 49 years of age with proven appendicitis and preoperative MRI. 
Gestational age ranged from 1 to 39 weeks, with a mean of $17 \\pm 8.5$ weeks: $49.5 \\%$ second trimester, $34.9 \\%$ first trimester, and $15.6 \\%$ third trimester. Sixty-six of 709 ( $9.3 \\%$ ) patients were diagnosed with appendicitis on MRI, with 61 of 66 proven pathologically. The 5 patients with false-positive results had pathologic diagnoses of torsed right ovary $(\\mathrm{n}=1)$, appendicolith with mild lymphoid hyperplasia $(\\mathrm{n}=1)$, fibrous obliteration of the appendiceal lumen without changes of appendicitis $(\\mathrm{n}=1)$, and normal appendices $(\\mathrm{n}=2)$. Pooled sensitivity, specificity, accuracy, PPV, and NPV were $96.8 \\%, 99.2 \\%, 99.0 \\%, 92.4 \\%$, and $99.7 \\%$, respectively. The pooled AUC was 0.98 ( $95 \\%$ CI: $0.96-1.0$, range $0.83-1[P=.12-.99]$ ). Other diagnoses were identified in 72 of the remaining 643 patients ( $10.1 \\%$ ). The appendix was not visualized in 207 of $709(29.2 \\%)$ patients.\nThe single-institution studies demonstrated similar performance of MRI in pregnant patients. Theilen et al [81] evaluated 171 pregnant patients with suspected appendicitis who had MRI (1.5T), showing that 53 of 171 (30.9\\%) patients had nonvisualization of the appendix. Of the 118 remaining patients, 18 had MRI evidence of appendicitis and appendectomy. Of these 18 patients, 12 ( $66.7 \\%$ ) were confirmed, yielding MRI sensitivity of $91.7 \\%$, specificity of $95.3 \\%$, PPV of $68.8 \\%$, and NPV of $99.0 \\%$. Of the remaining 6 women who underwent appendectomy, 3 women had no histopathologic abnormality, 1 woman had subserosal histiocytes, 1 woman had fibrous obliteration of the appendiceal lumen, and 1 woman had epithelial hyperplasia and mucocele. An alternate diagnosis on MRI was identified in 74 of 171 (43\\%) women. Ramalingam et al [78] evaluated a multimodality imaging algorithm for the diagnosis of appendicitis in 127 pregnant women. All patients were evaluated with US. 
US demonstrated 2 patients $(1.9 \\%)$ with evidence of appendicitis. Additionally, 103 of the 125 patients with nondiagnostic US underwent MRI. CT was reserved for patients with equivocal US and MRI, 9 patients ( $8.7 \\%$ ). The sensitivity, specificity, PPV, and NPV for US were $12.5 \\%, 99.2 \\%, 50 \\%$, and $94.4 \\%$, respectively; for MRI they were $100 \\%, 93.6 \\%, 57.1 \\%$, and', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=12, markdown='$100 \\%$, respectively. Diagnostic performance of the multimodality strategy yielded a sensitivity, specificity, PPV, and NPV of $100 \\%, 98.3 \\%, 80 \\%$, and $100 \\%$, respectively. MRI identified 10 additional diagnoses as likely causes of pain.\n\nA comparison study was performed of US in 117 and MRI in 114 of 140 pregnant patients with suspected appendicitis [82]. Appendix visualization rates were $7 \\%$ (8 of 117) for US and $80 \\%$ (91 of 114) for MRI. Identification of alternate pathology was $2.6 \\%$ (3 of 117) for US and $12 \\%$ (14 of 114) for MRI. Diagnostic performance of US yielded a sensitivity of $18 \\%$, a specificity of $99 \\%$, a PPV of $66 \\%$, and an NPV of $92 \\%$. Diagnostic performance of MRI yielded a sensitivity of $100 \\%$, a specificity of $98 \\%$, a PPV of $89 \\%$, and an NPV of $100 \\%$. Diagnosis of appendicitis ( 16 of 18 patients) by MRI was proven by pathology. The 2 false-positive cases were found to be a neuroendocrine tumor and fibrous obliteration of the appendix by endometriosis. A single-institution retrospective review of 267 pregnant patients compared NAR before and after introduction of MRI for preoperative evaluation [83]. MRI was performed on 217 patients, 185 following nondiagnostic US. Surgery was performed on 31 patients in the pre-MRI era. The appendix was visualized on MRI in 70 of 217 (32\\%) cases. NAR before MRI was $55 \\%$ (17 of 31 ). 
Following introduction of MRI, it was $29 \\%$ (15 of 51 ), a $47 \\%$ decrease. MRI yielded a sensitivity of $89 \\%$ (17 of 19), a specificity of $97 \\%$ (187 of 193), a PPV of $74 \\%$ (17 of 23), and an NPV of $99 \\%$ (187 of 189).\n\nA single-institution retrospective study of MRI on a 1.5 T system in 125 pregnant patients with suspected appendicitis investigated the value of additional DWI [76]. The sensitivity, specificity, and accuracy of MRI with DWI ( $\\mathrm{n}=53,100 \\%, 95 \\%$, and $96 \\%$ ) were similar to MRI without DWI ( $\\mathrm{n}=72,100 \\%, 94.7 \\%$, and $95.8 \\%, P=$ .146).\n\nThe ACR Committee on Drugs and Contrast Media recommends the following concerning the performance of contrast-enhanced MRI examinations in pregnant patients: each case should be reviewed carefully by members of the clinical and radiology service groups, and a gadolinium-based contrast agent should be administered only when there is a potential significant benefit to the patient or fetus that outweighs the possible but unknown risk of fetal exposure to free gadolinium ions [84].\n\n# Radiography Abdomen \n\nThere is no relevant literature supporting the use of radiographs in the evaluation of RLQ pain, fever, and leukocytosis in pregnant women.\n\n## US Abdomen\n\nResearch articles investigating role of US in pregnant patients with appendicitis did not consistently differentiate abdominal from pelvic US protocols. The methods sections were reviewed, and, where specified, articles were separated into abdomen or pelvis. Studies of the right iliac fossa were designated pelvis. Articles referring to graded compression US technique [42], those that specified abdomen, and unspecified studies are included in this section.\n\nThree current studies evaluating US for the diagnosis of appendicitis in pregnant patients identified by the search methodology are included. 
Hiersch et al [85] compared diagnostic performance of US in pregnant $(\\mathrm{n}=81)$ and nonpregnant women $(\\mathrm{n}=243)$. There was no statistically significant difference in predictive performance of US between the 2 groups with a PPV and NPV of $88.2 \\%$ and $100 \\%(P=.011)$ and $92.9 \\%$ and $57.1 \\%(P<.001)$ [85]. In a similar study, Segev et al [86] found no statistically significant difference in predictive performance of US between the pregnant $(\\mathrm{n}=67)$ and nonpregnant women $(\\mathrm{n}=133)$ presenting with suspected appendicitis, with an AUC of 0.76 and 0.73 , respectively $(P=.78)$. Segev et al [86] also performed a subgroup analysis of each trimester and showed that there was no significant difference in the diagnostic performance of US by trimester. First trimester $(\\mathrm{n}=23)$ : AUC 0.73 , second trimester $(\\mathrm{n}=32)$ : AUC 0.67 , and third trimester $(\\mathrm{n}=12)$ : AUC $0.86(P=.4)$.\n\nLehnert et al [87] compared US performance in 99 pregnant women in their second or third trimester. The prevalence of appendicitis was $7.1 \\%$ (7 of 99). US detected only $28.7 \\%$ (2 of 7 ) of appendicitis cases and none of the remaining cases because of nonvisualization of the appendix, $71.3 \\%$ (5 of 7 ).\n\nAs noted above, US performance is confounded by appendix visualization. Rates of nonvisualization in the 2 studies, where it is reported, were $34.1 \\%$ of pregnant and $40.4 \\%$ of nonpregnant patients [85] and $97 \\%$ of all patients not stratified by trimester [87]. 
There is improved performance when stratified by trimester, $25 \\%$ for first trimester versus $63 \\%$ for third trimester [85], and in the presence of fever in pregnant patients, AUC 0.92 versus $0.72(P=$ .07) [86].', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=13, markdown="# US Pelvis \n\nThere is no recent literature supporting the use of pelvic US in the evaluation of RLQ pain, fever, or leukocytosis in pregnant women.\n\n## WBC Scan Abdomen and Pelvis\n\nThere is no recent literature regarding the use of $\\mathrm{Tc}-99 \\mathrm{~m}$ WBC scan abdomen and pelvis in the evaluation of RLQ pain, fever, and leukocytosis in pregnant women. A historical study retrospectively reviewed performance of Tc99 m WBC scans of 13 pregnant patients with suspected appendicitis. The WBC scan demonstrated a sensitivity, specificity, PPV, and NPV of $50 \\%, 73 \\%, 25 \\%$, and $89 \\%$, respectively [73]. The false-positive rate was $27 \\%$ and the false-negative rate was $50 \\%$. The study is limited by the small sample size but nonetheless demonstrates that Tc99 m WBC scan is not reliable in the pregnant patient with suspected appendicitis.\n\n## Summary of Recommendations\n\n- Variant 1: CT abdomen and pelvis with IV contrast is usually appropriate for the initial imaging of right lower quadrant pain.\n- Variant 2: CT abdomen and pelvis with IV contrast is usually appropriate for the initial imaging of right lower quadrant pain with fever, leukocytosis, and suspected appendicitis.\n- Variant 3: US abdomen or MRI abdomen and pelvis without IV contrast is usually appropriate for the initial imaging of a pregnant woman with right lower quadrant pain with fever, leukocytosis, and suspected appendicitis. 
These procedures are equivalent alternatives (ie, only one procedure will be ordered to provide the clinical information to effectively manage the patient's care).\n\n\n## Supporting Documents\n\nThe evidence table, literature search, and appendix for this topic are available at https://acsearch.acr.org/list. The appendix includes the strength of evidence assessment and the final rating round tabulations for each recommendation.\n\nFor additional information on the Appropriateness Criteria methodology and other supporting documents go to www.acr.org/ac.\n\n## Safety Considerations in Pregnant Patients\n\nImaging of the pregnant patient can be challenging, particularly with respect to minimizing radiation exposure and risk. For further information and guidance, see the following ACR documents:\n\n- ACR-SPR Practice Parameter for the Safe and Optimal Performance of Fetal Magnetic Resonance Imaging (MRI) $[88]$\n- ACR-SPR Practice Parameter for Imaging Pregnant or Potentially Pregnant Adolescents and Women with Ionizing Radiation [89]\n- ACR-ACOG-AIUM-SMFM-SRU Practice Parameter for the Performance of Standard Diagnostic Obstetrical Ultrasound [90]\n- ACR Manual on Contrast Media [84]\n- ACR Manual on MR Safety [91]", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=14, markdown='Appropriateness Category Names and Definitions\n\n| Appropriateness Category Name | Appropriateness Rating | Appropriateness Category Definition |\n| :--: | :--: | :--: |\n| Usually Appropriate | 7, 8, or 9 | The imaging procedure or treatment is indicated in the specified clinical scenarios at a favorable risk-benefit ratio for patients. |\n| May Be Appropriate | 4, 5, or 6 | The imaging procedure or treatment may be indicated in the specified clinical scenarios as an alternative to imaging procedures or treatments with a more favorable risk-benefit ratio, or the risk-benefit ratio for patients is equivocal. 
|\n| May Be Appropriate (Disagreement) | 5 | The individual ratings are too dispersed from the panel median. The different label provides transparency regarding the panel\'s recommendation. "May be appropriate" is the rating category and a rating of 5 is assigned. |\n| Usually Not Appropriate | 1,2 , or 3 | The imaging procedure or treatment is unlikely to be indicated in the specified clinical scenarios, or the risk-benefit ratio for patients is likely to be unfavorable. |\n\n# Relative Radiation Level Information \n\nPotential adverse health effects associated with radiation exposure are an important factor to consider when selecting the appropriate imaging procedure. Because there is a wide range of radiation exposures associated with different diagnostic procedures, a relative radiation level (RRL) indication has been included for each imaging examination. The RRLs are based on effective dose, which is a radiation dose quantity that is used to estimate population total radiation risk associated with an imaging procedure. Patients in the pediatric age group are at inherently higher risk from exposure, because of both organ sensitivity and longer life expectancy (relevant to the long latency that appears to accompany radiation exposure). For these reasons, the RRL dose estimate ranges for pediatric examinations are lower as compared with those specified for adults (see Table below). 
Additional information regarding radiation dose assessment for imaging examinations can be found in the ACR Appropriateness Criteria ${ }^{\\circledR}$ Radiation Dose Assessment Introduction document [92].\n\n| Relative Radiation Level Designations | | |\n| :--: | :--: | :--: |\n| Relative Radiation Level* | Adult Effective Dose Estimate Range | Pediatric Effective Dose Estimate Range |\n| 0 | 0 mSv | 0 mSv |\n| $\\sim$ | $<0.1 \\mathrm{mSv}$ | $<0.03 \\mathrm{mSv}$ |\n| $\\sim$ | $0.1-1 \\mathrm{mSv}$ | $0.03-0.3 \\mathrm{mSv}$ |\n| $\\sim$ | $1-10 \\mathrm{mSv}$ | $0.3-3 \\mathrm{mSv}$ |\n| $\\sim$ | $10-30 \\mathrm{mSv}$ | $3-10 \\mathrm{mSv}$ |\n| $\\sim$ | $30-100 \\mathrm{mSv}$ | $10-30 \\mathrm{mSv}$ |\n\n*RRL assignments for some of the examinations cannot be made, because the actual patient doses in these procedures vary as a function of a number of factors (eg, region of the body exposed to ionizing radiation, the imaging guidance that is used). The RRLs for these examinations are designated as "Varies."\n\n## References\n\n1. Morley EJ, Bracey A, Reiter M, Thode HC, Jr., Singer AJ. Association of Pain Location With Computed Tomography Abnormalities in Emergency Department Patients With Abdominal Pain. J Emerg Med 2020;59:485-90.', images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=15, markdown="2. Buckius MT, McGrath B, Monk J, Grim R, Bell T, Ahuja V. Changing epidemiology of acute appendicitis in the United States: study period 1993-2008. J Surg Res 2012;175:185-90.\n3. Heverhagen JT, Pfestroff K, Heverhagen AE, Klose KJ, Kessler K, Sitter H. Diagnostic accuracy of magnetic resonance imaging: a prospective evaluation of patients with suspected appendicitis (diamond). J Magn Reson Imaging 2012;35:617-23.\n4. Kim K, Kim YH, Kim SY, et al. Low-dose abdominal CT for evaluating suspected appendicitis. N Engl J Med 2012;366:1596-605.\n5. Ferris M, Quan S, Kaplan BS, et al. 
The Global Incidence of Appendicitis: A Systematic Review of Population-based Studies. Ann Surg 2017;266:237-41.\n6. Raja AS, Wright C, Sodickson AD, et al. Negative appendectomy rate in the era of CT: an 18-year perspective. Radiology 2010;256:460-5.\n7. Man E, Simonka Z, Varga A, Rarosi F, Lazar G. Impact of the Alvarado score on the diagnosis of acute appendicitis: comparing clinical judgment, Alvarado score, and a new modified score in suspected appendicitis: a prospective, randomized clinical trial. Surg Endosc 2014;28:2398-405.\n8. Tan WJ, Acharyya S, Goh YC, et al. Prospective comparison of the Alvarado score and CT scan in the evaluation of suspected appendicitis: a proposed algorithm to guide CT use. J Am Coll Surg 2015;220:218-24.\n9. Apisarnthanarak P, Suvannarerg V, Pattaranutaporn P, Charoensak A, Raman SS, Apisarnthanarak A. Alvarado score: can it reduce unnecessary CT scans for evaluation of acute appendicitis? Am J Emerg Med 2015;33:266-70.\n10. Krajewski S, Brown J, Phang PT, Raval M, Brown CJ. Impact of computed tomography of the abdomen on clinical outcomes in patients with acute right lower quadrant pain: a meta-analysis. Can J Surg 2011;54:43-53.\n11. Hendriks IG, Langen RM, Janssen L, Verrijth-Wilms IM, Wouda S, Janzing HM. Does the Use of Diagnostic Imaging Reduce the Rate of Negative Appendectomy? Acta Chir Belg 2015;115:393-6.\n12. Koberlein GC, Trout AT, Rigsby CK, et al. ACR Appropriateness Criteria ${ }^{\\circledR}$ Suspected Appendicitis-Child. J Am Coll Radiol 2019;16:S252-S63.\n13. Mian M, Khosa F, Ali IT, et al. Faster magnetic resonance imaging in emergency room patients with right lower quadrant pain and suspected acute appendicitis. Pol J Radiol 2018;83:e340-e47.\n14. Yun SJ, Kim HC, Yang DM, et al. Diagnostic Usefulness of Low-Dose Nonenhanced Computed Tomography With Coronal Reformations in Patients With Suspected Acute Appendicitis: A Comparison With Standard-Dose Computed Tomography. J Comput Assist Tomogr 2016;40:485-92.\n15. 
Kolb M, Storz C, Kim JH, et al. Effect of a novel denoising technique on image quality and diagnostic accuracy in low-dose CT in patients with suspected appendicitis. Eur J Radiol 2019;116:198-204.\n16. Brassart N, Winant C, Tack D, Gevenois PA, De Maertelaer V, Keyzer C. Optimised z-axis coverage at multidetector-row CT in adults suspected of acute appendicitis. Br J Radiol 2013;86:20130115.\n17. Corwin MT, Chang M, Fananapazir G, Seibert A, Lamba R. Accuracy and radiation dose reduction of a limited abdominopelvic CT in the diagnosis of acute appendicitis. Abdom Imaging 2015;40:1177-82.\n18. O'Malley ME, Alharbi F, Chawla TP, Moshonov H. CT following US for possible appendicitis: anatomic coverage. Eur Radiol 2016;26:532-8.\n19. Elbanna KY, Mohammed MF, Chahal T, et al. Dual-Energy CT in Differentiating Nonperforated Gangrenous Appendicitis From Uncomplicated Appendicitis. AJR Am J Roentgenol 2018;211:776-82.\n20. Leeuwenburgh MM, Wiarda BM, Bipat S, et al. Acute appendicitis on abdominal MR images: training readers to improve diagnostic accuracy. Radiology 2012;264:455-63.\n21. Avcu S, Cetin FA, Arslan H, Kemik O, Dulger AC. The value of diffusion-weighted imaging and apparent diffusion coefficient quantification in the diagnosis of perforated and nonperforated appendicitis. Diagn Interv Radiol 2013;19:106-10.\n22. Inci E, Kilickesmez O, Hocaoglu E, Aydin S, Bayramoglu S, Cimilli T. Utility of diffusion-weighted imaging in the diagnosis of acute appendicitis. Eur Radiol 2011;21:768-75.\n23. Inoue A, Furukawa A, Nitta N, et al. Accuracy, criteria, and clinical significance of visual assessment on diffusion-weighted imaging and apparent diffusion coefficient quantification for diagnosing acute appendicitis. Abdom Radiol (NY) 2019;44:3235-45.\n24. American College of Radiology. ACR Appropriateness Criteria ${ }^{\\circledR}$ : Acute Pelvic Pain in the Reproductive Age Group. Available at: https://acsearch.acr.org/docs/69503/Narrative/. Accessed March 31, 2022.\n25. 
American College of Radiology. ACR Appropriateness Criteria ${ }^{\\circledR}$ : Acute Onset Flank Pain-Suspicion of Stone Disease (Urolithiasis). Available at: https://acsearch.acr.org/docs/69362/Narrative/. Accessed March 31, 2022.", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=16, markdown="26. Keyzer C, Cullus P, Tack D, De Maertelaer V, Bohy P, Gevenois PA. MDCT for suspected acute appendicitis in adults: impact of oral and IV contrast media at standard-dose and simulated low-dose techniques. AJR Am J Roentgenol 2009;193:1272-81.\n27. Pooler BD, Lawrence EM, Pickhardt PJ. Alternative diagnoses to suspected appendicitis at CT. Radiology 2012;265:733-42.\n28. Rud B, Vejborg TS, Rappeport ED, Reitsma JB, Wille-Jorgensen P. Computed tomography for diagnosis of acute appendicitis in adults. Cochrane Database Syst Rev 2019;2019.\n29. Barksdale AN, Hackman JL, Gaddis M, Gratton MC. Diagnosis and disposition are changed when board-certified emergency physicians use CT for non-traumatic abdominal pain. Am J Emerg Med 2015;33:1646-50.\n30. Wang RC, Kornblith AE, Grupp-Phelan J, Smith-Bindman R, Kao LS, Fahimi J. Trends in Use of Diagnostic Imaging for Abdominal Pain in U.S. Emergency Departments. AJR Am J Roentgenol 2021;216:200-08.\n31. Addiss DG, Shaffer N, Fowler BS, Tauxe RV. The epidemiology of appendicitis and appendectomy in the United States. Am J Epidemiol 1990;132:910-25.\n32. Jones R, Olatunbode D, Dean J, et al. A feasibility randomised controlled trial to evaluate the role of computed tomography in adults with atypical right iliac fossa pain. Ann R Coll Surg Engl 2019;101:546-51.\n33. Kontopodis N, Kouraki A, Panagiotakis G, Chatziioannou M, Spiridakis K. Efficacy of preoperative computed tomography imaging to reduce negative appendectomies in patients undergoing surgery for left lower quadrant abdominal pain. G Chir 2014;35:223-8.\n34. Hlibczuk V, Dattaro JA, Jin Z, Falzon L, Brown MD. 
Diagnostic accuracy of noncontrast computed tomography for appendicitis in adults: a systematic review. Ann Emerg Med 2010;55:51-59 el.\n35. Leeuwenburgh MM, Wiarda BM, Jensch S, et al. Accuracy and interobserver agreement between MR-nonexpert radiologists and MR-experts in reading MRI for suspected appendicitis. Eur J Radiol 2014;83:103-10.\n36. Petkovska I, Martin DR, Covington MF, et al. Accuracy of Unenhanced MR Imaging in the Detection of Acute Appendicitis: Single-Institution Clinical Performance Review. Radiology 2016;279:451-60.\n37. Byott S, Harris I. Rapid acquisition axial and coronal T2 HASTE MR in the evaluation of acute abdominal pain. Eur J Radiol 2016;85:286-90.\n38. Leeuwenburgh MM, Wiezer MJ, Wiarda BM, et al. Accuracy of MRI compared with ultrasound imaging and selective use of CT to discriminate simple from perforated appendicitis. Br J Surg 2014;101:e147-55.\n39. Leeuwenburgh MM, Wiarda BM, Wiezer MJ, et al. Comparison of imaging strategies with conditional contrastenhanced CT and unenhanced MR imaging in patients suspected of having appendicitis: a multicenter diagnostic performance study. Radiology 2013;268:135-43.\n40. Duke E, Kalb B, Arif-Tiwari H, et al. A Systematic Review and Meta-Analysis of Diagnostic Performance of MRI for Evaluation of Acute Appendicitis. AJR Am J Roentgenol 2016;206:508-17.\n41. Petroianu A, Alberti LR. Accuracy of the new radiographic sign of fecal loading in the cecum for differential diagnosis of acute appendicitis in comparison with other inflammatory diseases of right abdomen: a prospective study. J Med Life 2012;5:85-91.\n42. Puylaert JB. Acute appendicitis: US evaluation using graded compression. Radiology 1986;158:355-60.\n43. Rioux M. Sonographic detection of the normal and abnormal appendix. AJR Am J Roentgenol 1992;158:7738 .\n44. Jeffrey RB, Jr., Laing FC, Townsend RR. Acute appendicitis: sonographic criteria based on 250 cases. Radiology 1988;167:327-9.\n45. Jeffrey RB, Jr., Laing FC, Lewis FR. 
Acute appendicitis: high-resolution real-time US findings. Radiology 1987;163:11-4.\n46. Garcia Pena BM, Mandl KD, Kraus SJ, et al. Ultrasonography and limited computed tomography in the diagnosis and management of appendicitis in children. JAMA 1999;282:1041-6.\n47. Lee JH, Jeong YK, Park KB, Park JK, Jeong AK, Hwang JC. Operator-dependent techniques for graded compression sonography to detect the appendix and diagnose acute appendicitis. AJR Am J Roentgenol 2005;184:91-7.\n48. Sezer TO, Gulece B, Zalluhoglu N, Gorgun M, Dogan S. Diagnostic value of ultrasonography in appendicitis. Adv Clin Exp Med 2012;21:633-6.\n49. Hasani SA, Fathi M, Daadpey M, Zare MA, Tavakoli N, Abbasi S. Accuracy of bedside emergency physician performed ultrasound in diagnosing different causes of acute abdominal pain: a prospective study. Clin Imaging 2015;39:476-9.\n50. Kouame N, N'Goan-Domoua AM, N'Dri K J, et al. The diagnostic value of indirect ultrasound signs during acute adult appendicitis. Diagn Interv Imaging 2012;93:e24-8.", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=17, markdown="51. D'Souza N, D'Souza C, Grant D, Royston E, Farouk M. The value of ultrasonography in the diagnosis of appendicitis. Int J Surg 2015;13:165-69.\n52. Foley CR, Latimer RG, Rimkus DS. Detection of acute appendicitis by technetium 99 HMPAO scanning. Am Surg 1992;58:761-5.\n53. Soyer P, Dohan A, Eveno C, et al. Pitfalls and mimickers at 64-section helical CT that cause negative appendectomy: an analysis from 1057 appendectomies. Clin Imaging 2013;37:895-901.\n54. Chiu YH, Chen JD, Wang SH, et al. Whether intravenous contrast is necessary for CT diagnosis of acute appendicitis in adult ED patients? Acad Radiol 2013;20:73-8.\n55. Drake FT, Alfonso R, Bhargava P, et al. Enteral contrast in the computed tomography diagnosis of appendicitis: comparative effectiveness in a prospective surgical cohort. Ann Surg 2014;260:311-6.\n56. 
Xiong B, Zhong B, Li Z, et al. Diagnostic Accuracy of Noncontrast CT in Detecting Acute Appendicitis: A Meta-analysis of Prospective Studies. Am Surg 2015;81:626-9.\n57. Wadhwani A, Guo L, Saude E, et al. Intravenous and Oral Contrast vs Intravenous Contrast Alone Computed Tomography for the Visualization of Appendix and Diagnosis of Appendicitis in Adult Emergency Department Patients. Can Assoc Radiol J 2016;67:234-41.\n58. Uyeda JW, Yu H, Ramalingam V, Devalapalli AP, Soto JA, Anderson SW. Evaluation of Acute Abdominal Pain in the Emergency Setting Using Computed Tomography Without Oral Contrast in Patients With Body Mass Index Greater Than 25. J Comput Assist Tomogr 2015;39:681-6.\n59. Park G, Lee SC, Choi BJ, Kim SJ. Stratified computed tomography findings improve diagnostic accuracy for appendicitis. World J Gastroenterol 2014;20:13942-9.\n60. Lai V, Chan WC, Lau HY, Yeung TW, Wong YC, Yuen MK. Diagnostic power of various computed tomography signs in diagnosing acute appendicitis. Clin Imaging 2012;36:29-34.\n61. Kim MY, Kim Y, Ryu JA, Kim TY. How to evaluate appendices with borderline diameters on CT: proposal of a quick solution to overcome the limitations of the established CT criteria. Acad Radiol 2014;21:1573-8.\n62. Verma R, Grechushkin V, Carter D, Barish M, Pryor A, Telem D. Use and accuracy of computed tomography scan in diagnosing perforated appendicitis. Am Surg 2015;81:404-7.\n63. Azok JT, Kim DH, Munoz Del Rio A, et al. Intraluminal air within an obstructed appendix: a CT sign of perforated or necrotic appendicitis. Acad Radiol 2012;19:1175-80.\n64. Kim MS, Park HW, Park JY, et al. Differentiation of early perforated from nonperforated appendicitis: MDCT findings, MDCT diagnostic performance, and clinical outcome. Abdom Imaging 2014;39:459-66.\n65. Eng KA, Abadeh A, Ligocki C, et al. Acute Appendicitis: A Meta-Analysis of the Diagnostic Accuracy of US, CT, and MRI as Second-Line Imaging Tests after an Initial US. Radiology 2018;288:717-27.\n66. 
Jones RP, Jeffrey RB, Shah BR, Desser TS, Rosenberg J, Olcott EW. Journal Club: the Alvarado score as a method for reducing the number of CT studies when appendiceal ultrasound fails to visualize the appendix in adults. AJR Am J Roentgenol 2015;204:519-26.\n67. Shah BR, Stewart J, Jeffrey RB, Olcott EW. Value of short-interval computed tomography when sonography fails to visualize the appendix and shows otherwise normal findings. J Ultrasound Med 2014;33:1589-95.\n68. Piyarom P, Kaewlai R. False-negative appendicitis at ultrasound: nature and association. Ultrasound Med Biol 2014;40:1483-9.\n69. Al-Ajerami Y. Sensitivity and specificity of ultrasound in the diagnosis of acute appendicitis. East Mediterr Health J 2012;18:66-9.\n70. Boonstra PA, van Veen RN, Stockmann HB. Less negative appendectomies due to imaging in patients with suspected appendicitis. Surg Endosc 2015;29:2365-70.\n71. Kaewlai R, Lertlumsakulsub W, Srichareon P. Body mass index, pain score and Alvarado score are useful predictors of appendix visualization at ultrasound in adults. Ultrasound Med Biol 2015;41:1605-11.\n72. Lourenco P, Brown J, Leipsic J, Hague C. The current utility of ultrasound in the diagnosis of acute appendicitis. Clin Imaging 2016;40:944-8.\n73. Stewart JK, Olcott EW, Jeffrey RB. Sonography for appendicitis: nonvisualization of the appendix is an indication for active clinical observation rather than direct referral for computed tomography. J Clin Ultrasound 2012;40:455-61.\n74. Shen G, Wang J, Fei F, Mao M, Mei Z. Bedside ultrasonography for acute appendicitis: An updated diagnostic meta-analysis. Int J Surg 2019;70:1-9.\n75. Bondi M, Miller R, Zbar A, et al. Improving the diagnostic accuracy of ultrasonography in suspected acute appendicitis by the combined transabdominal and transvaginal approach. Am Surg 2012;78:98-103.", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700)), OCRPageObject(index=18, markdown="76. Wi SA, Kim DJ, Cho ES, Kim KA. 
Diagnostic performance of MRI for pregnant patients with clinically suspected appendicitis. Abdom Radiol (NY) 2018;43:3456-61.\n77. Vasileiou G, Eid AI, Qian S, et al. Appendicitis in Pregnancy: A Post-Hoc Analysis of an EAST Multicenter Study. Surg Infect (Larchmt) 2020;21:205-11.\n78. Ramalingam V, LeBedis C, Kelly JR, Uyeda J, Soto JA, Anderson SW. Evaluation of a sequential multimodality imaging algorithm for the diagnosis of acute appendicitis in the pregnant female. Emerg Radiol 2015;22:125-32.\n79. Poletti PA, Botsikas D, Becker M, et al. Suspicion of appendicitis in pregnant women: emergency evaluation by sonography and low-dose CT with oral contrast. Eur Radiol 2019;29:345-52.\n80. Burke LM, Bashir MR, Miller FH, et al. Magnetic resonance imaging of acute appendicitis in pregnancy: a 5year multiinstitutional study. Am J Obstet Gynecol 2015;213:693 e1-6.\n81. Theilen LH, Mellnick VM, Longman RE, et al. Utility of magnetic resonance imaging for suspected appendicitis in pregnant women. Am J Obstet Gynecol 2015;212:345 e1-6.\n82. Konrad J, Grand D, Lourenco A. MRI: first-line imaging modality for pregnant patients with suspected appendicitis. Abdom Imaging 2015;40:3359-64.\n83. Rapp EJ, Naim F, Kadivar K, Davarpanah A, Cornfeld D. Integrating MR imaging into the clinical workup of pregnant patients suspected of having appendicitis is associated with a lower negative laparotomy rate: singleinstitution study. Radiology 2013;267:137-44.\n84. American College of Radiology. ACR Committee on Drugs and Contrast Media. Manual on Contrast Media. Available at: https://www.acr.org/-/media/ACR/Files/Clinical-Resources/Contrast_Media.pdf. Accessed March 31, 2022.\n85. Hiersch L, Yogev Y, Ashwal E, From A, Ben-Haroush A, Peled Y. The impact of pregnancy on the accuracy and delay in diagnosis of acute appendicitis. J Matern Fetal Neonatal Med 2014;27:1357-60.\n86. Segev L, Segev Y, Rayman S, Nissan A, Sadot E. 
The diagnostic performance of ultrasound for acute appendicitis in pregnant and young nonpregnant women: A case-control study. Int J Surg 2016;34:81-85.\n87. Lehnert BE, Gross JA, Linnau KF, Moshiri M. Utility of ultrasound for evaluating the appendix during the second and third trimester of pregnancy. Emerg Radiol 2012;19:293-9.\n88. American College of Radiology. ACR-SPR Practice Parameter for the Safe and Optimal Performance of Fetal Magnetic Resonance Imaging (MRI). Available at: https://www.acr.org/-/media/ACR/Files/PracticeParameters/mr-fetal.pdf. Accessed March 31, 2022.\n89. American College of Radiology. ACR-SPR Practice Parameter for Imaging Pregnant or Potentially Pregnant Adolescents and Women with Ionizing Radiation. Available at: http://www.acr.org/ /media/ACR/Documents/PGTS/guidelines/Pregnant_Patients.pdf. Accessed March 31, 2022.\n90. American College of Radiology. ACR-ACOG-AIUM-SMFM-SRU Practice Parameter for the Performance of Standard Diagnostic Obstetrical Ultrasound. Available at: https://www.acr.org/-/media/ACR/Files/Practice-Parameters/us-ob.pdf. Accessed March 31, 2022.\n91. American College of Radiology. ACR Committee on MR Safety. ACR Manual on MR Safety. Version 1.0. Available at: https://www.acr.org/-/media/ACR/Files/Radiology-Safety/MR-Safety/Manual-on-MR-Safety.pdf. Accessed March 31, 2022.\n92. American College of Radiology. ACR Appropriateness Criteria ${ }^{\\circledR}$ Radiation Dose Assessment Introduction. Available at: $\\quad$ https://www.acr.org/-/media/ACR/Files/AppropriatenessCriteria/RadiationDoseAssessmentIntro.pdf. Accessed March 31, 2022.\n\n[^0]\n[^0]: The ACR Committee on Appropriateness Criteria and its expert panels have developed criteria for determining appropriate imaging examinations for diagnosis and treatment of specified medical condition(s). These criteria are intended to guide radiologists, radiation oncologists and referring physicians in making decisions regarding radiologic imaging and treatment. 
Generally, the complexity and severity of a patient's clinical condition should dictate the selection of appropriate imaging procedures or treatments. Only those examinations generally used for evaluation of the patient's condition are ranked. Other imaging studies necessary to evaluate other co-existent diseases or other medical consequences of this condition are not considered in this document. The availability of equipment or personnel may influence the selection of appropriate imaging procedures or treatments. Imaging techniques classified as investigational by the FDA have not been considered in developing these criteria; however, study of new equipment and applications should be encouraged. The ultimate decision regarding the appropriateness of any specific radiologic examination or treatment must be made by the referring physician and radiologist in light of all the circumstances presented in an individual examination.", images=[], dimensions=OCRPageDimensions(dpi=200, height=2200, width=1700))], model='mistral-ocr-2503-completion', usage_info=OCRUsageInfo(pages_processed=19, doc_size_bytes=524679))
"out/RLQ_pain.md") write_text_file(rlq_pain.to_markdown(),
Text successfully written to out/RLQ_pain.md.
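The OCR response printed above is a list of page objects, each carrying an `index` and a `markdown` string. The following is a minimal sketch of how such pages could be joined into a single Markdown document before it is written to disk. It assumes, without confirmation from the source, that `to_markdown()` does something along these lines; `Page` and `pages_to_markdown` are hypothetical stand-ins and are not part of `mistralai` or `src.mistral_ocr`.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical stand-in for an OCR page object; only the fields used here.
    index: int
    markdown: str


def pages_to_markdown(pages, separator="\n\n"):
    """Join per-page markdown in page order, separated by blank lines."""
    ordered = sorted(pages, key=lambda p: p.index)
    return separator.join(p.markdown.strip() for p in ordered)


# Pages deliberately given out of order to show the sort by index.
pages = [Page(1, "## 1 Introduction\n"), Page(0, "# Mistral 7B\n")]
text = pages_to_markdown(pages)
print(text)
```

Sorting by `index` keeps the output stable even if pages arrive out of order; the separator is an arbitrary choice for this sketch.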
9.2 Multiple PDF
from src.mistral_ocr import MyMistralOCRpdfMulti
mistral_ocr_multi = MyMistralOCRpdfMulti(dir_input = "docs/input",
                                         dir_output = "out/",
                                         dir_image = "img/",
                                         inline_image = False)
mistral_ocr_multi.execute()
{'processed': ['docs/input/A fully automated MRI-based deep-learning algorithm for classifying germinomas and nongerminomatous germ cell tumors..pdf',
'docs/input/A comprehensive machine-learning model applied to MRI to classify germinomas of the pineal region.pdf'],
'failed': {},
'total': 2}
mistral_ocr_multi_w_img = MyMistralOCRpdfMulti(dir_input = "docs/input", dir_output="out/", inline_image=True, filename_suffix="_with_img")
mistral_ocr_multi_w_img
MyMistralOCRpdfMulti(dir_input='docs/input', dir_output='out', inline_image=True, filename_suffix='_with_img', filename_prefix='')
mistral_ocr_multi_w_img.execute()
{'processed': ['docs/input/A fully automated MRI-based deep-learning algorithm for classifying germinomas and nongerminomatous germ cell tumors..pdf',
'docs/input/A comprehensive machine-learning model applied to MRI to classify germinomas of the pineal region.pdf'],
'failed': {},
'total': 2}
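The dictionary returned by `execute()` has the keys `processed`, `failed`, and `total`, as shown in the output above. A small sketch of how that summary could be turned into a readable report follows. `summarize_batch_result` is a hypothetical helper, not part of `src.mistral_ocr`, and it assumes that `failed` maps each file path to an error message, which the source does not show because both runs succeeded.

```python
from pathlib import Path


def summarize_batch_result(result: dict) -> str:
    """Build a short report from a batch OCR summary dictionary."""
    processed = result.get("processed", [])
    failed = result.get("failed", {})
    total = result.get("total", len(processed) + len(failed))

    lines = [f"{len(processed)}/{total} PDFs processed, {len(failed)} failed"]
    for path in processed:
        lines.append(f"  ok:     {Path(path).name}")
    # Assumption: 'failed' maps a file path to an error message.
    for path, error in failed.items():
        lines.append(f"  failed: {Path(path).name} ({error})")
    return "\n".join(lines)


# Example input with the same shape as the output above (file names shortened).
example = {
    "processed": ["docs/input/a.pdf", "docs/input/b.pdf"],
    "failed": {},
    "total": 2,
}
print(summarize_batch_result(example))
```

With a real run, the call would be `summarize_batch_result(mistral_ocr_multi.execute())`, provided the return value keeps the shape shown above.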