FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Abstract

There is a rapidly growing number of large language models (LLMs) that users can query for a fee.



We review the cost associated with querying popular LLM APIs (e.g., GPT-4, ChatGPT, J1-Jumbo) and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g., GPT-4) with up to 98% cost reduction, or improve the accuracy over GPT-4 by 4% at the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.

1 Introduction

We are in the midst of an explosion of large language models (LLMs). The alluring possibilities of using LLMs for large-scale applications such as commerce, science, and finance have led a growing number of companies (OpenAI, AI21, CoHere, etc.) to offer LLMs as services. While LLMs such as GPT-4 achieve unprecedented performance on tasks such as question answering, using them for high-throughput applications can be very expensive. For example, ChatGPT is estimated to cost over $700,000 per day to operate [Cosa], and using GPT-4 to support customer service can cost a small business over $21,000 a month [Cosb]. In addition to the financial cost, using the largest LLMs incurs substantial environmental and energy impact [BGMMS21, WRG+22], affecting the social welfare of current and future generations.

There are many LLMs now available via APIs, and they charge heterogeneous prices. The cost of using an LLM API typically consists of three components: 1) prompt cost (proportional to the length of the prompt), 2) generation cost (proportional to the length of the generation), and 3) sometimes a fixed cost per query. We compared the cost associated with using 12 different commercial LLMs from mainstream providers, including OpenAI, AI21, CoHere, and Textsynth (Table 1). Their costs can differ by up to two orders of magnitude: for example, the prompt cost for 10M tokens is $30 for OpenAI's GPT-4 but only $0.2 for GPT-J hosted by Textsynth. Given the heterogeneous cost and quality, how to effectively and efficiently leverage the full set of LLM options is a key challenge for practitioners. If the tasks are relatively simple, then aggregating multiple responses from GPT-J [WK21] (whose size is 30x smaller than GPT-3) offers performance similar to GPT-3 [ANC+22], leading to financial and environmental savings. However, the performance of GPT-J can be much worse on difficult tasks [TLI+23]. Moreover, relying on a single API provider is risky if that provider becomes unavailable, potentially due to spiking demand. Existing model ensemble paradigms such as model cascade [VJ04, WLM11] and FrugalML [CZZ20, CZZ22] were designed for predictive tasks with a known set of labels and do not account for the full capabilities of LLMs. Using LLMs affordably and accurately therefore calls for new approaches.

Our contributions. In this paper, we lay out our vision of a flexible framework that uses LLM APIs to process natural language queries within a budget constraint, termed FrugalGPT. As shown in Figure
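To make the three-part pricing above concrete, the per-query cost can be written as cost = p_prompt * n_prompt + p_gen * n_gen + p_fixed, where the p's are the provider's prices and the n's are token counts. Below is a minimal Python sketch of this arithmetic. The function name is illustrative, and the generation prices in the example calls are placeholders; only the $30 and $0.2 prompt prices per 10M tokens come from the GPT-4 and Textsynth GPT-J figures quoted above.

    # Sketch of the three-part LLM API cost model: prompt cost + generation
    # cost + optional fixed per-query fee. Prices here are per 10M tokens.
    def query_cost(n_prompt_tokens, n_gen_tokens,
                   prompt_price_per_10m, gen_price_per_10m,
                   fixed_cost_per_query=0.0):
        prompt_cost = n_prompt_tokens * prompt_price_per_10m / 10_000_000
        gen_cost = n_gen_tokens * gen_price_per_10m / 10_000_000
        return prompt_cost + gen_cost + fixed_cost_per_query

    # 10M prompt tokens with no generation: $30 on GPT-4 vs. $0.2 on
    # Textsynth's GPT-J (the generation prices below are placeholders).
    print(query_cost(10_000_000, 0, prompt_price_per_10m=30.0, gen_price_per_10m=60.0))  # 30.0
    print(query_cost(10_000_000, 0, prompt_price_per_10m=0.2, gen_price_per_10m=0.2))    # 0.2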
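The LLM cascade strategy that FrugalGPT instantiates can also be illustrated with a short sketch: send each query to the cheapest model first, and escalate to more expensive models only when a scoring function judges the current answer unreliable. The sketch below is not the paper's implementation; FrugalGPT learns the model combinations and acceptance thresholds, whereas here they are fixed by hand, and call_llm and score are hypothetical stand-ins for an API wrapper and a response scorer.

    # Hand-wired LLM cascade sketch (strategy 3). NOT FrugalGPT itself:
    # FrugalGPT learns which LLM combinations and thresholds to use.
    def cascade(query, models, call_llm, score):
        """models: list of (model_name, acceptance_threshold), cheapest first.
        call_llm(name, query) -> answer; score(query, answer) -> value in [0, 1]."""
        answer = ""
        for model_name, threshold in models:
            answer = call_llm(model_name, query)
            if score(query, answer) >= threshold:
                return answer  # confident enough; skip pricier models
        return answer          # fall back to the last (strongest) model's answer

    # Illustrative configuration (model names and thresholds are hypothetical):
    # cascade(q, [("gpt-j", 0.9), ("gpt-3.5-turbo", 0.8), ("gpt-4", 0.0)],
    #         call_llm, score)

Setting the last model's threshold to 0.0 makes it an unconditional fallback, so every query receives an answer even when the cheaper models are rejected.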

