Artificial Intelligence (AI), particularly in its generative form, has emerged as a groundbreaking area for innovative advancements, drawing significant focus since ChatGPT's public unveiling in November 2022.
The rapid evolution of generative AI in 2023
by various AI-based tools, including Chat-GPT from OpenAI[1], Marc Duranton
The Dawn of Generative AI – Redefining Innovation
Since the last HiPEAC Vision, there has been an extraordinary boom in artificial intelligence. This was already forecast in last year’s edition, but events moved even faster than expected.
Figure 1: Explosion of usage of AI in 2023, from LifeArchitect.ai
The availability to the public of chatbots like ChatGPT demonstrated a leap in the quality and coherence of machine-generated text. Such systems could engage in detailed, context-aware conversations, making the interaction far more natural and human-like compared to previous systems. The trigger was the official public release of ChatGPT on November 30th, 2022. ChatGPT presented a user-friendly interface that required no technical expertise to use. Its conversational format made it accessible to a wide range of users, allowing more people to interact with AI technology in a straightforward and intuitive manner. Other companies built on this success and proposed alternative chatbots (such as Anthropic with Claude 2, Google with Bard, …), and other generative AI applications also emerged, either to generate pictures (Midjourney, Dall-E, Stable Diffusion, …) or to generate sound or voices. The result was a tremendous increase in traffic to all these websites, as shown in the following figure.
Figure 2: AI Industry: Traffic Growth between September 2022 and August 2023. From https://writerbuddy.ai/blog/ai-industry-analysis
The introduction of ChatGPT and similar generative AIs has been a major catalyst for the growth and innovation of numerous startups and established companies. Their impact is most evident in customer service, where they power chatbots that enhance customer experiences through 24/7 support and personalized interactions, leading to improved efficiency and cost savings. In content creation and digital marketing, ChatGPT is used by services that generate creative content and marketing copy, streamlining processes for businesses. In the educational sector, ChatGPT aids in creating interactive learning platforms and tutoring services, offering personalized and engaging learning experiences. Furthermore, in the financial sector, ChatGPT and similar systems are integrated into fintech applications for automated financial advising and fraud detection.
Figure 3: The 50 Most Visited AI Tools and Their 24B+ Traffic Behavior. From https://writerbuddy.ai/blog/ai-industry-analysis
It's widely recognized that the usage of AI has been rapidly increasing across various sectors and applications. This surge is driven by advancements in AI technologies, greater accessibility of AI tools, and the growing integration of AI into everyday devices and services.
The introduction by OpenAI of GPTs (custom versions of ChatGPT made by users) and of the GPT store could jeopardize the business of some companies and start-ups.
The HiPEAC Vision 2023 forecast that large language models would run at the edge. This is already possible, and we can even expect them to run on smartphones in 2024. Let’s see in a short history how all this happened.
Short history of the recent progress of artificial intelligence
In the realm of technology and innovation, Artificial Intelligence (AI) has long been a subject of fascination and relentless pursuit. It had several ups and downs (“winters”) since the creation of the term “Artificial Intelligence” which was officially introduced by John McCarthy in 1956 during the Dartmouth Conference. In 2012, a significant event in the field of Artificial Intelligence (AI) and machine learning occurred under the supervision of Geoffrey Hinton, a pioneer in the field of neural networks. That year, Hinton and his team made a groundbreaking advancement in the development of deep learning techniques, particularly through their work on a deep neural network. The critical event was the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Hinton’s team, consisting of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton himself, participated in this challenge with a model called “AlexNet”[1]. This model was a deep convolutional neural network that dramatically outperformed other competitors in the competition (AlexNet achieved a top-5 error rate of approximately 15.3%, which was significantly lower than the 26.2% error rate of the second-best entry in the competition). The success of AlexNet in the ImageNet challenge was a pivotal moment for the field of deep learning. It showcased the effectiveness of deep neural networks in practical applications, particularly in tasks involving visual recognition, and marked a turning point in AI research, leading to a surge in deep learning applications across various domains, including speech recognition, natural language processing, and more.
There are several reasons why this was the right moment for the rebirth of artificial intelligence: 1) the algorithm and topology of Convolutional Neural Networks were available; 2) large labeled databases were available for learning: training used “supervised learning”, meaning that each input image comes with a label stating what the image shows; 3) the computing power was available: the training was done on GPUs (NVIDIA GeForce GTX 580), and AlexNet required about 262 petaFLOP (total floating-point operations) of training compute for its 61M parameters.
Another seminal moment in AI research was the publication in 2017 of a landmark paper titled “Attention Is All You Need”[2], introducing the Transformer model, a novel approach to neural network architecture that significantly advanced the field of natural language processing (NLP) and machine learning. Authored by researchers at Google, including Ashish Vaswani and others, this paper presented a new method that departed from the then-standard approaches based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for processing sequential data, such as text. The key contributions and impacts of this paper include:
Introduction of the Transformer model: The paper introduced the Transformer, a type of model that relies entirely on attention mechanisms, eschewing the use of recurrent layers. This was a significant departure from existing models that processed data sequentially. The Transformer instead allowed for much more parallel processing, leading to faster and more efficient training of large models.
Self-Attention mechanism: A core component of the Transformer is the self-attention mechanism. This allows the model to weigh the importance of different parts of the input data differently, making it highly effective for understanding the context and relationships within the data, especially in language tasks.
Impact on NLP (Natural Language Processing) and beyond: The Transformer model rapidly became a foundational architecture in NLP, leading to the development of highly successful models like BERT, GPT, and others. These models have set new standards in a wide range of NLP tasks, such as language translation, question-answering, and text generation.
Enabling large-scale models: The efficiency and scalability of the Transformer architecture made it possible to train much larger models than before. This has been a key factor in the recent trend towards training very large language models that can understand and generate human language with unprecedented fluency and accuracy.
Influence across AI fields: Although initially designed for NLP tasks, the Transformer model’s architecture has influenced other areas of AI as well, including image processing and speech recognition. Its ability to handle sequential data effectively has made it a versatile tool in the AI toolkit.
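The self-attention mechanism described above can be made concrete with a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions, not the implementation from the paper: the three-token input and the identity projection matrices are hypothetical placeholders for learned values, and multi-head attention, masking and positional encoding are left out.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    weights = []
    for query in q:
        # Score this token's query against the key of every token.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in k]
        weights.append(softmax(scores))
    # Each output row is a weighted mix of all value vectors.
    return matmul(weights, v), weights

# Hypothetical toy input: 3 tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
output, attn = self_attention(tokens, identity, identity, identity)
```

Every token is scored against every other token with no step-by-step recurrence, which is what makes the computation easy to parallelize.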
One year after the publication of Google’s paper, the US company OpenAI[2] created its first Generative Pre-trained Transformer (GPT for short), followed by GPT-2 in 2019, GPT-3 in 2020, and so on.
Figure 4: Evolution of Generative Pre-trained Transformers (GPT) in OpenAI (from https://en.wikipedia.org/wiki/Generative_pre-trained_transformer )
Let’s provide a short history of OpenAI’s developments. The first model, GPT-1, was released on June 11, 2018, a year after the seminal paper on the Transformer architecture was published. GPT-1 was trained over one month using eight GPUs and had 117 million parameters.
Following GPT-1, GPT-2 was released on February 14, 2019, with a significant increase in computing power, about 88 times greater than its predecessor. GPT-2 comprised 1.5 billion parameters.
Then came GPT-3, released on May 28, 2020. It required 213 times more computational power than GPT-2 and had 175 billion parameters. The training of GPT-3 involved nearly 500 billion tokens, using extensive datasets including Common Crawl (with 570 GB of text), English Wikipedia, and several corpora of books.
An evolution of GPT-3, known as GPT-3.5, was released on March 15, 2022. Subsequently, ChatGPT, a fine-tuned version of GPT-3.5, was introduced on November 30, 2022. This release marked the beginning of a new era for large language models, making this advanced technology more accessible to the public.
Finally, on March 14, 2023, OpenAI released GPT-4. This model is estimated to take approximately 65 times more processing power than GPT-3. Its architecture, though undisclosed, is speculated to contain about 1.8 trillion parameters and to be a mixture-of-experts architecture, in which 16 different expert networks work together.
Figure 5: Computing power is driving the advance of AI; from GTC 2023 keynote with NVIDIA CEO Jensen Huang.
The landscape of AI underwent a transformative shift with the advent of generative AI, represented by ChatGPT for chatbots and by tools like Dall-E or Stable Diffusion for creating images from text prompts. This new frontier in AI became a particular focal point of global attention following the public release of ChatGPT, which was a fine-tuned version of OpenAI’s large language model and served as a general-purpose chatbot. This launch represented a significant step in AI technology, as ChatGPT was able to engage in a wide range of topics and showcase the potential of generative AI in producing human-like text responses. The introduction of ChatGPT marked a shift in public perception and use of AI chatbots. Before its launch, AI chatbots were not widely regarded as highly functional or reliable (and ChatGPT itself can still produce answers that are plausible but false; it is said to “hallucinate”). However, the capabilities demonstrated by ChatGPT in understanding and generating human-like text quickly garnered widespread attention and user adoption. By January 2023, ChatGPT had surpassed 100 million monthly active users, becoming the fastest-growing app in history at that time, outpacing even major platforms like TikTok and Instagram. This period also saw significant developments in the use of generative AI, as ChatGPT showcased its potential in various applications, from helping with homework to assisting in job applications and even drafting political speeches. Its versatility and user-friendly interface contributed to its rapid adoption and integration into daily life for many users. The success of ChatGPT also influenced major tech companies. Microsoft, for instance, invested significantly in OpenAI and integrated GPT technology into various products, including Bing and Teams.
Generative AI, a subset of artificial intelligence, is distinguished by its ability to create novel content, ranging from text and images to complex code, by learning from vast datasets. This capability also democratized the accessibility of AI-powered tools. The transformative leap can be largely attributed to the advancements in neural network technologies. Neural networks, inspired by the human brain’s structure and functioning, have become the bedrock of modern AI systems. These sophisticated models, trained on extensive datasets, can analyze patterns, make predictions, and generate content with remarkable accuracy and efficiency[3] .
What are foundation models and fine-tuning, and how do Large Language Models work?
A language model is a statistical model that enables understanding and generating text in a specific language, such as English. It operates by calculating the likelihood of each possible next word given the words that precede it in the text.
The language model itself does not produce text independently. Instead, it generates probabilities for each word in the vocabulary. To use a language model, the text must first be broken down into semantic units called “tokens,” which can be words or parts of words. These tokens are then converted into numerical vectors, allowing the model to process them.
The “context window” size of the model, which is the amount of text it can take into account at once, varies depending on the specific model. The conversion of the tokens within this window into numerical vectors is known as “vectorization”.
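As a rough illustration of these steps, the sketch below splits a text into tokens, maps them to integer ids, truncates them to a context window, and looks up one vector per token. Everything in it is a simplifying assumption: real models use learned subword tokenizers rather than whitespace splitting, and the embedding table here is filled with random numbers instead of learned values.

```python
import random

def tokenize(text):
    """Naive whitespace tokenizer; real models use learned subword units."""
    return text.lower().split()

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(token_ids, embedding_table, context_window):
    """Keep the most recent tokens that fit, then look up their vectors."""
    visible = token_ids[-context_window:]
    return [embedding_table[i] for i in visible]

text = "the model reads the text as tokens"
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

rng = random.Random(0)  # random stand-in for a learned embedding table
embedding_dim = 4
table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

# Only the last 5 of the 7 tokens fit in this (tiny, hypothetical) window.
vectors = vectorize(ids, table, context_window=5)
```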
Figure 6 : « A foundation model can centralize the information from all the data from various modalities. This one model can then be adapted to a wide range of downstream tasks. » From « On the Opportunities and Risks of Foundation Models » https://arxiv.org/abs/2108.07258
Once the text is transformed into vectors, the model generates text using a “sampler.” The sampler selects words based on their probability, but the “temperature” can be adjusted to make the generation more creative or more predictable. This temperature setting controls the randomness in the word selection process, balancing between coherent but predictable text and more diverse but less predictable outputs.
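The effect of the temperature can be sketched in a few lines of Python. The vocabulary and the raw scores (logits) below are made-up values for illustration; the sketch assumes the usual formulation in which the scores are divided by the temperature before a softmax turns them into probabilities.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Draw one token at random according to the temperature-adjusted probabilities."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "car", "cloud"]  # hypothetical candidate words
logits = [2.0, 1.5, 0.3, -1.0]          # hypothetical model scores

cold = apply_temperature(logits, 0.2)   # low temperature: mass concentrates on "cat"
hot = apply_temperature(logits, 2.0)    # high temperature: mass spreads out

rng = random.Random(42)
word = sample_token(vocab, logits, 0.7, rng)
```

With a low temperature the most likely word dominates, giving predictable text; with a high temperature unlikely words get a real chance of being picked.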
The Transformer model enables the creation of large language models, which are a prominent type of foundation model. Let’s quickly define what a foundation model is: it is a model trained on extensive databases of texts, images, and various other data types. The training method is known as self-supervised learning, which differs from the supervised approach used for conventional convolutional neural networks. In this approach, parts of the data are hidden and the model is expected to predict them, for instance what comes next in a sequence of items, such as the end of a sentence. The hidden part is then revealed, and the model corrects itself by adjusting its parameters according to its error. Because the data itself provides the expected answer, the approach is called “self-supervised”.
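To show why no human labeling is needed, here is a small sketch that turns a raw sentence into training examples for next-token prediction. The function name and the context size are illustrative choices, not taken from any particular training pipeline.

```python
def next_token_examples(tokens, context_size):
    """Build (context, target) training pairs from raw text, with no human labels."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))  # the hidden next token is the label
    return examples

sentence = "the cat sat on the mat".split()
pairs = next_token_examples(sentence, context_size=3)
```

Each pair hides one token and asks the model to guess it from what came before, so a text of N tokens yields N-1 examples for free.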
This breakthrough in model design, especially with Transformer-based models, eliminates the need to explicitly label the entire database (unlike for Convolutional Neural Networks, CNNs). Foundation models are created by feeding a large database into Transformer-based models. These models are not initially designed for specific applications; they accumulate knowledge but are not fine-tuned for specific tasks. They can only centralize and “compress” information from various data sources and modalities. After the resource-intensive training phase, these foundation models are adapted for specific tasks, such as functioning as a chatbot, extracting information, answering questions, or recognizing objects. This adaptation process is known as fine-tuning. Fine-tuning is far less intensive than the initial creation of the foundation model, and requires only a smaller set of data.
Taking ChatGPT as an example, it was a fine-tuned version of GPT-3.5 using a method called Reinforcement Learning from Human Feedback (RLHF)[4]. In the first step, a prompt is taken from a dataset, and a human labeler demonstrates the desired output behavior. This data is then used to fine-tune the neural network, in this case GPT-3.5, through supervised learning. In the second step, the network generates several outputs for a prompt, which are ranked by human labelers from best to worst, and this data is used to train a reward model. The final step involves a loop where the network generates outputs that are evaluated by the reward model. This model calculates rewards for the outputs, and these rewards are used to update the policy. The human labeling in the first and second steps requires significant human resources, and OpenAI outsourced some of this work to a San Francisco-based firm called Sama, which employed workers in Kenya, Uganda, and India to label data for various Silicon Valley clients including Google, Meta, and Microsoft.
Figure 7: Reinforcement Learning from Human Feedback (used for ChatGPT), from https://openai.com/research/instruction-following
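The ranking step can be illustrated with a short sketch. It assumes one common formulation, in which a best-to-worst ranking is expanded into pairs and the reward model is trained to score the preferred output higher than the rejected one; the text above does not specify the exact loss, and the answer names and reward values below are made up.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the order is respected."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))

ranked = ["answer_a", "answer_b", "answer_c"]  # hypothetical labeler ranking
pairs = ranking_to_pairs(ranked)

# Hypothetical scores that a reward model might assign to each answer.
rewards = {"answer_a": 1.2, "answer_b": 0.4, "answer_c": -0.9}
mean_loss = sum(pairwise_loss(rewards[w], rewards[l]) for w, l in pairs) / len(pairs)
```

Minimizing this loss pushes the reward model to agree with the human ranking; the trained reward model then scores new outputs in the final reinforcement learning loop.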
LoRA, which stands for Low-Rank Adaptation[3], is a method for fine-tuning Large Language Models (LLMs) more efficiently. Unlike traditional fine-tuning approaches that adjust all the weights of a neural network, LoRA keeps the pre-trained weights frozen and trains pairs of much smaller matrices whose product approximates the update to a larger weight matrix of the pre-trained LLM. This method has proven to be effective, sometimes even outperforming full fine-tuning, and it helps limit issues like catastrophic forgetting, where the pre-trained model loses its original knowledge during the fine-tuning process.
One of the key benefits of LoRA is that it’s computationally less demanding. Full fine-tuning of LLMs can be resource-intensive and time-consuming, making it a bottleneck for adapting these models to specific tasks. LoRA’s approach is more efficient, leading to faster training and lower computational requirements. For instance, with LoRA it’s possible to fine-tune a model on a GPU with only a limited amount of VRAM. How cheap adapting a foundation model has become was illustrated by a team at Stanford that fine-tuned Meta’s LLaMA foundation model: “We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On our preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$)”[4].
Furthermore, the trained weights from LoRA fine-tuning are significantly smaller. Since the original model remains frozen and only new layers are trained, the weights for these layers can be saved as a single, much smaller file. This makes it easier and more cost-effective to share fine-tuned models.
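The idea can be sketched in plain Python for a single weight matrix. This is an illustration under stated assumptions, not the PEFT implementation: the matrix sizes, the rank and the random initial values are arbitrary, and the names `lora_a` and `lora_b` are chosen here for the two small trainable matrices.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A @ B) without modifying the frozen W."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(brow, urow)] for brow, urow in zip(base, update)]

def count(matrix):
    """Number of parameters in a matrix."""
    return len(matrix) * len(matrix[0])

rng = random.Random(0)
d_in, d_out, rank = 64, 64, 4  # hypothetical sizes; real layers are far larger
w_frozen = [[rng.gauss(0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[rng.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op

x = [[rng.uniform(-1, 1) for _ in range(d_in)]]
y = lora_forward(x, w_frozen, lora_a, lora_b)

full_params = count(w_frozen)                  # 4096 weights stay frozen
lora_params = count(lora_a) + count(lora_b)    # only 512 weights are trained and saved
```

In this toy case the trainable part is 8 times smaller than the frozen matrix, and the gap widens as the layer grows while the rank stays small.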
In practice, LoRA has been implemented in Hugging Face’s Parameter Efficient Fine-Tuning (PEFT) library, which simplifies the process of using this technique. For instance, in the context of supervised fine-tuning, a pre-trained model can be further trained to generate text based on provided prompts, following a format that pairs prompts with responses. In essence, LoRA represents a significant step forward in making the fine-tuning of large language models more accessible and efficient, opening up new possibilities for customization and application in various fields.
Furthermore, foundation models demonstrated what’s known as few-shot or zero-shot learning, meaning they can perform tasks with little or no specific training on those tasks, without extensive fine-tuning or adaptation of their parameters, simply through suitable prompting.
Zero-Shot Learning: In zero-shot learning, the LLM is able to understand and perform a task it has never explicitly been trained to do. This is possible because of the extensive general knowledge the model has acquired during its initial training on a diverse and broad dataset. When presented with a new task, the model uses its understanding of language and context to infer what is being asked and how to respond appropriately. For example, a zero-shot learning model could be asked to summarize a text or answer a question about a topic it was never specifically trained on, and it would use its general understanding to attempt the task.
Few-Shot Learning: Few-shot learning refers to the model’s ability to learn a new task from a very small number of examples. Unlike traditional machine learning models that require large datasets to learn a new task, LLMs can understand and perform new tasks with just a few examples. This is often done by presenting the model with a prompt that includes a few examples of the task being performed. For instance, if you wanted a model to write poems in a certain style, you might show it two or three examples of such poems, and it would generate similar content.
These properties are key elements in the success of LLMs: both zero-shot and few-shot learning demonstrate the versatility and adaptability of LLMs. These models can apply their extensive pre-existing knowledge to new situations, making them powerful tools for a wide range of applications where gathering large, task-specific datasets is impractical or impossible.
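In practice, the difference between the two lies only in how the prompt is written. The sketch below builds both kinds of prompt; the "Input:/Output:" layout and the translation task are illustrative assumptions, since each model has its own preferred prompt format.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt; an empty example list gives a zero-shot prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

instruction = "Translate English to French."

# Zero-shot: the task description alone.
zero_shot = build_prompt(instruction, [], "cheese")

# Few-shot: the same task preceded by two worked examples.
few_shot = build_prompt(
    instruction,
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```

The model’s parameters are untouched in both cases; the examples only live in the prompt, inside the context window.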
Computing power is the fuel of generative AI
As of July 2023[5], GPT-4 has been reported to have 1.8 trillion parameters distributed across 120 layers, making it more than ten times larger than its predecessor, GPT-3. OpenAI seems to have utilized 16 separate models for GPT-4’s transformer module, each with approximately 111 billion parameters. The training of GPT-4 involved around 13 trillion tokens, encompassing both text-based and code-based data, and included some degree of fine-tuning. The cost for training GPT-4 was estimated at around $63 million, considering the computational power required and the duration of the training period. For inference purposes, GPT-4 operates on a cluster of 128 GPUs, employing techniques such as eight-way tensor parallelism and 16-way pipeline parallelism. GPT-4 also includes a vision encoder, which adds the capability to read webpages and transcribe images and videos. This is in addition to the text processing unit.
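The reported figures can be cross-checked with simple arithmetic. The sketch below only multiplies the numbers quoted above; the 175 billion parameters used for GPT-3 is its published size, and the remark about shared layers is a plausible reading, not a confirmed detail.

```python
experts = 16
params_per_expert = 111e9
expert_params = experts * params_per_expert  # about 1.78e12, close to the reported 1.8e12
# The small remainder would correspond to parts shared by all experts (assumption).

tensor_parallel = 8
pipeline_parallel = 16
gpus_per_replica = tensor_parallel * pipeline_parallel  # 128, matching the reported cluster

gpt3_params = 175e9
size_ratio = 1.8e12 / gpt3_params  # about 10.3, i.e. "more than ten times larger"
```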
From these numbers, it is clear that the evolution of computing power is crucial in the advancement of AI, from convolutional neural networks to large language models. The transition from AlexNet to GPT-3, for instance, highlights the growing demand for computational resources. More optimized computing architectures lead to greater efficiency; for example, moving from general-purpose CPUs to specialized GPUs has resulted in significant gains in operations per energy unit, and further specialization could yield even more efficiency.
Google’s development of a specialized AI architecture, known as the Tensor Processing Unit (TPU), is an example of gaining efficiency through specialization. The first TPU (TPUv1) was announced in 2016, built on a 28 nm process with a processing power of 92 tera operations per second on integer-8 operations and a power consumption of about 40 watts. The TPUv2, announced in May 2017, utilized 20 nm technology, achieving 45 tera operations per second and supporting a numerical format specialized for AI known as bfloat16, with an estimated power consumption of 200 to 250 watts. The TPUv3, released in May 2018, also supported bfloat16 and had an estimated power consumption of about 200 watts for 90 TOPS[6]. Google continued to innovate with TPUv4, offering 2.5 to 3.5 times more performance than TPUv3, and the cloud-based TPUv5, which is more efficient and scalable than its predecessor.
NVIDIA’s GPUs have become a central architecture in the world of artificial neural networks. The company began as early as 2006 to expand the functionality of its GPUs beyond just graphics and gaming, through its CUDA programming system. A turning point came in 2012, when researchers successfully employed NVIDIA GPUs to achieve a breakthrough in accuracy on visual recognition tasks (this was the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), as explained at the beginning of this article). This result demonstrated that GPUs could be effectively used for neural network development, and it was a harbinger of more recent innovations, like creating images based on textual descriptions. In response to this evolving field, NVIDIA’s CEO, Mr. Huang, emphasized in a commencement address at National Taiwan University[7] that the company had at that time redirected every facet of its operations to support and advance this emerging domain. Since then, NVIDIA has continuously enhanced its GPUs to support neural network-based artificial intelligence, including both convolutional neural networks and transformer-based systems.
One of their significant releases was the H100 GPU, which offered peak performance of 2000 TFLOPS for 16-bit floating-point operations and up to 4000 TOPS for 8-bit integer operations running on its Tensor Cores. The NVIDIA Hopper H100 Tensor Core GPU is fabricated using TSMC’s Custom NVIDIA 4N FinFET Process. The H100 GPU marked a substantial increase in performance capabilities. For instance, training a Mixture-of-Experts transformer with 395 billion parameters on a dataset of 1T tokens took only 20 hours on 8,000 H100 GPUs, compared to 7 days with the previous-generation A100.
Figure 8: Relation between the name of the technology node and its real size (from C.Reita, C.Fenouillet-Beranger – CEA-LETI – 2023 ). What NVIDIA called 4N process is supposed to be a variation of TSMC 5 nm.
In 2023, NVIDIA announced the H200 GPU based on the Hopper architecture. The H200 is a powerful GPU offering 141 GB of HBM3e memory at 4.8 TB/s, nearly doubling the capacity of the previous-generation H100. This increase in memory capacity and bandwidth allows for significantly improved performance. For example, the H100 GPU was observed to be 11 times faster than its predecessor, the A100, in tasks like inference with GPT-3 models with 175 billion parameters. Looking ahead to 2024, it’s anticipated that the upcoming H200 GPU will offer 18 times the speed of the A100, reflecting almost a 20-fold increase in performance over three years[8].
The market for AI accelerator chips (mainly on the server side) is very important, and several companies compete with NVIDIA, such as Cerebras (making wafer-scale accelerators), Groq, Etched, Graphcore, Gyrfalcon, IBM with NorthPole, AMD with the MI300, Intel with Gaudi3, etc. Current customers of NVIDIA’s GPUs are also developing their own accelerator chips, like Amazon’s Inferentia and Trainium2, Microsoft’s Maia, Google’s TPU, Alibaba’s Hanguang 800, …
The market estimate for AI accelerator chips from 2024 to 2030 indicates significant growth. According to Deloitte Insights[9], the market for specialized chips optimized for generative AI is projected to exceed US$50 billion in 2024. This forecast represents a major increase from near-zero levels in 2022 and is expected to account for a substantial portion of all AI chip sales in that year. Further, the total AI chip sales in 2024 are predicted to constitute 11% of the global chip market, estimated at US$576 billion.
Looking ahead to 2027, the AI chip market forecasts vary, with predictions ranging from US$110 billion to a more aggressive US$400 billion. However, there are viewpoints that suggest the more conservative estimates may be closer to reality, given various factors influencing the market. These estimates highlight the growing importance and demand for AI accelerator chips, driven by advancements in AI applications and technologies. However, it’s important to note that these projections are subject to change based on market dynamics, technological developments, and other influencing factors.
The emergence of open source (parameters) foundation models
Up until now, Large Language Models (LLMs) have primarily been running on cloud servers using NVIDIA GPUs or similar architectures, including more specialized architectures from various competitors. This is mainly due to the large size of foundation models: creating them requires a tremendous amount of computing and fast access to a very large set of information during the learning phase. The trend is still to build models with more parameters to increase their capabilities (but we will see that smaller models with different architectures, like mixture of experts, can achieve performance on tests and benchmarks comparable to that of models with 10x more parameters).
Figure 10: The always increasing size of foundation models. From https://lifearchitect.ai/models/
The cloud is also the main candidate for inference (use) of the fine-tuned versions of the foundation models, such as ChatGPT. However, hosting these large language models in the cloud has certain drawbacks. One major issue is data confidentiality: transferring data to the cloud raises privacy concerns. For instance, there were incidents where engineers from companies like Samsung allegedly shared sensitive corporate data with chatbot services like ChatGPT. Additionally, cloud server availability isn’t always guaranteed, limiting the accessibility and request capacity for advanced models like GPT-4, access to which also requires a $20 monthly subscription.
Given these issues, there’s a growing interest in running foundation models for LLMs locally within enterprises, to avoid sending sensitive information to the cloud. However, training these models is expensive. GPT-3, with its 175 billion parameters, had an estimated training cost of $1.8 million. Similarly, the European model Bloom had an estimated training cost of $2.3 million, and GPT-4’s estimated training cost is around $63 million.
The ecological impact of training these models is also significant, mainly due to CO2 emissions. For example, training GPT-3 in May 2020 was estimated to produce 502 tons of CO2, equivalent to the complete lifetime emissions of nearly 8 cars, including the fuel consumed. However, efficiency has improved over time. Meta’s OPT model, with 175 billion parameters, produced only 70 tons of CO2, a significant reduction (7x) in emissions in two years. In July 2022, the Bloom model, with 176 billion parameters, emitted only 25 tons of CO2, 20x less than GPT-3.
Figure 11: Training large foundation models is not cheap! From “2023 State of AI in 14 Charts” available at https://hai.stanford.edu/news/2023-state-ai-14-charts
In Europe, efforts have been made to develop models like Bloom, an open-source alternative to GPT-3. During one-year, from May 2021 to May 2022, more than 1,000 researchers from 60 countries and more than 250 institutions are creating together a very large multilingual neural network language model and a very large multilingual text dataset on the 28 petaflops Jean Zay (IDRIS) supercomputer located near Paris, France, which mostly uses nuclear energy. The energy generated by Jean Zay is repurposed for heating, contributing to its low CO2 footprint. Bloom was trained on 1.5 terabytes of text, it supports 43 languages and 16 programming languages, and took 118 days of training on 384 A100 GPUs. It also includes smaller versions tailored for instruction following, meaning it can answer questions and act as a chatbot[10].
The democratization of large language models (LLMs) was significantly advanced with the release of Meta’s LLaMA foundation models in February 2023. These models ranged from 7 billion to 64 billion parameters. Initially available only to researchers, they were leaked online a week after their announcement, sparking a surge in the development of fine-tuned large models.
In March 2023, Stanford launched Alpaca[5], a robust, easily replicable instruction-following model based on the LLaMA 7B model, fine-tuned with 52K instruction-following demonstrations. This model was cost-effectively fine-tuned for under $600, and its code was released on GitHub, further driving the development of an explosion of new models based on Meta’s leaked foundation model. The Figure 13 shows how LLaMA triggered a complete ecosystems of Open source models (in fact, we prefer to call them “open parameters” has the training database is not always provided - this is the case for LLaMA and LLaMA-2).
On July 18, 2023, Meta introduced LLaMA-2, a model free for both research and commercial use and easily accessible in repositories. HuggingFace, a New York-based company founded by French entrepreneurs, emerged as a main repository for open LLMs. HuggingFace is akin to GitHub for the AI community, hosting over 450,000 models from universities and companies like Meta, Microsoft, Stability.ai, Mistral.ai and OpenAI. The community and HuggingFace have made it possible to download and run models locally on personal computers. Most computers can run models up to 13 billion parameters. Running larger models requires GPUs with larger memory. For example, a 13 billion parameter LLaMA-2 model could run on a Mac Mini with comparable speed to ChatGPT while consuming only 21W.
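To see why model size determines which machines can run a model locally, a rough estimate of the memory needed for the weights alone is useful. The sketch below is a back-of-the-envelope calculation under our own assumptions: it ignores activations and the attention cache, and the bit widths (16, 8 and 4 bits per weight) are illustrative quantization levels that are not taken from the text.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights only, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 13e9  # a 13 billion parameter model, as in the example above
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{bits:>2} bits per weight: about {gb:.1f} GB")
```

Under these assumptions, a 13 billion parameter model needs about 26 GB at 16 bits per weight but only about 6.5 GB at 4 bits, which is why reduced-precision formats matter for personal computers.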
Figure 12: CO2 impact for the creation of foundation models. From “2023 State of AI in 14 Charts” available at https://hai.stanford.edu/news/2023-state-ai-14-charts
The capability to run LLMs locally on smartphones has led to many developments. Companies like Meta are working (with Qualcomm) to optimize the execution of models like LLaMA-2 directly on smartphone chips. On-device AI offers several advantages, including cost reduction (no entrance fee or cloud costs), improved reliability (it is always available and does not depend on the load of the servers) and performance, and enhanced privacy and security (everything runs locally, so no information leaves the smartphone). It also allows for personalization, such as digital assistants that can be refined locally and adapted to the user.
In October 2023, Qualcomm announced[11] its Snapdragon 8 Gen 3[7], with AI capabilities to run multimodal generative AI models, including popular large language and vision models and transformer networks for speech recognition, of up to 10 billion parameters solely on the device. The Snapdragon 8 Gen 3 will be available in smartphones sold in 2024. The estimated power consumption of the SoC is lower than 10W (to be confirmed).
Figure 13 : An evolutionary graph of the research work conducted on LLaMA. Due to the huge number, we cannot include all the LLaMA variants in this figure, even much excellent work. From [6], also available at https://github.com/RUCAIBox/LLMSurvey
MediaTek also announced[12] its Dimensity 9300 chip[8] in November 2023, capable of running a 7 billion parameter large language model at 20 tokens per second. It can also run a 13 GB model compressed to 5 GB to fit into RAM. The Dimensity 9300 supports NeuroPilot Fusion (which can continuously perform LoRA low-rank adaptation), enabling fine-tuning on the foundation model locally on the smartphone.
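Low-rank adaptation (LoRA), mentioned above, keeps the original weight matrix frozen and trains only two small matrices whose product is added to it, which is what makes on-device fine-tuning plausible. The sketch below illustrates the general technique only; it says nothing about how NeuroPilot Fusion implements it, and the layer size (4096 x 4096), rank (8) and scaling factor are hypothetical values chosen for illustration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in layer."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights trained by LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha: float, rank: int):
    """Return W + (alpha / rank) * (B @ A); W itself is never modified."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    d, r = 4096, 8  # hypothetical layer width and LoRA rank
    print("dense layer weights:", full_params(d, d))
    print("LoRA trainable weights:", lora_trainable_params(d, d, r))
    print("ratio:", full_params(d, d) // lora_trainable_params(d, d, r))
```

With these hypothetical sizes, LoRA trains 256 times fewer weights than updating the full layer, at the cost of restricting the update to rank 8.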
From these announcements, it is clear that generative AI, and AI in general, will be the key selling factor of 2024 high-end smartphones, and that LLMs or other models with up to 10B parameters will be able to run on smartphones, jeopardizing part of the market for AI in the cloud.
Smaller foundation models are more and more efficient, and can already run on devices like smartphones
Smaller LLMs, such as those with 7 or 10 billion parameters, may not perform at the same level as models with 175 billion parameters or more. However, there have been algorithmic improvements since 2012, leading to a decrease in the computational power needed to train a neural network to achieve similar performance. The compute required has been observed to halve approximately every 16 months, a trend akin to a law for neural networks. For instance, training GPT-3 in May 2020 required seven times more computational power than training Meta’s OPT model two years later in May 2022 for comparable performance.
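The 16-month figure can be cross-checked against the numbers in the Figure 15 caption (44x less compute after 7 years). The sketch below is only that consistency check; the function names are ours, and it assumes a constant exponential trend.

```python
import math


def halving_time_months(efficiency_gain: float, elapsed_months: float) -> float:
    """Months needed to halve the required compute, assuming a steady exponential trend."""
    return elapsed_months / math.log2(efficiency_gain)


def compute_fraction(elapsed_months: float, halving_months: float = 16.0) -> float:
    """Fraction of the original compute still needed after elapsed_months."""
    return 0.5 ** (elapsed_months / halving_months)


if __name__ == "__main__":
    months = halving_time_months(efficiency_gain=44, elapsed_months=7 * 12)
    print(f"44x over 7 years implies a halving time of about {months:.1f} months")
    print(f"after 4 years at a 16-month halving time: {compute_fraction(48):.3f} of the compute")
```

A 44x gain over 84 months corresponds to a halving time of a little over 15 months, consistent with the "approximately every 16 months" stated above.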
Figure 15: Total amount of compute in teraflops/s-days used to train to AlexNet level performance. 44x less compute required to get to AlexNet performance 7 years later. From https://openai.com/research/ai-and-efficiency
Meta AI introduced the LLaMA Model Series, designed for high performance within specific computing budgets.
The strategy to create these models involved training smaller models on more data and for longer periods to enhance performance.
The largest model in this series had 65 billion parameters trained on 1.4 trillion tokens. Other models had 7 billion and 13 billion parameters trained on 1 trillion tokens.
These models were distributed under a non-commercial license.
The LLaMA-2 models that followed were trained on 2 trillion tokens from publicly available data.
The fine-tuning process included Reinforcement Learning from Human Feedback (RLHF), known as the alignment procedure.
They were free for research and commercial use.
Mistral AI, a new Gen AI startup, released the Mistral-7B model.
The model was trained on an undisclosed number of tokens extracted from the open web.
Mistral-7B introduced new architecture choices, including Grouped-Query Attention, Sliding-window Attention, and Byte-fallback BPE tokenizer (that ensures that characters are never mapped to out of vocabulary tokens). The model was released in Base and Instruct Versions and licensed under Apache-2.0.
01.AI from China introduced the Yi Model family.
The family included two model sizes, with 6 billion and 34 billion parameters.
The models were bilingual, trained on both English and Chinese, using 3 trillion tokens.
Yi-34/6B-200K supported a context window of up to 200,000 tokens and topped the Open LLM leaderboard and new benchmarks.
Upstage AI from Korea released the SOLAR-10.7B model.
The model, with approximately 11 billion parameters, used a new merging technique called "Depth Up-Scaling."
Its architecture was based on Llama2 (7 billion with Mistral 7 billion weights), with two 7 billion models stacked together.
The training data for this model, both open and proprietary, was undisclosed.
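Among the architecture choices listed above for Mistral-7B, sliding-window attention is easy to illustrate: each token attends only to itself and a fixed number of preceding tokens rather than to the whole history, which bounds the cost per token. The sketch below builds such a mask in plain Python; it illustrates the general idea and is not Mistral's implementation, and the sequence length and window size are hypothetical.

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[i][j] is True when token i may attend to token j.

    Token i sees itself and the (window - 1) tokens before it, never the future.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask) -> int:
    """Count the (query, key) pairs that are allowed by the mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = sliding_window_mask(seq_len=5, window=3)  # hypothetical sizes
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    print("allowed pairs:", attended_pairs(mask))
```

With a window as long as the sequence, the mask reduces to ordinary causal attention; a shorter window removes the oldest positions from each row.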
Similarly, smaller yet more powerful LLMs are becoming suited for smart devices. New models derived from LLaMA 7B or Mistral 7B are now freely available for download and offer benchmark results close to those of models with 25 times more parameters, like GPT-3.5. For example, Zephyr, a fine-tuned version of Mistral’s 7B model, and OpenChat 3.5, a model fine-tuned using C-RLFT (conditioned reinforcement learning fine-tuning) with Mistral 7B as the base, both show impressive performance in some benchmarks.
Figure 16: Some "small" models with good results
The competition in the field is intense, with the goal being to create smaller, more efficient LLMs that maintain high performance. New techniques are emerging constantly. For instance, after the success of the ‘mixture of experts’ approach (demonstrated by Mistral.ai’s Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights), technical developments like ‘depth up-scaling (DUS)’ have enabled more efficient LLMs. Depth up-scaling simplifies the upscaling process of LLMs, as demonstrated by the SOLAR 10.7 billion parameter LLM (its architecture is based on Llama2 7 billion with Mistral 7 billion weights, with two 7 billion models stacked together), which shows strong performance in various natural language processing tasks.
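The "two models stacked together" idea behind depth up-scaling can be sketched as a simple operation on a list of layers: duplicate the base model, drop some layers at the seam of each copy, and concatenate what remains. The sketch below is a minimal illustration, not the SOLAR training recipe. The layer counts (a 32-layer base with 8 layers removed from each copy) are assumptions for illustration that the text does not give, and the continued pre-training that normally follows the stacking is not shown.

```python
def depth_up_scale(layers, removed_per_copy: int):
    """Stack two copies of a model, trimming layers at the seam.

    The first copy loses its last `removed_per_copy` layers and the second
    copy loses its first `removed_per_copy` layers before concatenation.
    """
    if not 0 <= removed_per_copy < len(layers):
        raise ValueError("removed_per_copy must be between 0 and len(layers) - 1")
    first_copy = layers[: len(layers) - removed_per_copy]
    second_copy = layers[removed_per_copy:]
    return first_copy + second_copy


if __name__ == "__main__":
    base = [f"layer_{i}" for i in range(32)]  # hypothetical 32-layer base model
    scaled = depth_up_scale(base, removed_per_copy=8)
    print(len(base), "layers ->", len(scaled), "layers")
```

Because the stacked model reuses pretrained weights, it starts from a much better point than a randomly initialized network of the same depth.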
Multiple optimizations are also contributing to the ability to run LLMs and transformer-based neural networks on devices with lower power and performance. For example, as shown in Figure 18, optimizing a model to run on a CPU with efficient data management can improve efficiency by a factor of 6.7. Transitioning to GPU and leveraging specific floating-point operations further enhances performance. Additionally, optimizing the transformer architecture can lead to even greater performance gains. Overall, by optimizing both platforms and algorithms, a significant improvement of a factor of 810x (nearly 3 orders of magnitude!) can be achieved.
Figure 17 : Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) was released by Mistral.ai in December 2023. Mixtral outperforms LLaMA 2 70B on most benchmarks with 6x faster inference. In particular, it matches or outperforms GPT3.5 on most standard benchmarks. From https://mistral.ai/news/mixtral-of-experts/
With the development of SoCs able to execute (and even fine-tune) models of up to tens of billions of parameters, and the fact that models of this size can be fine-tuned or architected to perform as well as larger models[13], it is clear that we are entering the era of the “continuum of computing”, where some AI tasks will be done locally at the edge, and others still in the cloud. With the progress in model design and in hardware accelerators, (generative) artificial intelligence will run on ever “smaller” devices in the future.
Figure 18: Carole-Jean Wu, Meta, “Designing computer systems for sustainability”, ACACES, Italy, July 2023.
Conclusion and recommendations
The launch of ChatGPT in November 2022 marked a significant milestone in the world of technology, creating a substantial impact. The recent advancements in AI, particularly in generative AI based on transformer technology, heavily depend on hardware technologies. Progress in foundation models and new model architectures enables smaller models to be efficient for practical tasks. Technology is being propelled forward by the development of new accelerators, which enable smaller models to operate on edge devices.
Looking ahead, the evolution of language models is poised to take an exciting turn. Future language models are expected to move beyond learning words and texts to embracing multi-modality, learning through images and other modalities. This evolution will likely involve a combination of more specialized models with a 'mixture of experts' approach, moving away from monolithic models towards collaborative edge systems. These advanced models are also being linked to digital twins to experience and understand the laws of physics, as exemplified by applications like Isaac Sim[14]. This proactive approach allows models to directly interact with and learn from the world, potentially leading to their integration into robots and other autonomous systems.
Therefore, it is clear that Europe should take a leading role in the development of foundational AI models, with a focus on aligning them with "European" values. This involves creating and disseminating methodologies and datasets specifically tailored to meet regional requirements, ensuring AI sovereignty and promoting a digital economy reflective of European standards and ethics. In addition, there's an emphasis on promoting open-source models to foster a collaborative AI ecosystem with open access to AI resources. This approach nurtures a culture of shared progress. Furthermore, Europe aims to spearhead the integration of Large Language Models (LLMs) into smart devices, focusing on local solutions and specialized accelerators. This strategy will empower real-time AI applications on edge devices, reduce reliance on centralized data centers, enhance privacy, and increase efficiency. The plan includes supporting both startups and established companies to develop edge AI capabilities, thereby building a decentralized and resilient AI infrastructure.
AUTHOR
Marc Duranton is a researcher in the research and technology department at CEA (the French Atomic Energy Commission) and the coordinator of the HiPEAC Vision 2024.
Slightly helped, prompted and organized by Marc Duranton. Most of the text was generated by Chat-GPT (GPT-4 version). As stated in an earlier HiPEAC Vision, we forecast that at some point in time, the HiPEAC Vision could be mainly written by AI-driven tools. This is an early attempt. We kept the style and repetitions generated by the chatbot on purpose. ↩︎
OpenAI was founded in 2015 by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, Jessica Livingston, John Schulman, Pamela Vagata, and Wojciech Zaremba, with Sam Altman and Elon Musk serving as the initial board members. ↩︎
The reader should have noted that this part was generated by Chat-GPT itself. ↩︎
For more details, see https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback ↩︎
According to https://www.semianalysis.com/p/gpt-4-architecture-infrastructure ↩︎
According to https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/ ↩︎
ChatGPT gave the following reference: https://www2.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2024/generative-ai-chip-market-to-reach-40-billion-in-2024.html ↩︎
https://www.qualcomm.com/products/mobile/snapdragon/smartphones/snapdragon-8-series-mobile-platforms/snapdragon-8-gen-3-mobile-platform ↩︎
From https://www.mediatek.com/products/smartphones-2/mediatek-dimensity-9300 ↩︎
On some benchmarks. However, several specialized models can work together, thereby enlarging the global capabilities of this set of models. This trend of having a set of several more specialized models working together is in line with the ideas of the Next Computing Paradigm, where a federation of smart agents works collectively to achieve a goal that one agent cannot achieve alone. ↩︎