Roman Kyslyi: Large Language Models: Overview, Challenges and Solutions

  1. Large Language Models: Overview and Challenges
  2. What is an LLM? • It is a probabilistic model that is able to predict the next word in a sequence, given the words that precede it (see the sketch below)
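A minimal sketch of this next-word view, assuming the Hugging Face transformers library and the small GPT-2 checkpoint purely for illustration (neither is prescribed by the slides): the model assigns a probability to every candidate next token given the preceding words.

```python
# Minimal sketch: inspect next-word probabilities of a small pretrained causal LM.
# "gpt2" is used here only because it is small; the idea is the same for any LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models predict the next"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([token_id])!r:>12}  p={prob:.3f}")
```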
  3. A bit of history • Linguistic features • RNN (LSTM)
  4. Transformers • "Attention Is All You Need" • Encoder-decoder architecture • Processes the input sequence as a whole (attention sketch below)
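As an illustration of the attention mechanism the slide refers to, here is a generic scaled dot-product attention sketch in PyTorch; the tensor shapes and the toy input are assumptions, not code from the talk.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); every position attends to the whole sequence at once
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention weights
    return weights @ v

q = k = v = torch.randn(1, 6, 64)                        # toy input: 6 tokens, d_k = 64
print(scaled_dot_product_attention(q, k, v).shape)       # torch.Size([1, 6, 64])
```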
  5. BERT (Bidirectional Encoder Representations from Transformers) • BERT is pre-trained using a masked language modeling (MLM) task • GPT is pre-trained using a language modeling task, where the model is trained to predict the next word in a sequence of text • Bi-directional (BERT) vs. uni-directional (GPT) context; both objectives are sketched below
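A small sketch contrasting the two pre-training objectives, assuming the transformers pipelines and the bert-base-uncased / gpt2 checkpoints as illustrative stand-ins: BERT fills in a masked token using context on both sides, while a GPT-style model only continues the text left to right.

```python
from transformers import pipeline

# Masked language modeling: BERT predicts the [MASK] token using both directions of context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK].")[:3]:
    print("BERT:", pred["token_str"], round(pred["score"], 3))

# Causal language modeling: a GPT-style model only extends the text to the right.
generate = pipeline("text-generation", model="gpt2")
print("GPT :", generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```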
  6. GPT (Generative Pre-trained Transformer) • GPT is a large-scale language model that is pre-trained on a massive amount of text data using a transformer-based architecture. • GPT uses a transformer-based architecture, which allows it to model complex relationships between words and capture long-term dependencies in a sequence of text. • Despite its impressive performance, GPT still has some limitations, such as a tendency to generate biased or offensive text when trained on biased or offensive data.
  7. GPT-J • GPT-J is one of the large language models, with 6 billion parameters • GPT-J is pre-trained on a massive amount of text data (roughly 820 GB), similar to other GPT models • GPT-J is open-source, which means the code and pre-trained weights are freely available (loading sketch below)
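Because the weights are openly released, GPT-J can be loaded directly; a hedged sketch using the Hugging Face checkpoint EleutherAI/gpt-j-6B (float16 and device_map are optional memory-saving knobs, and accelerate is needed for the latter):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    torch_dtype=torch.float16,   # half precision roughly halves memory vs. float32
    device_map="auto",           # spreads layers over available devices (requires accelerate)
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```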
  8. LLaMA (Large Language Model Meta AI) • The LLaMA framework breaks a large language model down into smaller components, which are optimized separately and then combined to create a larger model • The LLaMA framework supports per-channel quantization • "Layer dropping" • "Knowledge distillation"
  9. LLaMA in detail • In LLaMA, the model is divided into multiple sub-models, which are trained independently using different optimization techniques. Each sub-model is designed to capture specific aspects of language, such as grammar, syntax, or semantics. Once the sub-models are trained, they are combined into a larger model by "associating" them together using a set of weights. • The weights in the LLaMA model determine how much each sub-model contributes to the overall prediction. • One approach to creating sub-models is "layer dropping", where individual layers in the model are dropped during training to create smaller sub-models. • A common approach is to use a weighted average of the outputs of the sub-models (see the sketch below)
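A minimal sketch of the weighted-average combination described on this slide; the sub-models, their outputs, and the mixing weights below are placeholders for illustration, not an actual LLaMA implementation.

```python
import torch

def combine_submodels(prob_list, weights):
    # prob_list: list of (vocab_size,) next-token distributions, one per sub-model
    # weights:   (num_submodels,) mixing weights that sum to 1
    stacked = torch.stack(prob_list)                     # (num_submodels, vocab_size)
    return (weights.unsqueeze(-1) * stacked).sum(dim=0)  # weighted average, (vocab_size,)

vocab_size = 10
submodel_outputs = [torch.softmax(torch.randn(vocab_size), dim=-1) for _ in range(3)]
weights = torch.tensor([0.5, 0.3, 0.2])                  # how much each sub-model contributes

combined = combine_submodels(submodel_outputs, weights)
print(combined.sum())                                    # still a valid distribution (~1.0)
```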
  10. Fine-tuning • Fine-tuning for LLMs involves adapting the pre-trained model to a specific task or domain by training it on a new dataset. • Fine-tuning of GPT-J
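A compact, generic fine-tuning sketch for a causal LM; the model name, toy data, and hyperparameters are placeholders rather than the exact GPT-J recipe from the talk (GPT-J needs the same loop, just far more memory and compute).

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # small stand-in; swap in "EleutherAI/gpt-j-6B" given enough GPU memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Tiny toy "domain" dataset; in practice this is the task- or domain-specific corpus.
texts = [
    "Question: What is an LLM? Answer: A model that predicts the next token.",
    "Question: What is fine-tuning? Answer: Continued training on task data.",
]

optimizer = AdamW(model.parameters(), lr=5e-5)
for epoch in range(2):
    for text in texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss on the new data
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```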
  11. Alpaca • Fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations • Behaves qualitatively similarly to OpenAI’s text-davinci-003 • Cheap to reproduce (<$600)
  12. Few-shot learning • A Machine Learning framework that enables a pre-trained model to generalize over new categories of data (which the pre-trained model has not seen during training) using only a few labeled samples per class. • Support Set: the few labeled samples per novel category of data that a pre-trained model will use to generalize to these new classes. • The requirement for large volumes of costly labeled data is removed, because the aim is to generalize using only a few labeled samples. • Since a pre-trained model is extended to new categories of data, there is no need to re-train a model from scratch, which saves a lot of computational power. • Even if the model has been pre-trained on a statistically different distribution of data, it can be extended to other data domains, as long as the data in the support and query sets are coherent. (A sketch of the support/query setup follows below.)
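A sketch of the support/query idea described above: a frozen pre-trained encoder embeds a few labelled support samples per novel class, each class gets a prototype (its mean embedding), and a query is assigned to the nearest prototype. The embeddings below are random stand-ins for the output of such an encoder.

```python
import torch

def classify_query(query_emb, support_embs, support_labels, num_classes):
    # support_embs: (n_support, dim), support_labels: (n_support,)
    prototypes = torch.stack([
        support_embs[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                                       # (num_classes, dim)
    dists = torch.cdist(query_emb.unsqueeze(0), prototypes)  # (1, num_classes)
    return dists.argmin().item()

dim, num_classes, shots = 16, 3, 5
class_means = torch.randn(num_classes, dim) * 5              # well-separated novel classes
support_labels = torch.arange(num_classes).repeat_interleave(shots)
support_embs = class_means[support_labels] + torch.randn(num_classes * shots, dim)
query_emb = class_means[1] + torch.randn(dim)                # a query drawn from class 1

print(classify_query(query_emb, support_embs, support_labels, num_classes))  # 1 (with high probability)
```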
  13. Quantization • It is the process of mapping continuous infinite values to a smaller set of discrete finite values • In ML: converting the weights and activations of the model from their original high-precision format (such as 32-bit floating point numbers) to a lower-precision format (such as 8-bit integers) • Bitsandbytes library • Colab with GPT-J
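A hedged sketch of 8-bit loading via the bitsandbytes integration in transformers, along the lines of the GPT-J Colab mentioned on the slide (the exact notebook is not reproduced here); it needs the bitsandbytes and accelerate packages and a GPU.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,    # store weights as 8-bit integers instead of 16/32-bit floats
    device_map="auto",
)

inputs = tokenizer("Quantization lets large models", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```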
  14. Running locally • C++ wrapper • 7B model • 13 GB -> 4 GB
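A back-of-the-envelope check of the numbers on this slide, assuming roughly 16-bit weights before and about 4 bits per weight after quantization (the specific C++ wrapper is not named on the slide, so that part is left unspecified):

```python
params = 7e9                        # a 7B-parameter model

fp16_gb = params * 16 / 8 / 1e9     # 2 bytes per parameter
int4_gb = params * 4.5 / 8 / 1e9    # ~4-bit weights plus a little overhead for scales

print(f"fp16 : {fp16_gb:.1f} GB")   # ~14 GB, close to the 13 GB on the slide
print(f"4-bit: {int4_gb:.1f} GB")   # ~3.9 GB, close to the 4 GB on the slide
```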
  15. Future of LLMs? • Supervised learning (SL) requires large numbers of labeled samples. • Reinforcement learning (RL) requires insane amounts of trials. • Self-Supervised Learning (SSL) requires large numbers of unlabeled samples. • LLMs: • Output one text token after another • They make stupid mistakes • LLMs have no knowledge of the underlying reality • Letter to pause experiments
  16. Thank you