GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront

Forefront
5 min read · Oct 14, 2021


What is GPT-J-6B?

GPT-J-6B is an open source, autoregressive language model created by a group of researchers called EleutherAI. It’s one of the most advanced alternatives to OpenAI’s GPT-3 and performs well on a wide array of natural language tasks such as chat, summarization, and question answering, to name a few.

For a deeper dive, GPT-J is a transformer model trained using Ben Wang’s Mesh Transformer JAX. “GPT” is short for generative pre-trained transformer, “J” distinguishes this model from other GPT models, and “6B” represents the 6 billion trainable parameters. Transformers are increasingly the model of choice for NLP problems, replacing recurrent neural network (RNN) models such as long short-term memory (LSTM). The additional training parallelization allows training on larger datasets than was once possible.

The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
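
If you want to check these hyperparameters yourself, the published configuration exposes them. The sketch below is a minimal example assuming the Hugging Face transformers library (GPT-J support landed in v4.12) and the EleutherAI/gpt-j-6B checkpoint on the Hugging Face Hub; the attribute names are the ones that library uses, not those of Mesh Transformer JAX.

```python
# Minimal sketch: inspect GPT-J's published hyperparameters.
# Assumes the Hugging Face transformers library (>= 4.12) and access to the
# EleutherAI/gpt-j-6B checkpoint on the Hub.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")

print(config.n_layer)      # 28   -> transformer layers
print(config.n_embd)       # 4096 -> model dimension (16 heads x 256 dims each)
print(config.n_head)       # 16   -> attention heads
print(config.rotary_dim)   # 64   -> head dimensions that receive RoPE
print(config.n_positions)  # 2048 -> maximum context length in tokens
```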

Training data

GPT-J was trained on the Pile, a large-scale curated dataset created by EleutherAI.

Training procedure

GPT-J was trained for 402 billion tokens over 383,500 steps on a TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
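
In practice, “autoregressive with cross-entropy loss” means every position in a training sequence is asked to predict the token that follows it. The PyTorch-style sketch below is only illustrative (GPT-J itself was trained with Mesh Transformer JAX), but the objective is the same.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Autoregressive cross-entropy loss.

    logits:    (batch, seq_len, vocab_size) raw model outputs
    input_ids: (batch, seq_len) the token ids fed to the model
    Position t is trained to predict the token at position t + 1.
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. n-2
    shift_labels = input_ids[:, 1:]    # targets: the tokens that actually follow
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```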

Intended Use

GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at generating text from a prompt, since the core functionality of GPT-J is to take a string of text and predict the next token. When prompting GPT-J, it is important to remember that the model will return the statistically most likely continuation of your prompt, not necessarily the most accurate or helpful one.
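
For example, generating text from a prompt with the open source checkpoint looks roughly like this. This is a sketch assuming the Hugging Face transformers implementation; on a hosted platform like Forefront you would call an API endpoint instead, and parameters such as temperature and output length will vary by use case.

```python
# Minimal sketch: generate text from a prompt with the open source checkpoint.
# Assumes a recent version of transformers (GPT-J support landed in v4.12);
# loading the full-precision weights takes roughly 24 GB of memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "The benefits of open source language models include"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=50,   # how many tokens to append to the prompt
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,     # lower values give more deterministic completions
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```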

Use cases

GPT-J can perform a variety of language processing tasks without any further training, including tasks it was never explicitly trained for. It can be used for many different use cases such as language translation, code completion, chat, and blog post writing. Through fine-tuning (discussed later), GPT-J can be further specialized on any task to significantly increase performance.

Let’s look at some example tasks (a short prompt sketch follows the list):

Chat: open-ended conversations with an AI support agent.

English to French: translate English text into French.

Parse unstructured data: create tables from long-form text by specifying a structure and supplying some examples.

Translate SQL: translate natural language to SQL queries.

Python to natural language: explain a piece of Python code in plain, human-readable language.
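
Under the hood, each of these tasks is just a prompt written so that the desired output is the most likely continuation. Below is a hypothetical few-shot prompt for the English-to-French example; the exact formatting is an illustrative assumption, not a required template.

```python
# Hypothetical few-shot prompt for English-to-French translation.
# The model simply continues the pattern, so the final line is left for it
# to complete; "English:" makes a natural stop sequence for the output.
prompt = """English: Where is the nearest train station?
French: Où est la gare la plus proche ?

English: I would like a cup of coffee, please.
French: Je voudrais une tasse de café, s'il vous plaît.

English: The meeting starts at ten tomorrow morning.
French:"""
```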

As you can tell, the standard GPT-J model adapts and performs well on a number of different NLP tasks. However, things get more interesting when you explore fine-tuning.

Fine-tuning GPT-J

While the standard GPT-J model is proficient at performing many different tasks, the model’s capabilities improve significantly when fine-tuned. Fine-tuning refers to the practice of further training GPT-J on a dataset for a specific task. While scaling parameters of transformer models consistently yields performance improvements, the contribution of additional examples of a specific task can greatly improve performance beyond what additional parameters can provide. Especially for use cases like classification, extractive question answering, and multiple choice, collecting a few hundred examples is often “worth” billions of parameters.

To see what fine-tuning looks like, here’s a demo (2m 33s) on how to fine-tune GPT-J on Forefront. There are two variables to fine-tuning that, when done correctly, can lead to GPT-J outperforming GPT-3 Davinci (175B parameters) on a variety of tasks. Those variables are the dataset and training duration.

Dataset

For a comprehensive tutorial on preparing a dataset to fine-tune GPT-J, check out our guide.

At a high level, the following best practices should be considered regardless of your task (a short preparation sketch follows the list):

  • Your data must be in a single text file.
  • You should provide at least one hundred high-quality examples, ideally vetted by a human knowledgeable in the given task.
  • You should use some kind of separator at the end of each prompt + completion in your dataset to make it clear to the model where each training example begins and ends. A simple separator which works well in almost all cases is "<|endoftext|>".
  • Ensure that the prompt + completion doesn’t exceed 2048 tokens, including the separator.
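
Here’s a minimal sketch of what preparing such a file could look like, using the GPT-2 BPE tokenizer (which GPT-J shares) to enforce the 2048-token limit. The example data and file layout are illustrative assumptions, not a required Forefront format.

```python
# Sketch: write prompt + completion pairs to a single text file with a
# separator, skipping any example that would exceed the 2048-token context.
from transformers import AutoTokenizer

SEPARATOR = "<|endoftext|>"
MAX_TOKENS = 2048

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # same BPE vocabulary as GPT-J

examples = [  # illustrative examples; replace with your own data
    {"prompt": "Message: My card was charged twice.\nTopic:", "completion": " Billing"},
    {"prompt": "Message: The app crashes on startup.\nTopic:", "completion": " Bug report"},
]

with open("train.txt", "w", encoding="utf-8") as f:
    for ex in examples:
        text = ex["prompt"] + ex["completion"] + SEPARATOR
        if len(tokenizer.encode(text)) > MAX_TOKENS:
            continue  # drop examples that don't fit in the context window
        f.write(text + "\n")
```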

Let’s look at some example datasets:

Classification: classify customer support messages by topic.

Sentiment Analysis: analyze sentiment for product reviews.

Idea Generation: generate blog ideas given a company’s name and product description.

Training duration

The duration you should fine-tune for largely depends on your task and the number of training examples in your dataset. For smaller datasets, fine-tuning 5–10 minutes for every 100KB is a good place to start. For larger datasets, fine-tuning 45–60 minutes for every 10MB is recommended. These are rough rules of thumb, and more complex tasks will require longer training durations.
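
If it helps to see the rule of thumb as numbers, here is a literal, hedged translation of the guidance above. The 1MB cutoff between “smaller” and “larger” datasets is an assumption, and the output is only a starting point; the right duration is best found by evaluating checkpoints.

```python
# Rough starting-point estimate for fine-tuning duration, translating the
# rules of thumb above: 5-10 min per 100 KB for small datasets, 45-60 min
# per 10 MB for larger ones. The 1 MB threshold is an assumption.
def suggested_minutes(dataset_bytes: int) -> tuple[float, float]:
    if dataset_bytes < 1_000_000:           # "smaller" datasets: 100 KB rule
        units = dataset_bytes / 100_000
        return 5 * units, 10 * units
    units = dataset_bytes / 10_000_000      # "larger" datasets: 10 MB rule
    return 45 * units, 60 * units

print(suggested_minutes(300_000))     # ~(15.0, 30.0) minutes for a 300 KB file
print(suggested_minutes(20_000_000))  # ~(90.0, 120.0) minutes for a 20 MB file
```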

Deploying GPT-J

GPT-J is notoriously difficult and expensive to deploy in production. When considering deployment options, there are two things to keep in mind: cost and response speed. The most common hardware for deploying GPT-J is a T4, V100, or TPU, all of which come with less than ideal tradeoffs. At Forefront, we experienced these undesirable tradeoffs and started to experiment to see what we could do about it. Several low-level machine code optimizations later, we built a one-click GPT-J deployment offering the best cost, performance, and throughput available. Here’s a quadrant comparing the different deployment methods by cost and response speed:

Large transformer language models like GPT-J are increasingly being used for a variety of tasks, and further experimentation will inevitably surface more use cases where these models prove effective. At Forefront, we believe providing a simple experience to fine-tune and deploy GPT-J can help companies easily enhance their products with minimal work required. To start using GPT-J, get in touch with our team.

Originally published at https://www.helloforefront.com.
