msc placeholder: LLM as a Google alternative #7438
We currently want to decide about two things:
Ultimately, injecting databases to LLMs seems really interesting to me. I like the idea of extending LLMs with fact loading and enabling them to reference their sources. Therefore, this kind of direction seems perfect for the thesis. What do you think? @synctext what would be an ideal literature survey topic to help me gain knowledge towards that direction?
The above proposal looks interesting. However, I don't understand why you linked to the Bitcoinlib docs. Any papers you could point me to for the survey? |
Hey, Rowdy here. Great that you'll be helping out. The superapp is, frankly speaking, a bit of a mess. Please reach out to me by email ([email protected]) to arrange a meeting to discuss the superapp. The last day we can meet face-to-face is the 18th of July; after that, it will have to be remote. The current suspects for causing issues within the superapp:
Also, there are no e2e tests: we could use Espresso tests for the app. |
Discussed focus of survey, summer job, and thesis. Let's do Kotlin 🚀
|
I created a parent issue just for my summer work on the superapp: From now on, I'll be exposing my findings and progress regarding the superapp there. |
Papers I found on data-augmentation of GPT LLMs:
|
Literature Survey: Augmenting LLMs with Knowledge Retrieval

Overleaf Project: https://www.overleaf.com/read/fwyqhjskmdrc

I've been reading through a number of papers, the most recent one being Internet-Augmented Dialogue Generation, by Facebook AI Research. This paper proposes a system that combines:
It provides a nice overview of different methods of knowledge retrieval (using neural networks and an unstructured knowledge base), and it also cites the original papers:
I plan to read through these papers by August 20th and write informative summaries for each of the methods.

One paper that summarizes all of the above (FiD and RAG) is:

There are also a number of papers on augmenting LLMs with a structured knowledge base (graph):
Google Bard

Google's AI experiment is called Bard. It uses knowledge retrieval and is inspired by the following two papers: |
Summary of paper about RAG (Retrieval Augmented Generation): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

Preliminaries
|
Summary of Paper about FiD (Fusion in Decoder): Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Preliminaries

Generative Models vs Extractive Models

Generative Models are trained to produce new text. They do this by learning the statistical relationships between words and phrases in a large corpus of text. When given a prompt, a generative model will try to produce text that is consistent with the statistical patterns it has learned.

NOTE: The authors of this paper interestingly found that generative models become better and more accurate as the size of the text database increases, contrary to extractive models.

Extractive Models are trained to find specific pieces of information in a text, such as the answer to a question or the main points of a passage. When given a query, an extractive model will return the parts of the text (spans) that it believes are relevant to the query.

Spans

Spans are pieces of text that are likely to be the answer to a question. For example, if the question is "What is the color of the cat?", an extractive model might extract the span "The cat is black" as the answer.

Overview

Overall, the idea behind this paper is quite similar to the idea behind RAG (#7438 (comment)), but with a twist... Again, we have two main components:
The main difference between FiD and RAG is that:
|
Augmenting LLMs with Knowledge Graphs

Graft-Net

Preliminaries

Question Subgraphs

A question subgraph is a subgraph of the knowledge base from which we have pruned the nodes and edges that are irrelevant to a given question. In addition, we have pruned the irrelevant documents as well, and we keep the ones that are likely to contain the answer.

The Knowledge Base

Triplestore Knowledge Base

A Triplestore knowledge base is a database that consists of subject-predicate-object triples. An example of such a triple is: (Subject: Albert Einstein, Predicate: was born in, Object: Ulm, Germany). Triples are a great form of representing factual knowledge because they capture the nature of the relationship between a subject and an object and can be easily processed by LLMs. We can view this Knowledge Base as a graph whose vertices are the various subjects and objects (entities) and whose edges are the predicates between these entities. Each edge has a type that describes the kind of relation between the connected entities.

Text Corpus

A text corpus D is a set of documents {d_1, ..., d_|D|} where each document is a sequence of words d_i = (w_1, ..., w_|d_i|). Specifically, in the context of this paper, a document is essentially a sentence, and an article is a collection of documents.

NOTE: It has a similar structure to the knowledge base from RAG or FiD.

Entity Linking

We assume that there is a set L of links (v, d_p) connecting entity v with a word at position p in document d.

Graph Convolutional Network (GCN)

GCNs are great for classification of nodes in a graph-structured knowledge base. Here's how a GCN works for an input graph:
NOTE: The more layers the GCN has, the more multi-hop reasoning the model will be able to perform, because it will gather information from more distant neighbors.

Relational GCN

One problem arises when the knowledge-base graph is heterogeneous (more than one type of relation between entities). In that case, we want to take into consideration the type of relation that a node has with its neighbors before we average the embeddings. A relational GCN is similar to a regular GCN, but it uses a separate matrix for each type of relation. Therefore, when using a relational GCN, we aggregate the embeddings from all neighbors with a specific relation and we pass the averaged embedding into a separate CNN layer for each relation.

Lucene

Lucene is a Java library created by Apache that facilitates data search in a large corpus of text.

Overview

Question Subgraph Retrieval

The retrieval of the question subgraph Gq happens in two parallel pipelines:
Knowledge Base Retrieval

During the knowledge base retrieval, we retrieve a subgraph of the triplestore knowledge base as follows:

Text Retrieval

During the text retrieval phase, we retrieve documents (sentences) relevant to the question from the Wikipedia database. The text retrieval phase entails the following steps:

The Final Question Graph

The final question graph Gq consists of:

NOTE: Because the vertices of the graph can be either entities or documents, the graph is considered heterogeneous.

Overview of Graft-Net

Graft-Net consists of the following stages:
Pull-Net

Pull-Net uses the text corpus to supplement information extracted from the Triplestore in order to answer multi-hop questions. The subjects and objects in the triples contain links to relevant documents in the text corpus. PullNet uses these links to produce more factually-based answers.

Like GRAFT-Net, Pull-Net has an initial phase where it retrieves a question subgraph Gq. However, Pull-Net learns how to construct the subgraph, rather than using an ad-hoc subgraph-building strategy. More specifically, PullNet relies on a small set of retrieval operations, each of which expands a graph node by retrieving new information from the knowledge base or the corpus. PullNet learns when and where to apply these "pull" operations with another graph CNN classifier. The "pull" classifier is weakly supervised, using question-answer pairs.

The end result is a learned iterative process for subgraph construction, which begins with a small subgraph containing only the question text and the entities which it contains, and gradually expands the subgraph to contain information from the knowledge base and corpus that is likely to be useful. The process is especially effective for multi-hop questions. |
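To make the relational GCN description above a bit more concrete, here is a minimal sketch of one aggregation step over a toy triplestore. It is only an illustration under loud assumptions: the triples, the 2-dimensional embeddings, and the function names (`build_graph`, `relational_layer`) are all hypothetical, each relation gets a single scalar weight instead of the per-relation matrix described above, and there is no non-linearity. It is not the Graft-Net or Pull-Net implementation.

```python
from collections import defaultdict

# Hypothetical toy triplestore: (subject, predicate, object) triples.
TRIPLES = [
    ("Albert Einstein", "was born in", "Ulm"),
    ("Ulm", "is located in", "Germany"),
    ("Albert Einstein", "worked at", "ETH Zurich"),
]


def build_graph(triples):
    """Group outgoing neighbours by relation type: node -> relation -> [neighbours]."""
    graph = defaultdict(lambda: defaultdict(list))
    for subj, pred, obj in triples:
        graph[subj][pred].append(obj)
    return graph


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def relational_layer(graph, embeddings, relation_weights):
    """One relational aggregation step.

    For every node, average the neighbour embeddings separately per relation,
    scale each average by that relation's weight (a scalar here, as a
    simplification of a per-relation matrix), and add it to the node's own
    embedding.
    """
    updated = {}
    for node, own in embeddings.items():
        new = list(own)  # self-connection: start from the node's own embedding
        for relation, neighbours in graph.get(node, {}).items():
            averaged = mean([embeddings[n] for n in neighbours])
            weight = relation_weights[relation]
            new = [x + weight * a for x, a in zip(new, averaged)]
        updated[node] = new
    return updated


EMBEDDINGS = {
    "Albert Einstein": [1.0, 0.0],
    "Ulm": [0.0, 1.0],
    "Germany": [1.0, 1.0],
    "ETH Zurich": [2.0, 2.0],
}
WEIGHTS = {"was born in": 0.5, "is located in": 1.0, "worked at": 0.25}

layer_output = relational_layer(build_graph(TRIPLES), EMBEDDINGS, WEIGHTS)
```

Applying `relational_layer` a second time to `layer_output` would let "Albert Einstein" pick up information from "Germany" via "Ulm", which is the multi-hop effect of stacking layers mentioned in the GCN note.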
Note the mission of the lab is new fundamental theory, with practical grounding (re-invent The Web, Web3). This means we are not interested in new machine learning theory. It is a tool which failed us in 2005, and now finally might become production usable in 2028. We now have several PhD and MSc students active on machine learning:
"Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback", very fascinating literature. All very detailed stuff and high-performance. Totally unsuitable for a decentralised context with 1-2 billion connected smartphones with 8 cores each on average = 8-16 billion embedded CPU cores 😲

brainstorm

For achieving superhuman intelligence we need to invent a paradigm for storing all human knowledge and making it accessible to artificial reasoning engines or language models. @kandrio original thought: LLMs are simply too huge to work with practically. If we are able to split the facts and the language model part, we enable further growth. The mixing of knowledge and language is sub-optimal. We only need a new model of intelligence to fix this 😄 Bridge the semantic gap.

Another old problem known for decades is the problem of ambiguity and synonyms when adding new facts. Just adding a fact also implies embedding it and adding metadata. Establishing global consensus on The Internet on facts is notoriously hard. We failed to solve digital democracy on fact writing. Crowdsourcing LLM augmentation is unsolved. Metadata pollution will severely cripple your system performance, see the detailed overlapping issue of Is Justin Bieber Gay?. Currently the humans working at OpenAI decide on 4Chan/Reddit filtering versus unfiltered inclusion into their LLM. These OpenAI developers can also decide to feed live events into their LLM using an unfiltered Twitter feed: real-time event awareness.

Taxonomy of update

"LLM @ Android" Already very challenging and very sufficient for a TUDelft master thesis. Can you do minimal TFLite finetuning with size of LLM? "On-device LLM finetuning" |
Atlas (next generation of RAG): Few-shot Learning with Retrieval Augmented Language Models (2022)

Atlas is essentially the next generation of RAG, for few-shot learning tasks. When performing a task, from question answering to generating Wikipedia articles, Atlas starts by retrieving the top-k relevant documents from a large corpus of text with the retriever. Then, these documents are fed to the language model, along with the query, which in turn generates the output. Both the retriever and the language model are based on pre-trained transformer networks.

Atlas consists of:
Retriever

Like RAG, it entails a BERT_q and a BERT_d encoder. Unlike RAG, during fine-tuning of the retriever, Atlas trains both BERT_q and BERT_d (not only BERT_q). Hence, the BERT_d embeddings for each document in the BERTBASE need to be regularly updated so that they are in sync with the updated BERT_d. This is a computationally expensive task.

IMPORTANT: Atlas proposes jointly pre-training both the retriever and the generator model (similar to REALM), unlike RAG, which uses pre-trained models and trains end-to-end only during fine-tuning. |
REALM: Retrieval-Augmented Language Model Pre-Training (2020)

The first method to jointly pre-train the retriever and the generator. REALM uses an architecture that we've seen before (in RAG, FiD), but proposes a pre-training technique that yields great models.

Components

Just like RAG, we have two main components:
In REALM, all of the above models are trained during pre-training.

Initialization

At the beginning of training, if the retriever does not have good embeddings for Embed_input(x) and Embed_doc(z), the retrieved documents will likely be unrelated to the input. To avoid this cold-start problem, the authors warm-start the retriever (Embed_input + Embed_doc) using a simple training objective known as the Inverse Cloze Task (ICT) where, given a sentence, the model is trained to retrieve the document where that sentence came from. For the generator, the authors warm-start it with BERT pre-training. Specifically, they use the uncased BERT-base model (12 layers, 768 hidden units, 12 attention heads).

Pre-training

The unsupervised pre-training method that REALM proposes goes as follows:
Computational Challenges

During pre-training, both Embed_doc and Embed_input are trained. Because Embed_doc is updated during pre-training, after each backpropagation step, we need to:
This is a computationally expensive task, especially for huge databases, such as Wikipedia, which they used in this paper. So, the authors designed REALM such that the embedding updates happen every 100 backpropagation steps, as an asynchronous process.

Fine-tuning

The supervised fine-tuning method that the authors used in order to evaluate REALM on Open-domain Question Answering (Open-QA) goes as follows:
|
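As a small illustration of the Inverse Cloze Task warm-start mentioned in the REALM summary, here is a sketch of how (sentence, source document) training pairs could be built. Everything in it is an assumption made for illustration: the toy `DOCUMENTS`, the helper name `make_ict_examples`, and the choice to sample exactly one sentence per document. It only prepares the pairs; it does not train a retriever and is not the REALM code.

```python
import random

# Hypothetical toy corpus: each document is a list of sentences.
DOCUMENTS = [
    ["Ulm is a city in Germany.", "Albert Einstein was born in Ulm."],
    ["Lucene is a Java library.", "It is used for text search."],
]


def make_ict_examples(documents, rng):
    """Build (pseudo-query, document id) pairs for an ICT-style objective.

    One sentence is sampled from each document and used as the query;
    the retriever would then be trained to rank the document that the
    sentence came from above all other documents.
    """
    examples = []
    for doc_id, sentences in enumerate(documents):
        query = rng.choice(sentences)
        examples.append((query, doc_id))
    return examples


# A seeded generator keeps the sampling reproducible.
ict_examples = make_ict_examples(DOCUMENTS, random.Random(0))
```

Each pair in `ict_examples` can be read as "this sentence should retrieve that document", which is the supervision signal ICT provides without any labelled questions.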
RETRO: Improving Language Models by Retrieving from Trillions of Tokens (2022)

This paper's breakthrough is that it managed to pre-train and augment a relatively small LLM (25× fewer parameters than GPT-3) with a database that is 2 trillion tokens large (1000× larger than similar retrieval-augmented LLMs).

One main difficulty with augmenting LLMs with external knowledge bases is that training the retriever component can be computationally expensive, because while the document encoder becomes better, we need to re-compute the embeddings for each passage in the database. In this paper, they used a pre-trained document encoder, so they calculate the document embeddings once and do not update them again. Therefore, the main bottleneck that they're facing when accessing the external database is finding the K nearest documents to the input query.

One main difference with related work is that in RETRO they don't retrieve single sentences, but chunks (a retrieved sentence along with the following sentence). I don't yet understand if that helps.

Overview

Here's an overview of how RETRO produces an answer to an input query:
RETRO manages to perform attention in complexity that is linear to the number of retrieved passages. |
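The "embed once, then only search" point from the RETRO summary can be sketched as follows. This is a toy under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a frozen pre-trained encoder, the search is brute-force cosine similarity rather than whatever nearest-neighbour index the paper uses, and the chunks and vocabulary are made up.

```python
import math


def embed(text, vocab):
    """Hypothetical stand-in for a frozen encoder: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FrozenIndex:
    """Chunk embeddings are computed once at construction and never updated."""

    def __init__(self, chunks, vocab):
        self.vocab = vocab
        self.chunks = chunks
        self.vectors = [embed(chunk, vocab) for chunk in chunks]

    def nearest(self, query, k):
        """Return the k chunks most similar to the query (brute-force search)."""
        query_vector = embed(query, self.vocab)
        ranked = sorted(
            range(len(self.chunks)),
            key=lambda i: cosine(query_vector, self.vectors[i]),
            reverse=True,
        )
        return [self.chunks[i] for i in ranked[:k]]


VOCAB = ["cat", "dog", "black", "brown", "paris", "france"]
CHUNKS = ["the cat is black", "the dog is brown", "paris is in france"]
index = FrozenIndex(CHUNKS, VOCAB)
top_chunk = index.nearest("what color is the cat", k=1)
```

Because the encoder is frozen, adding more chunks only means embedding the new ones; nothing already in the index has to be recomputed, so the cost that remains is the nearest-neighbour search itself.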
LaMDA: Language Models for Dialog Applications (2022)

In this paper by Google, the authors manage to augment a language generation model with what they call a Toolset (TS).

The Toolset (TS)

The Toolset consists of:
The Toolset takes a single string as input and outputs a list of one or more strings. Each tool in TS expects a string and returns a list of strings. The information retrieval system is also capable of returning snippets of content from the open web, with their corresponding URLs. The TS tries an input string on all of its tools, and produces a final output list of strings by concatenating the output lists from every tool in the following order: calculator, translator, and information retrieval system. A tool will return an empty list of results if it can't parse the input (e.g., the calculator cannot parse "How old is Rafael Nadal?"), and therefore does not contribute to the final output list.

NOTE: Little information is given on how the information retrieval system works, apart from the fact that it entails a database, but can also provide web snippets along with their URLs.

The Architecture

LaMDA consists of two main sub-models:
|
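The Toolset behaviour described in the LaMDA summary (every tool gets the same string, unparseable input yields an empty list, outputs are concatenated in the order calculator, translator, information retrieval) can be sketched like this. The three tools below are hypothetical toys: the calculator only handles two non-negative integers with `+`, `-` or `*`, the translator is a one-entry phrase table, and the retrieval system is a keyword lookup over a made-up snippet. None of this reflects how LaMDA's actual tools are implemented.

```python
import re


def calculator(text):
    """Toy calculator: parses 'int op int' and returns the result as a string."""
    match = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*", text)
    if not match:
        return []  # cannot parse the input, so contribute nothing
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    result = {"+": left + right, "-": left - right, "*": left * right}[op]
    return [str(result)]


def translator(text):
    """Toy translator backed by a hypothetical phrase table."""
    phrases = {"hola": "hello"}
    key = text.strip().lower()
    return [phrases[key]] if key in phrases else []


def information_retrieval(text):
    """Toy retrieval: returns stored snippets whose keyword appears in the input."""
    snippets = {"rafael nadal": "Rafael Nadal is a tennis player."}
    lowered = text.lower()
    return [snippet for keyword, snippet in snippets.items() if keyword in lowered]


def toolset(text):
    """Try the input on every tool and concatenate the outputs in a fixed order."""
    return calculator(text) + translator(text) + information_retrieval(text)


sum_output = toolset("135+7721")
nadal_output = toolset("How old is Rafael Nadal?")
```

Returning an empty list instead of raising an error keeps the concatenation step trivial: a tool that cannot handle the input simply drops out of the final list.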
Internet-Augmented Dialogue Generation (2021)

Their method consists of two components:
We can train each of these modules separately if we have supervised data available for both tasks, the first module requiring (context, search query) pairs, and the second module requiring (context, response) pairs. The search engine is a black box in this system, and could potentially be swapped out for any method. In IADG, they use the Bing Search API for their experiments to generate a list of URLs for each query. Then, they use these URLs as keys to find their page content. |
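The two-module structure with a black-box search engine, as described above, could be wired together roughly as follows. This is a sketch with placeholder components: `keyword_query`, `toy_search`, `first_document_response`, the `PAGES` dictionary and its example.org URL are all hypothetical stand-ins for the trained query generator, the real search API, and the trained response generator.

```python
from typing import Callable, List

# Hypothetical page store standing in for "URL -> page content" lookups.
PAGES = {"https://example.org/ulm": "Ulm is a city in Germany."}


def keyword_query(context: str) -> str:
    """Placeholder for the search-query generator module."""
    return context.strip().rstrip("?").lower()


def toy_search(query: str) -> List[str]:
    """Placeholder for the black-box search engine: naive keyword match."""
    terms = [term for term in query.split() if len(term) >= 3]
    return [
        content
        for content in PAGES.values()
        if any(term in content.lower() for term in terms)
    ]


def first_document_response(context: str, documents: List[str]) -> str:
    """Placeholder for the response generator conditioned on retrieved documents."""
    return documents[0] if documents else "I don't know."


def respond(
    context: str,
    generate_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    generate_response: Callable[[str, List[str]], str],
    max_docs: int = 5,
) -> str:
    """Run the pipeline: context -> search query -> documents -> response."""
    query = generate_query(context)
    documents = search(query)[:max_docs]
    return generate_response(context, documents)


answer = respond("Where is Ulm?", keyword_query, toy_search, first_document_response)
```

Passing the three components in as callables mirrors the point that the search engine is swappable: any function from a query string to a list of documents could replace `toy_search` without touching the other two modules.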
SeeKeR: Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion (2022)

One model to do both retrieval and generation (wow) |
Draft

@synctext Here is the first complete draft of my literature survey:

Here's a snippet of my taxonomy table:

What do you think? |
Code Implementation

I recently dived into the implementation details of Retrieval-Augmented Generation (RAG), one of the most influential papers that I had to review for my Literature Survey (see this comment for a comprehensive review). RAG focuses on knowledge-intensive NLP tasks, as opposed to the dialogue-intensive tasks that a number of recent papers focus on.

The authors of RAG have open-sourced a specific version of their work, RAG-token, as part of the Hugging Face transformers library. I was able to access that model and write an example script where I employed RAG to answer a simple question: "Who holds the record in 100m freestyle?" Here is my script:

```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# a tokenizer receives an input text and breaks it into a list of tokens
# this way, it's easier for the model to understand the input query
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# initialize a pre-trained RAG Retriever which has access to a "dummy" subset of Wikipedia
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq",
    index_name="exact",
    use_dummy_dataset=True)

# initialize the RAG-token model that will generate the final answer to our query
# the generator of RAG-token will receive the retrieved evidence by the retriever
# along with the input question and it will produce an answer
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq",
    retriever=retriever)

# define our question, and tokenize it. Correct answer should be "michael phelps"
input_dict = tokenizer.prepare_seq2seq_batch(
    "who holds the record in 100m freestyle",
    return_tensors="pt")

# pass the question as input to RAG-token
generated = model.generate(input_ids=input_dict["input_ids"])

# print the answer
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Here is a screenshot that shows what RAG replied to my question (take a look at the bottom):

CC @synctext |
WOW 👏 Impressive work. Only very minor comments:
|
Updated version of the paper after @synctext's useful comments: |
brainstorm Survey+thesis
1 MSc course left for Q1. Did ML course and has industry Kubernetes experience. Prior Google Summer of Code experience. Python == main working language. Possibly: https://bitcoinlib.readthedocs.io/ on Python side 💶 and LLM/semantic search from systems side for ECTS 🏫
survey ideas: guide to cloud-free local-first LLM. Both training, re-training, and inference.
Thesis could go into numerous directions