Building a Second Brain you could Chat(GPT) With

Most knowledge economy careers require us to build up vast mental models of how things work. Using empathy and wisdom we apply this to business challenges and opportunities to generate impact.

With LLM like ChatGPT, there are now new possibilities of interacting with this body of knowledge.

By externalizing our knowledge (in detailed notes) and the using ChatGPT to retrieve, navigate and augment what we already know we could amplify our capabilities.

It is also not a far stretch to opine that this could boost creativity too - as Steve Jobs once mused: “Creativity is just connecting things. When you ask creative people how they did something, they feel a little guilty because they didn’t really do it, they just saw something.”

You might ask - “But why do we want to do that, after all we do know what we already know - can’t we just either simply read our notes or use ChatGPT ‘as is’ ?

I believe this approach offers a few advantages:

  • Retrival Information quality - ChatGPT’s responds improves when we provide relevant information. As these notes are curated, the quality is higher.
  • Contextualization - a LLM could pull in all the relevant information and contextualize some of these concept to the task at hand.
  • Enrichment - We could chain commands to ChatGPT to not just retrieve/process our information but to add anything relevant details from external sources

The Second Brain - Obsidian

To store all my notes, I am a big fan of Obsidian Obsidian’s tag line is:

“A Second Brain For You, Forever”

and many accounts of its fan base would agree without hesitation that it does act as a second brain for many of us.

A Cloudy Brain

Obsidian

The biggest advantage to Obsidian is that locked-in is minimal. This is because everything is stored as a markdown file. This format is non-proprietary and travels well across systems.

A markdown file is just a text file format with some light-weight features that lets you add stylings and formatting. ie, ` # heading ` produces a headline ` - dashes produces list items` (Ppst..! this whole post is actually a Markdown file being processed by Jerkyll!)

By simply throwing your obsidian markdown notes into a Google Drive synced folder this makes bring everything the cloud a clinch. The above is really important as this make your knoweldge accessible both via the program and by you anywhere. Google Drive also has a friendly interface to move files in and out of - this make upkeep of the system trivial.

Personal Knowledge Management System

The second advantage is that Obsidian makes it really easy to implement the Zettelkasten Note-Taking Method which is a good way to collect and synthesize knowledge. This is due to the ease of creating structure through Obsidian’s tagging feature and internal linking feature.

One core idea of zettelkasten is to take notes and let the structure emerge by itself.

In its original conception. A giant box of cards record concepts- each card is an atomic idea that is processed and rewritten in one’s own words A unique id then assigned to each card and then the question is asked - which other cards in the collection is related to this card. Through this process a dense network of ideas is then created by this linkage.

I’ve used a very similar system but have altered it somewhat. (Purists may wag their finger at my approach)

However, I’ve found that building off high quality courses I am able to quickly form the backbone of a collection of notes. Only after I have established good overview and feel for how each concepts relates to each other, then do I switch to the Zettelkaten way of not forcing anymore artificial structure. Some may balk at this liberal interpretation of the system, but this had great mileage for me to speed up my learning

When additional notes are added to the “pre forced structure” those that I found important I synthesised immediately into the structure, interlinking as needed. However, overly detailed non-core references are not heavily processed or re-interpreted and are simply only understood at a general level.

I would take care to tags and links on how the concepts relate to the rest of my notes and rely on the semantic feature of embeddings + chatGPT for future retrival.

Querying And Chatting with The Brain

Most of this integration follows the approaches highlighted in this excellent article from Singapore Government Digital Services - Integrating ChatGPT with internal knowledge base and question-answer platform

The key insight is that ChatGPT is prone to hallucinations especially if the topic is niche enough without enough public data. So instead of directly interacting and chatting with ChatGPT - we first perform a search on our Obsidian database (a vectorized version - more on that later) and then provide relevant documents as context to ChatGPT.

This way chatGPT can answer as though it has been trained with the internal dataset. The code run on Google Colab.

The flow consists of 1) Retrieving documents from Google Drive 2) Chunking up the document into fragments of about 1000 characters 3) The above is turned into embeddings (a long array that represents the semantic meaning of the chuck) 4) The chunks are sent to a vector store 5) a prompt is template prepared 6) The template is chained with the most relevant chunks from the vector store. 7) A user query is fed into the above and sent to open ai api for response

Much of the above is mediate through the Langchian library

First we connect up your Google Drive

from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/drive')

Then add your open ai api key into your environment variable - here comes from my Secrets manager for security with resource name omitted

client = secretmanager.SecretManagerServiceClient()
response = client.access_secret_version(request={"name": resource_name})
os.environ["OPENAI_API_KEY"] = response.payload.data.decode('UTF-8')

now we retrieve and chuck up all documents in our drive’s directory

loader = DirectoryLoader("drive/MyDrive/Obsidian/directoryname",glob="**/*.md")
docs = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000 ,chunk_overlap  = 200,
    length_function = len)
texts = text_splitter.split_documents(docs)

these are then turned into embeddings and sent to a vector store.

embeddings = OpenAIEmbeddings(openai_api_key=os.envirn["OPENAI_API_KEY"])
vector_store = FAISS.from_documents(texts, embeddings)

a prompt template is created here that requests response to also include consideration of the context provided and to include reference links


from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template="""Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
add in additional details that you think are relevant and beyond the context provided and answer with: "ADDITIONAL DETAILS:"
If you don't know the answer, just say that "I don't know", don't try to make up an answer and do not add additional details if there is nothing relevant.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

A chain is created to combine a retrival of the most relevant document with the template

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256)  # Modify model_name if you have access to GPT-4
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs
)

from IPython.display import display, Markdown
def print_result(result):
  output_text = f"""### Question: 
  {query}
  ### Answer: 
  {result['answer']}
  ### Sources: 
  {result['sources']}
  ### All relevant sources:
  {' '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
  """
  display(Markdown(output_text))

The finally a query form the user can be captured and sent to the system

query = "I run a ecommerce business what are the important web events to track, the custom parameters that are useful and detail out the GTM steps and GA steps?"
result = chain(query)
print_result(result)

Question:

I run a ecommerce business what are the important web events to track, the custom parameters that are useful ?

Answer:

For an ecommerce business, the main event to track is purchases. However, there are other smaller events/interactions that lead users towards that main goal, such as “add to cart”, “start checkout”, “subscribe to newsletter”, etc. It’s important to identify these events and track them as well. Custom parameters that can be useful for ecommerce businesses include product ID, product name, product category, and transaction ID. To set up event tracking in Google Tag Manager, you should first prepare a list of events and interactions that you want to track. Then, create a spreadsheet of all the tags and their settings that you will implement. This will help you see what you need from your developers and create a solid naming convention Once you have your list of events and tags, you can start setting up your tags in Google Tag Manager. This involves creating a new tag, selecting the appropriate tag type (e.g. Google Analytics), and configuring the tag settings (e.g. tracking ID, event category, event action, etc.). After setting up your tags in Google Tag Manager, you will need to test them to make sure they are working properly. You can do this by using the preview mode in Google Tag Manager and checking the Google Analytics real-time reports.

Sources:

drive/MyDrive/Obsidian/Mar_GA4GTMBQ/GA4GTMBQ/GA4GTMBQ/GTM.md drive/MyDrive/Obsidian/Mar_GA4GTMBQ/GA4GTMBQ/Reference/reference____ event tracking.md drive/MyDrive/Obsidian/Mar_GA4GTMBQ/GA4GTMBQ/Reference/reference____Google Tag Manager.md

Chatting with the Brain

We could see quite a nice result. References are returned for further investigation and verification against hallucinations. The whole knowledge graph is searched for semantically relevant information and is included in the response.

Connections that may not be immediately apparent would also surface. We could even modify the program to make this a continual conversation!

There we have it - a robust system that is easy to maintain that could be used to access our second brain.