
Eliminating Thousands of Hours Classifying Data with LangChain


So… data classification.

The bane of existence for anyone who’s ever tried to make sense of purchase records, supplier names, or financial transactions that seem to multiply like rabbits. You could spend your days in a spreadsheet-induced coma, manually tagging each transaction, or—get this—you could let the machines do it for you.

Enter LangChain, Serper, and embeddings. Sounds fancy, right?

Well, buckle up, because I’m about to show you how these shiny tools can make your life easier, one categorized transaction at a time.

Spend Classification: The Struggle

Imagine:

You’re drowning in a sea of purchase orders, invoices, and receipts. Every supplier’s name starts to blur together, and you’re questioning your life choices as you try to categorize that $5,000 coffee purchase (is it office supplies, or an emergency caffeine budget? Who’s to say?).

Instead of suffering in silence, let’s automate this nightmare.

The Power of LangChain

LangChain is like that friend who always seems to know exactly what to say—even when you don’t. It’s a framework for creating applications using large language models (LLMs). 

But here’s the kicker: the LLM behind it can classify transactions without being spoon-fed labeled examples for every possible category. Zero-shot prompting? Few-shot prompting? LangChain handles both. Your new AI assistant can categorize transactions it’s never even seen before, kind of like a dog that knows a treat is delicious before it has ever tasted one.

Let’s look at a simple example of how LangChain can be used to classify transaction data. Suppose we have a list of transactions that need to be categorized into “Office Supplies,” “Travel,” or “Consulting.” In this code snippet, we use LangChain to classify a few example transactions. The AI model reads each transaction description and assigns it to one of the predefined categories.

				
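from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Initialize the GPT model (temperature=0 keeps the labels as repeatable as possible)
llm = OpenAI(openai_api_key="your_openai_api_key", temperature=0)

# Define some example transaction descriptions
transactions = [
    "Purchased pens, paper, and notebooks",
    "Booked a flight to New York",
    "Consulting services for project X",
]

# Define the categories
categories = ["Office Supplies", "Travel", "Consulting"]

# Prompt that asks the model to pick exactly one of the predefined categories
prompt = PromptTemplate(
    input_variables=["categories", "transaction"],
    template=(
        "Classify the transaction into exactly one of these categories: {categories}.\n"
        "Transaction: {transaction}\n"
        "Category:"
    ),
)

# Create a simple classification chain
chain = LLMChain(llm=llm, prompt=prompt)

# Classify the transactions
for transaction in transactions:
    category = chain.run(categories=", ".join(categories), transaction=transaction).strip()
    print(f"Transaction: '{transaction}' -> Category: {category}")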

				
			

And voilà. Watch as LangChain categorizes several thousand rows at a time while you sip your coffee and nod approvingly at your newfound genius.

Happen to have a PC with many cores? You can flex by upping the ante with multiprocessing and watch throughput blast off like a MarioKart™ turbo mushroom.
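
Here’s a rough sketch of what that parallel fan-out could look like. One caveat: it uses a thread pool rather than separate processes, because a call to a hosted LLM spends most of its time waiting on the network. The `classify` function is a hypothetical keyword-based stand-in so the snippet runs without an API key; in practice you would swap in something like `chain.run(...)`. The worker count of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(description: str) -> str:
    """Hypothetical stand-in for a real call such as chain.run(description)."""
    text = description.lower()
    if "flight" in text or "hotel" in text:
        return "Travel"
    if "consult" in text:
        return "Consulting"
    return "Office Supplies"


transactions = [
    "Purchased pens, paper, and notebooks",
    "Booked a flight to New York",
    "Consulting services for project X",
]

# pool.map() preserves input order, so results line up with transactions
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(classify, transactions))

for transaction, category in zip(transactions, results):
    print(f"Transaction: '{transaction}' -> Category: {category}")
```

If the heavy lifting happens on your own CPU instead (say, computing embeddings locally), `ProcessPoolExecutor` from the same module offers the same `map` interface.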

Serper: When LangChain Needs a Little Help

Now, sometimes LangChain isn’t omniscient. That’s where Serper comes in. Need to Google something? Why not let an API do it for you? Serper can scrape the web for additional context when your transaction data is as ambiguous as that “meeting” that really just involved everything everywhere that was happening all at once.

Here’s how you can have Serper search for that supplier you’ve never heard of, while you do literally anything else:

Let Serper Do the Googling

As if things couldn’t be easier, LangChain actually ships with integrations called tools that equip your chain with the ability to search the web.

Now, Serper does the Googling while you pretend you’ve known about “ABC Consulting Services” all along. Foolproof.

				
from langchain.utilities import GoogleSerperAPIWrapper

# Initialize the Google Serper API wrapper
# (it expects your Serper key in the SERPER_API_KEY environment variable)
google_search = GoogleSerperAPIWrapper()

# The supplier you want to look up
query = "ABC Consulting Services"

# Run the search using Google Serper API
search_results = google_search.run(query)

# Print the results
print(search_results)


				
			

Embeddings: The Buzzword That Actually Does Stuff

You’ve probably heard the word “embeddings” thrown around in tech circles like it’s the next best thing since sliced bread.

tl;dr: It kind of is.

In this case, embeddings help you figure out how similar one transaction is to another—like finding out which of your transactions are all secretly coffee-related expenses.

You know...

For “office supplies.”

Why compare transaction descriptions by hand when you can let math do it for you?

				
					from sentence_transformers import SentenceTransformer, util

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define transaction descriptions and categories
transactions = [
    "Purchased printer ink and office chairs",
    "Booked a hotel for a business trip",
    "Hired an external consultant for market analysis"
]
categories = ["Office Supplies", "Travel", "Consulting"]

# Generate embeddings for transactions and categories
transaction_embeddings = model.encode(transactions)
category_embeddings = model.encode(categories)

# Compare each transaction to every category using cosine similarity
for transaction, transaction_embedding in zip(transactions, transaction_embeddings):
    similarities = util.pytorch_cos_sim(transaction_embedding, category_embeddings)
    # argmax() returns a tensor, so convert it to a plain int before indexing the list
    best_match = categories[int(similarities.argmax())]
    print(f"Transaction: '{transaction}' -> Best Match: {best_match}")

				
			

To give a more honest explanation, we are using a pre-trained embedding model to generate vector representations of transaction descriptions and categories. This is followed by an application of cosine similarity to find the best matching category for each transaction. This method allows for more nuanced classification based on the semantic meaning of the transaction descriptions.
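
If “cosine similarity” still sounds like hand-waving, here is the arithmetic spelled out by hand. This is only a sketch: the three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and the helper name `cosine_similarity` is mine, not a library call.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented purely for this example
category_vectors = {
    "Office Supplies": [0.9, 0.1, 0.0],
    "Travel": [0.1, 0.9, 0.1],
    "Consulting": [0.0, 0.2, 0.9],
}
transaction_vector = [0.8, 0.2, 0.1]  # pretend this encodes "printer ink"

# Score the transaction against every category and keep the highest
scores = {
    name: cosine_similarity(transaction_vector, vector)
    for name, vector in category_vectors.items()
}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```

The score depends only on the angle between two vectors, not their length, which is why a short description and a long one can still land on the same category.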

Putting It All Together: A Step-by-Step Approach

Data Preprocessing

Before we can classify our transaction data, we need to ensure it’s in a clean, consistent format. This may involve removing duplicates, standardizing formats, and handling missing data.

				
					import pandas as pd

# Load your transaction data
df = pd.read_csv('transactions.csv')

# Basic data cleaning
df.drop_duplicates(inplace=True)
df.fillna('', inplace=True)  # Handle missing data
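
The “standardizing formats” part deserves its own small sketch. The helper below is an assumption about what standardizing might mean for supplier names: lowercase everything, strip punctuation and a few legal suffixes, and collapse whitespace. The suffix list and the function name `normalize_supplier` are illustrative, not part of any library.

```python
import re

# Illustrative list of legal suffixes to drop; extend it for your own data
LEGAL_SUFFIXES = r"\b(inc|llc|ltd|corp|co)\b\.?"


def normalize_supplier(name: str) -> str:
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    text = name.lower()
    text = re.sub(LEGAL_SUFFIXES, " ", text)
    text = re.sub(r"[^a-z0-9& ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_supplier("  ABC Consulting Services, Inc. "))
```

With pandas you could apply it to a whole column, for example `df["supplier"] = df["supplier"].map(normalize_supplier)`, where the `supplier` column name is hypothetical.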

				
			

Using a pre-trained embedding model (such as BERT), we generate vector representations for our transaction data and our classification categories. Here is an example of the candidate categories and scores this produces:

				
('Laboratory Stands, Racks & Trays', 0.42857142857142855),
('Pipette Tips', 0.40816326530612246),
('Laboratory Supplies & Fixtures', 0.40816326530612246),
('Emergency Lighting & Accessories', 0.3877551020408163),
('Exterior Lighting Fixtures & Accessories', 0.7857142857142857),
('Interior Lighting Fixtures & Accessories', 0.7857142857142857),
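
A scored list like this still needs a decision rule. The sketch below shows one possible approach, not the only one: take the top-scoring category, but flag the transaction for review when the best score is low or when the runner-up is too close. The function name `pick_category` and both thresholds (0.6 and 0.05) are assumptions to tune against your own data.

```python
def pick_category(scored, min_score=0.6, min_margin=0.05):
    """Return (category, needs_review) from a list of (category, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    # Flag weak matches and near-ties so something else can double-check them
    needs_review = best_score < min_score or (best_score - runner_up_score) < min_margin
    return best_name, needs_review


scored = [
    ("Laboratory Stands, Racks & Trays", 0.42857142857142855),
    ("Pipette Tips", 0.40816326530612246),
    ("Emergency Lighting & Accessories", 0.3877551020408163),
    ("Exterior Lighting Fixtures & Accessories", 0.7857142857142857),
    ("Interior Lighting Fixtures & Accessories", 0.7857142857142857),
]

category, needs_review = pick_category(scored)
print(category, needs_review)
```

In the sample scores, the exterior and interior lighting categories tie, so this rule would flag the transaction rather than pick between them silently.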
				
			

This is where LangChain comes in. We can create a custom AI assistant that uses the embeddings to classify transactions. The system can be designed to use both similarity-based matching (using the embeddings) and rule-based logic (for handling specific known categories).

				
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Sample data: List of known categories and example keywords
known_categories = {
    "Office Supplies": ["paper", "pens", "stapler"],
    "Travel": ["flight", "hotel", "taxi"],
    # Add more categories and their specific keywords or transactions
}

# Initialize OpenAI embeddings and build a FAISS vector store for similarity matching.
# Each keyword is stored with its category as metadata.
# Both OpenAI classes read the key from the OPENAI_API_KEY environment variable.
embeddings_model = OpenAIEmbeddings()
texts = [keyword for keywords in known_categories.values() for keyword in keywords]
metadatas = [
    {"category": category}
    for category, keywords in known_categories.items()
    for _ in keywords
]
vector_store = FAISS.from_texts(texts, embeddings_model, metadatas=metadatas)

# Define a function for rule-based classification
def rule_based_classification(transaction):
    for category, keywords in known_categories.items():
        if any(keyword in transaction.lower() for keyword in keywords):
            return category
    return None

# Define a prompt template for the LLM-based fallback
prompt_template = PromptTemplate(
    input_variables=["categories", "transaction"],
    template=(
        "Classify the following transaction into one of these categories: {categories}.\n"
        "Transaction: '{transaction}'\n"
        "Category:"
    ),
)

# Set up the LLM chain
llm = OpenAI()
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

# By default the FAISS store returns a distance, so lower means more similar.
# This cutoff is an illustrative value; tune it on your own data.
MAX_DISTANCE = 0.4

# Function to classify a transaction
def classify_transaction(transaction):
    # First, try rule-based classification
    category = rule_based_classification(transaction)
    if category:
        return category

    # If no rule-based match, use embeddings for similarity-based classification
    results = vector_store.similarity_search_with_score(transaction, k=1)
    if results:
        document, distance = results[0]
        if distance <= MAX_DISTANCE:
            return document.metadata["category"]

    # If still no confident match, fall back to LLM-based classification
    return llm_chain.run(
        categories=", ".join(known_categories), transaction=transaction
    ).strip()

# Example transaction to classify
transaction = "Booked a flight to New York"

# Classify the transaction
category = classify_transaction(transaction)
print(f"The transaction '{transaction}' was classified as: {category}")

				
			

For transactions or suppliers that the system is unsure about, we can use Serper to perform web searches and gather additional context. This information can then be fed back into the classification system to improve accuracy.

				
from langchain.utilities import GoogleSerperAPIWrapper
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

# Initialize the Google Serper API wrapper (reads SERPER_API_KEY from the environment)
google_search = GoogleSerperAPIWrapper()

# Define your prompt template; the search results are passed in as context,
# because the LLM cannot search the web on its own
prompt = PromptTemplate(
    input_variables=["query", "search_results"],
    template=(
        "Here are Google search results for '{query}':\n"
        "{search_results}\n\n"
        "Provide a summary of the top results."
    ),
)

# Set up your LLM chain
llm_chain = LLMChain(
    llm=OpenAI(),  # Reads OPENAI_API_KEY from the environment
    prompt=prompt,
)

# Define a query
query = "latest trends in procurement technology"

# Run the search using Google Serper API
search_results = google_search.run(query)

# Print the results
print(search_results)

# Optionally, pass the results through your LLM chain to process them further
response = llm_chain.run(query=query, search_results=search_results)
print(response)

				
			

As the system processes more transactions, it can learn and improve its classification accuracy over time. This can be achieved through techniques like few-shot learning, where the system is periodically updated with new examples.
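
One lightweight way to read “periodically updated with new examples” is to keep a list of human-reviewed transactions and paste them into the prompt. The model itself is not retrained; only the prompt changes. The sketch below assumes that interpretation, and `build_few_shot_prompt` and `reviewed_examples` are hypothetical names.

```python
def build_few_shot_prompt(examples, categories, transaction):
    """Build a prompt that shows reviewed examples before the new transaction."""
    lines = [f"Classify each transaction into one of: {', '.join(categories)}.", ""]
    for description, category in examples:
        lines.append(f"Transaction: {description}")
        lines.append(f"Category: {category}")
        lines.append("")
    # End with the unlabeled transaction so the model completes the category
    lines.append(f"Transaction: {transaction}")
    lines.append("Category:")
    return "\n".join(lines)


# Examples a human has already confirmed; append to this list over time
reviewed_examples = [
    ("Booked a flight to New York", "Travel"),
    ("Purchased pens, paper, and notebooks", "Office Supplies"),
]
categories = ["Office Supplies", "Travel", "Consulting"]

prompt = build_few_shot_prompt(
    reviewed_examples, categories, "Hired an external consultant for market analysis"
)
print(prompt)
```

The resulting string can be sent to the LLM as-is. As the reviewed list grows, you would likely select only the few examples most similar to the new transaction to keep the prompt short.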

The Fine Print: Reality Check

Now, don’t get too carried away with all this automation talk. There are a few things to keep in mind:

Conclusion: Let the robots start doing this yesterday

So, there you have it. LangChain, Serper, and embeddings are your new best friends when it comes to automating transaction data classification. They’ll save you time, reduce errors, and make you look like a data genius. Just remember to keep an eye on them—after all, even the best AI needs a little human supervision now and then.

Now go automate something, kick back, and enjoy the fact that you’ve just made your life a whole lot easier.

You’re welcome.
