
Captivate An Audience Using Stunning Open-Source Data Visualization Tools

As an aside, I’m still trying to figure out an identity for this here corner of ye’ internet. The feedback on the previous article was galvanizing, manifesting a commitment to produce quality content. Feedback is welcomed and your voices are heard. Most of all, thanks for finding your way back to Things with Stuff.

  • Visual Studio Code
  • Python ver. 3.8+
  • AI Assistant (Claude 3.5 Sonnet > GPT 4o at the time of writing)

Introduction

I started learning Python a little over a year ago after learning that children across the world were being taught programming in grade school. It spurred the belief that a conversational wherewithal of programming, at least in pseudo-code, was becoming a generational prerequisite.

Pseudo-Code: Pseudocode is an informal, high-level description of a computer program’s logic. The phrase “if this, then that” is a fairly accurate representation of an application’s work instructions. For example, IF weather = rain, THEN backpack contains umbrella; the umbrella is our THAT in this case.
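To make that concrete, here is a minimal sketch of the same if-this-then-that logic expressed in Python (the weather value and backpack contents are made up purely for illustration):

weather = "rain"                      # the IF condition
backpack = ["laptop", "notebook"]

if weather == "rain":                 # "if this..."
    backpack.append("umbrella")       # "...then that"

print(backpack)                       # ['laptop', 'notebook', 'umbrella']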

And yet, I’m stuck between belief & doubt regarding the amount of savvy one needs to thrive in this world. While Nvidia CEO Jensen Huang thinks AI will make coding obsolete, there’s an abundance of evidence that suggests otherwise.

Nonetheless, I don’t regret anything. The incalculable wealth of knowledge and experience across the internet has brought me to the philosophy that “if you can conceive of it (within reason), you can build it”. The abundance of open-source software and public GitHub repositories facilitates a sort of contradiction where solutions to many problems are abundant, but none of them are your solution. The result becomes a gauntlet of critical thinking & problem solving triggering a satisfying dopamine rush at the finish line.

Your mileage may vary and you may not find the same degree of near-masochistic mental stimulation. But at the very least, I can put myself through the wringer and struggle through the hard part so you don’t have to.

Context

Big data, which visualizations like this represent, is publicly available from organizations like the NHS.

Unstructured or dirty data is as commonplace and in circulation as the litter occupying our city streets and gutters. Pretty dashboards like this are often sourced from data characterized by a lack of organization, consistency, and cleanliness. Typical traits include mixed data types (which I’ve come to refer to as ‘Jellybeans’), duplicates, and missing or incorrectly formatted values. While mishandling is typically a prime enabler of dirty data, there are many scenarios where it’s unavoidable, such as:

  • Customer Feedback: Free-form text entries from surveys, reviews, and social media comments. These can contain misspellings, slang, and varying levels of detail.
  • Email Communications: Texts from emails which can be lengthy, contain nested threads, attachments, and varying formats that make automated parsing challenging.
  • Scanned Documents: PDFs or images of documents that require OCR (Optical Character Recognition) to convert into text, often resulting in errors and inconsistencies.
  • Web Scraping: Data extracted from websites that often comes in an unstructured format, with HTML tags, scripts, and styles that need to be cleaned and parsed.

Understandably, many people may not consider the janitorial effort in presenting manicured dashboards and reports. Below is a table illustrating data that could be considered in rough shape. Today, we’ll share some tactics that can add to your arsenal to efficiently manage the invisible work that goes into priming data for a dashboard. 

| ID | Name          | Email                    | Age | Join Date  | Department  | Salary  |
| 1  | John Doe      | john.doe@example.com     | 29  | 2021-06-15 | Sales       | $50000  |
| 2  | Jane Smith    | janesmithATexample.com   | 34  | 2015-07-20 | Marketing   | 62000   |
| 3  | Sam O’Neil    | sam.oneil@example        | N/A | 2020-11-03 | IT          | 70,000  |
| 4  | Mark Brown    | mark.brown@example.com   | 41  | 2021-09-23 | HR          | 68000   |
| 5  | Emily Johnson | emily.johnson@.com       | 27  | 2022-02-30 | IT          | -45000  |
| 6  | Michael Lee   |                          | 25  | 2020-14-12 | Engineering | 72.000  |
| 7  | Robert Brown  | robert.brown@example.com | 45  | 2021-06-15 | Sales       | N/A     |
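As a small taste of what’s coming, here is a minimal, hypothetical pandas sketch of how issues like these can be surfaced programmatically (the file name dirty_employees.csv and the column names simply mirror the table above):

import pandas as pd

# Hypothetical file mirroring the table above.
df = pd.read_csv("dirty_employees.csv")

# Mixed types and missing values show up immediately.
print(df.dtypes)
print(df.isna().sum())

# Emails without an "@" or a domain are easy to flag.
bad_emails = ~df["Email"].str.contains(r"@.+\.", na=False)
print(df.loc[bad_emails, ["ID", "Email"]])

# Coercing Salary to a number exposes "$50000", "70,000", and "72.000".
print(pd.to_numeric(df["Salary"].astype(str).str.replace(r"[$,]", "", regex=True), errors="coerce"))

# Impossible dates like 2022-02-30 become NaT instead of crashing the script.
print(pd.to_datetime(df["Join Date"], errors="coerce"))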

Toolkit

Installing Visual Studio Code, Conda, and Python

Visual Studio Code is a powerful Integrated Development Environment (IDE) that makes it easy to write, test, and deploy code. To get started, simply download the latest version from the official Microsoft website and follow the on-screen instructions.

Anaconda ships with Conda, a popular open-source package manager and environment management system that simplifies the process of installing, running, and updating Python packages. You can download the Anaconda distribution, which includes Conda, Python, and a variety of scientific computing libraries, from the Anaconda website.

We’re going to skip the installation of the two since they are fairly straightforward, and instead pick up inside our Integrated Development Environment, or IDE.

Cleaning & Staging Our Data

I’ll be the first to admit that playing the role of data custodian is not fun. It’s prone to mistakes while necessitating a peculiar breed of patience & diligence. But that’s not the reason we’re here, right? As you progress through this exercise, keep a running tally of how long the equivalent cleanup would take you in Excel. No pain, no pleasure, and the juice is worth the squeeze.

Pandas & Jupyter

Pandas is a powerful data manipulation library in Python, and Jupyter is an interactive notebook environment that allows you to run code and combine your output with markdown to help interpret data. Humor me with this because I really love metaphors.

Imagine Python as the master chef in a hyper efficient kitchen, and pandas as their skilled sous-chef. Together, they transform raw ingredients (data) into exquisite dishes (insights). Jupyter can be seen as the dining room where the master chef and sous-chef  present their culinary creations to the guests (data scientists and analysts).

    • Python: The master chef, overseeing the entire process, ensuring that every step is executed with precision. It handles the broad tasks, directs the flow of the kitchen, and integrates various techniques and tools to create a seamless culinary experience.

    • pandas: The sous-chef, focusing on the meticulous preparation and initial processing of ingredients. It chops, dices, and preps the data, making it ready for the master chef to work their magic. Pandas ensures that the data is clean, organized, and in the perfect state for further analysis.

    • Jupyter Notebook: The dining room, designed for showcasing the culinary (data) creations in an interactive and visually appealing manner. It’s where the master chef presents each dish step-by-step, allowing the guests to appreciate the process, the ingredients, and the final product.

The terminal is your machine’s command-line interface, enabling you to interact with your computer’s operating system directly. Whether you’re using a Windows or Mac machine, the terminal provides a versatile platform for executing various commands, including installing Python libraries such as the data analysis tools Pandas and Jupyter Notebook.

				
					pip install pandas jupyter
				
			
Pip, Libraries, and Virtual Environments

Pip is a powerful package manager for Python, enabling easy installation and management of libraries.

Libraries are collections of pre-written code that extend Python’s functionality.

Virtual Environments (venv) isolate project dependencies, ensuring consistent development and deployment.

If there were ever a good time to stretch, this is the moment any ritualistic urges should flourish. And if you’re still with me and have a Jupyter notebook open, we can import pandas and parse our CSV into a dataframe. We’re going to draw from a more real-world example for this:

import_libs_and_df
In [1]:
import pandas as pd
try:
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
    df.head()
except Exception as e:
    print(e)
Error tokenizing data. C error: Expected 1 fields in line 11, saw 5

If I were to translate this into human-speak, it would resemble something like:

“Hi pandas, I’m going to call you pd”

“pd, can you try to create a dataframe, df. I need it filled with everything contained in this csv. Then show me the first 5 rows using df.head() to validate?”

“I’m not saying something is going to go wrong (except Exception as e:), but if it does, let me know what did: print(e)”

Lo and behold, we get an error. But we didn’t even look at our data, did we?

Error Handling: Exceptions, as the name suggests, are a means to manage the routing of our application logic, establishing the equivalent of checks, stops, & bypasses for occurrences that would otherwise result in a failure.
Breakdown
				
					import pandas as pd
				
			

Barring a few exceptions, the majority of the libraries & modules that Python brings to the table are imported into your script. Remember to maintain environments if your script has any dependencies, such as pandas, to prevent any boo-boos.

The as pd part of our statement signals that we will be abbreviating pandas as pd, simultaneously opening up all of pandas’ methods to us under that alias.

				
try:
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
except Exception as e:
    print(e)
				
			

Try-blocks provide a structured approach to error handling, allowing you to anticipate and gracefully manage potential issues that may arise during program execution.

By encapsulating code within a try-block, you can proactively address and resolve errors, ensuring a seamless user experience and maintaining the integrity of your application.

				
df.head()
print(e)
				
			

head() is a method specific to pandas dataframes that becomes available after reading a CSV into the object we defined as df. Its output is a snapshot of the first 5 rows of the table.

Why this matters comes back to Python’s Global Interpreter Lock (GIL), a very fancy way of saying the interpreter can only ‘do one thing at a time.’

When dealing with CSV files, it’s crucial to understand the format and delimiters used in the data. In this scenario, the CSV file is tab-delimited, which means that tabs (\t) separate the columns instead of the usual commas. This can sometimes lead to issues when reading the file if the delimiter is not correctly specified.

				
					---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[6], line 3
      1 import pandas as pd
----> 3 df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
      4 df.head()
File E:FuzzyMatcher.venvspend_asseessmentLibsite-packagespandasioparsersreaders.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File E:FuzzyMatcher.venvspend_asseessmentLibsite-packagespandasioparsersreaders.py:626, in _read(filepath_or_buffer, kwds)
    623     return parser
    625 with parser:
--> 626     return parser.read(nrows)
File E:FuzzyMatcher.venvspend_asseessmentLibsite-packagespandasioparsersreaders.py:1923, in TextFileReader.read(self, nrows)
   1916 nrows = validate_integer("nrows", nrows)
   1917 try:
   1918     # error: "ParserBase" has no attribute "read"
   1919     (
   1920         index,
   1921         columns,
   1922         col_dict,
-> 1923     ) = self._engine.read(  # type: ignore[attr-defined]
   1924         nrows
   1925     )
   1926 except Exception:
   1927     self.close()
File E:FuzzyMatcher.venvspend_asseessmentLibsite-packagespandasioparsersc_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)
File parsers.pyx:838, in pandas._libs.parsers.TextReader.read_low_memory()
File parsers.pyx:905, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:874, in pandas._libs.parsers.TextReader._tokenize_rows()
File parsers.pyx:891, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2061, in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 11, saw 5
				
			

Inspecting the CSV file through Excel or a text editor reveals that the data is separated by tabs. This inspection can show various anomalies such as inconsistent formatting, special HTML characters, and mixed data types within columns. To properly read the tab-delimited data, you can use the pandas library in Python, specifying the tab separator sep='\t'. This ensures that the data is correctly parsed into columns.

Excerpt of Raw Data
				
					Heat & Air 	Voucher                       	-14.27	713.5100	USD	1.00	EA		0132139                  	Supplier Invoice Number	2023-01-03	0	fitting
Heat &amp; Air Voucher&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;-14.27<br>
USD1.00EA<br>
0132139&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Supplier Invoice Number<br>
2023-01-03<br>
0fitting

				
			

Let’s reconcile our code based on the feedback stemming from the errors:

fix_df
In [1]:
import pandas as pd
try:
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t')
    df.head()
except Exception as e:
    print(e)
C:UsersTravAppDataLocalTempipykernel_104604048537430.py:4: DtypeWarning: Columns (10,12,50) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t')

More errors! And thus we rinse & repeat starting from diagnosis.
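For the curious, the DtypeWarning itself points at the remedy. Below is a minimal sketch of the next iteration, using the low_memory=False option the warning suggests; the dtype override mentioned in the comment uses a placeholder column name, since the real headers depend on your file:

import pandas as pd

try:
    # Read the whole file before inferring column types, as the warning suggests.
    df = pd.read_csv(
        '../data/Spend_Intake_010123_123123_071024.csv',
        sep='\t',
        low_memory=False,
    )
    # Alternatively, once the troublemaker columns are identified, pin them
    # explicitly, e.g. dtype={'gl_code': str} (placeholder column name).
    df.head()
except Exception as e:
    print(e)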


Work Smarter, Not Harder

If you are frustrated already, the good news is that coding might not be for you. The even better news is that coding has never been more accessible because of the prevalence of AI and language models. For tasks that err on the side of low complexity, it’s entirely possible to copy and paste your error messages straight out of the console and into a chat prompt. Never has there been a better time to express what a time it is to be alive.

Generative AI for Error Handling & Code Writing

Not only will GPT clean your code, but you can even circumvent much of the debugging by having it generate functions, a predetermined set of instructions to complete one task (ideally).

To rewind back to the idea of Python being aesthetically syntactical, we can observe patterns in generated code that support its readability. Instead of cluttering the script with comments, the code explains itself.

As if our yield wasn’t lucrative enough, this entire script is written as a series of functions. This perpetuates the philosophy of modularity & reusability, as our functions can now be called from other scripts.

Insulation of Sensitive Data

If you are handling transaction data like this, there is a valid concern of mishandling or disclosure when using modern tools such as ChatGPT. And for all intents & purposes, I agree that no one should be blindly uploading sensitive data to the chat interface. However, we can exploit the chat interface’s value while insulating our sensitive data using Python. 

By defining functions and variables that interact with our data, and then having ChatGPT engage with our code instead, we’ve effectively created a cipher that masks our privileged information from any data harvested for model retraining.
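As a concrete illustration, here is a minimal, hypothetical sketch of that insulation. The column names and cleaning steps are stand-ins; the point is that this function can be pasted into a chat prompt, critiqued, and extended without a single row of real data leaving your machine:

import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a raw transaction dataframe.

    Only this code is shared with the AI assistant; the dataframe stays local.
    """
    cleaned = df.copy()

    # Strip leftover HTML entities and surrounding whitespace from text columns.
    text_cols = cleaned.select_dtypes(include="object").columns
    cleaned[text_cols] = cleaned[text_cols].apply(
        lambda col: col.str.replace(r"&\w+;|<br>", "", regex=True).str.strip()
    )

    # Coerce amounts and dates; unparseable values become NaN/NaT for review.
    cleaned["amount"] = pd.to_numeric(cleaned["amount"], errors="coerce")
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")

    # Drop exact duplicates introduced upstream.
    return cleaned.drop_duplicates()

# Run locally, never in the chat window:
# df_cleaned = clean_transactions(df)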

 

Reviewing Our Cleansed Data in Jupyter

Scrubbed Confidential Data: There are a few steps intentionally excluded from the flow above. This is to encourage a little bit of connecting the dots. But to err on the side of caution, we’re going to manufacture an output using the library Faker. Instructions to replicate this entire exercise will be consolidated at the end of this article.
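For context, here is a rough sketch of how a stand-in transactions.csv like the one read below could be manufactured with Faker. The column names mirror the next cell, but the value generators, ranges, and categories are arbitrary choices for illustration:

import csv
import random
from faker import Faker

fake = Faker()
currencies = ["USD", "GBP", "AUD", "JPY"]
gl_descriptions = ["Travel", "Shopping", "Supplies"]

with open("transactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([
        "id", "date", "vendor_name", "amount", "currency",
        "item_description", "gl_code", "gl_description",
    ])
    for i in range(1, 501):
        writer.writerow([
            i,
            fake.date_between(start_date="-2y", end_date="today"),
            fake.company(),
            round(random.uniform(10, 5000), 2),
            random.choice(currencies),
            fake.sentence(nb_words=6),
            random.randint(10_000_000, 99_999_999),
            random.choice(gl_descriptions),
        ])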
example
In [6]:
import pandas as pd
column_names = ['id', 'date', 'vendor_name', 'amount', 'currency', 'item_description', 'gl_code', 'gl_description']
df = pd.read_csv('transactions.csv', names=column_names, header=0)
df.head()
Out[6]:
id date vendor_name amount currency item_description gl_code gl_description
0 2 2022-11-06 Harding Inc 4621.52 AUD Answer account which exist reality share. 25202669 Travel
1 3 2024-01-22 Meyer LLC 4478.16 USD Year important account despite alone. 44034427 Shopping
2 4 2023-07-12 Sandoval, Gallagher and Kelley 1349.55 JPY Find which boy try. 92539337 Travel
3 5 2024-04-23 Mooney PLC 2785.77 GBP We author this compare small purpose. 20214377 Travel
4 6 2024-07-11 Duncan and Sons 2759.27 GBP Wall eight crime daughter prove. 83986396 Travel

Database Storage

We are going to complete the staging step by moving our dataframe into a database. But why databases? They are efficient for securely handling large datasets because they support fast queries and data integrity. We also don’t have to manage conflicting reads and writes against the state of a csv or xlsx; god forbid multiple people editing it in parallel.
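To show what “fast queries” buys us, here is a small hypothetical example of asking a question of the database once it exists (the table and column names match the ones created later in this article):

import sqlite3
import pandas as pd

# Once the dataframe lives in SQLite, questions become one-line queries.
conn = sqlite3.connect('cleaned_data.db')
spend_by_category = pd.read_sql_query(
    """
    SELECT gl_description, ROUND(SUM(amount), 2) AS total_spend
    FROM cleaned_transaction_data
    GROUP BY gl_description
    ORDER BY total_spend DESC
    """,
    conn,
)
conn.close()
print(spend_by_category)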

For our database, SQLite is a lightweight, disk-based database that doesn’t require a server setup. This script consolidates all of the cleansing efforts in Jupyter, initializes our database, and loads our dataframe as a table.

				
import sqlite3

def store_data_in_db(df, db_file_path, table_name):
    # Connect to the SQLite database (the file is created if it doesn't exist)
    conn = sqlite3.connect(db_file_path)
    # Store DataFrame to SQLite table, replacing it if it already exists
    df.to_sql(table_name, conn, if_exists='replace', index=False)
    # Commit changes and close connection
    conn.commit()
    conn.close()
    print(f"Cleaned data has been successfully stored in the database '{db_file_path}' in the table '{table_name}'.")

# Store the cleaned data
store_data_in_db(df_cleaned, 'cleaned_data.db', 'cleaned_transaction_data')

				
			

Empowering Data with Purpose

Streamlit

Streamlit is an open-source app framework generally used by Machine Learning and Data Science teams. I don’t think they should keep it to themselves because it’s the personification of a chef’s kiss: beautiful, custom web apps to interpret data with minimal effort. That said, while Streamlit is fantastic for simple, single-page applications, it may not be the best choice for complex, multi-page applications. For more intricate projects, other frameworks might be more suitable.

Pygwalker

Pygwalker is an open-source Python library that turns a pandas dataframe into a Tableau-style, drag-and-drop interface for visual exploration, and it pairs naturally with Streamlit.

Key Features
  1. Interactive Visualizations: Pygwalker provides a wide range of interactive charts and graphs that can be customized to fit your data exploration needs.
  2. Ease of Use: The library is designed to be intuitive and easy to use, allowing users to create visualizations with minimal code.
  3. Integration with Pandas: Pygwalker integrates seamlessly with Pandas, the popular data manipulation library in Python, making it easy to visualize data directly from Pandas DataFrames.
  4. Customizability: It offers extensive customization options, allowing users to tweak the appearance and behavior of their charts to suit their preferences.
  5. Web-Based Visualization: The visualizations created with Pygwalker are rendered as web components, which means they can be easily embedded in web applications and shared with others.
Use Cases
  • Data Exploration: Quickly create visual representations of your data to identify trends, patterns, and outliers.
  • Reporting: Generate interactive reports that can be shared with stakeholders.
  • Presentations: Enhance your presentations with dynamic and engaging visualizations.
  • Dashboards: Build interactive dashboards to monitor key metrics and KPIs.
How do we use it?

We’re going to combine everything we’ve stepped through and create a dataframe using what is now stored in our database:

				
import pandas as pd
import sqlite3
import streamlit as st
import pygwalker as pyg
import streamlit.components.v1 as components

def load_data_from_db(db_file_path, table_name):
    # Connect to the SQLite database
    conn = sqlite3.connect(db_file_path)
    # Load data from the specified table
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    # Close the database connection
    conn.close()
    return df

# Load cleaned data from the database
df_from_db = load_data_from_db('cleaned_data.db', 'cleaned_transaction_data')

# Build the Streamlit page
st.title("Transaction Data Dashboard")

# Display the dataframe
st.write(df_from_db)

# One way to embed the Pygwalker explorer inside the Streamlit page
components.html(pyg.to_html(df_from_db), height=900, scrolling=True)

# Launch from the terminal with: streamlit run app.py

				
			

The script above does the following:

  1. Loads Data: It defines a function to load cleaned transaction data from an SQLite database into a pandas DataFrame.
  2. Displays Data: It uses Streamlit to create a web app that displays the DataFrame.
  3. Creates Dashboard: It leverages Pygwalker to generate an interactive dashboard from the data.
  4. Runs App: Launched from the terminal with streamlit run, it serves an interactive interface for data visualization.
  If we are all chefs preparing a meal, we can top off the hard work with a little garnish by way of a command-line input.
				
					streamlit run app.py
				
			

And voilà. This shouldn’t take you more than 30 minutes to complete, at a $70 discount over Tableau (which still requires cleaning data!).

FAQ

How can I replicate this step by step without the commentary?

I’ve posted this entire walkthrough to GitHub, where you can copy the repository and navigate step by step through the exercises above.

Outside of an overarching GPT subscription, this cost absolutely nothing to reproduce.

Feel free to post questions in the comments, on GitHub, or email me directly at travis@vasceannie.com.

