Captivate An Audience Using Stunning Open-Source Data Visualization Tools
Learn how to create a robust data cleaning and visualization tool using Python, Pandas, SQLite, and Streamlit with the help of AI to simplify and debug your code. This setup can handle large datasets and provide beautiful visualizations, making it great for quick prototyping and sharing data visualizations.

I started learning Python a little over a year ago after learning that children across the world were being taught programming in grade school. It spurred the belief that a conversational wherewithal with programming, at least in pseudo-code, was becoming a generational prerequisite.
And yet, I'm stuck between belief & doubt regarding the amount of savvy one needs to thrive in this world. While Nvidia CEO Jensen Huang thinks AI will make coding obsolete, there's an abundance of evidence that suggests otherwise.
Nonetheless, I don't regret anything. The incalculable wealth of knowledge and experience across the internet has brought me to the philosophy that "if you can conceive of it (within reason), you can build it". The abundance of open-source software and public GitHub repositories facilitates a sort of contradiction where solutions to many problems are abundant, but none of them are your solution. The result becomes a gauntlet of critical thinking & problem solving triggering a satisfying dopamine rush at the finish line.
Your mileage may vary, and you may not find the same degree of near-masochistic mental stimulation. But at the very least, I can put myself through the wringer and struggle through the hard part so you don't have to.
Unstructured or dirty data is as commonplace as the litter occupying our city streets and gutters. Pretty dashboards like this are often sourced from data characterized by a lack of organization, consistency, and cleanliness. Typical traits include mixed data types (which I've come to refer to as 'Jellybeans'), duplicates, and missing or incorrectly formatted values. And while mishandling is typically the prime enabler of dirty data, there are plenty of scenarios where it's simply unavoidable.
Understandably, many people may not consider the janitorial effort behind presenting manicured dashboards and reports. Below is a table illustrating data that could be considered in rough shape. Today, we'll share some tactics you can add to your arsenal to efficiently manage the invisible work that goes into priming data for a dashboard.
Visual Studio Code is a powerful Integrated Development Environment (IDE) that makes it easy to write, test, and deploy code. To get started, simply download the latest version from the official Microsoft website and follow the on-screen instructions.
Anaconda, or Conda, is a popular open-source package manager and environment management system that simplifies the process of installing, running, and updating Python packages. You can download the Anaconda distribution, which includes Conda, Python, and a variety of scientific computing libraries, from the Anaconda website.
We're going to skip the installation of the two since they're fairly straightforward, and instead pick up inside our IDE.
I'll be the first to admit that playing the role of data custodian is not fun. It's prone to mistakes while necessitating a peculiar breed of patience & diligence. But that's not the reason we're here, right? As you progress through this exercise, keep a running tally of how long the same cleanup would take you in Excel. No pain, no pleasure, and the juice is worth the squeeze.
Pandas is a powerful data manipulation library in Python, and Jupyter is an interactive notebook environment that allows you to run code and combine your output with markdown to help interpret data. Humor me with this because I really love metaphors.
Imagine Python as the master chef in a hyper efficient kitchen, and pandas as their skilled sous-chef. Together, they transform raw ingredients (data) into exquisite dishes (insights). Jupyter can be seen as the dining room where the master chef and sous-chef present their culinary creations to the guests (data scientists and analysts).
Python: The master chef, overseeing the entire process, ensuring that every step is executed with precision. It handles the broad tasks, directs the flow of the kitchen, and integrates various techniques and tools to create a seamless culinary experience.
pandas: The sous-chef, focusing on the meticulous preparation and initial processing of ingredients. It chops, dices, and preps the data, making it ready for the master chef to work their magic. Pandas ensures that the data is clean, organized, and in the perfect state for further analysis.
Jupyter Notebook: The dining room, designed for showcasing the culinary (data) creations in an interactive and visually appealing manner. It’s where the master chef presents each dish step-by-step, allowing the guests to appreciate the process, the ingredients, and the final product.
The terminal is your machine's command-line interface, enabling direct interaction with your computer's operating system. Whether you're using a Windows or Mac machine, the terminal provides a versatile platform for executing various commands, including installing Python libraries such as the data analysis tools Pandas and Jupyter Notebook.
Pip is a powerful package manager for Python, enabling easy installation and management of libraries.
Libraries are collections of pre-written code that extend Python's functionality.
Virtual Environments (venv) isolate project dependencies, ensuring consistent development and deployment.
If there were ever a good time to stretch, this is the moment for any ritualistic urges to flourish. And if you're still with me and have a Jupyter notebook open, we can import pandas and parse our CSV into a dataframe. We're going to draw from a more real-world example for this:
"Hi pandas, I'm going to call you pd"
"pd, can you try to create a dataframe,df
. I need it filled with everything contained in this csv. Then show me the first 5 rows usingdf.head()
to validate?"
"I'm not saying something is going to go wrong (except Exception as e:
), but if it does, let me know:print(e)
what did"
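Put a little less conversationally, that exchange translates to something like the sketch below (the file path matches the one you'll see in the traceback further down; point it at wherever your own CSV lives):

import pandas as pd

try:
    # "pd, create a dataframe, df, filled with everything in this csv"
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
    # "then show me the first 5 rows to validate"
    print(df.head())
except Exception as e:
    # "if something goes wrong, let me know what did"
    print(e)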
Lo and behold, we get an error. But we didn't even look at our data, did we?
Barring a few exceptions, the majority of the libraries & modules that Python brings to the table are imported into your script. Remember to maintain environments if your script has any dependencies such as pandas, to prevent any boo-boos.
The as pd part of our statement tells Python that we will be abbreviating pandas as pd, simultaneously opening up all of pandas' methods to us.
Try-blocks provide a structured approach to error handling, allowing you to anticipate and gracefully manage potential issues that may arise during program execution.
By encapsulating code within a try-block, you can proactively address and resolve errors, ensuring a seamless user experience and maintaining the integrity of your application.
head() is a method specific to pandas dataframes that becomes available after reading a CSV into the object we defined as df. Its output is a snapshot of the first 5 rows in a table.
Why this is important is because of Python's Global Interpreter Lock (GIL), a fancy way of saying Python can only 'do one thing at a time.'
When dealing with CSV files, it's crucial to understand the format and delimiters used in the data. In this scenario, the CSV file is tab-delimited, which means that tabs (\t) separate the columns instead of the usual commas. This can sometimes lead to issues when reading the file if the delimiter is not correctly specified. In our case, that's exactly what happened:
ParserError Traceback (most recent call last)
Cell In[6], line 3
1 import pandas as pd
----> 3 df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
4 df.head()
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\readers.py:626, in _read(filepath_or_buffer, kwds)
623 return parser
625 with parser:
--> 626 return parser.read(nrows)
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\readers.py:1923, in TextFileReader.read(self, nrows)
1916 nrows = validate_integer("nrows", nrows)
1917 try:
1918 # error: "ParserBase" has no attribute "read"
1919 (
1920 index,
1921 columns,
1922 col_dict,
-> 1923 ) = self._engine.read( # type: ignore[attr-defined]
1924 nrows
1925 )
1926 except Exception:
1927 self.close()
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
232 try:
233 if self.low_memory:
--> 234 chunks = self._reader.read_low_memory(nrows)
235 # destructive to chunks
236 data = _concatenate_chunks(chunks)
File parsers.pyx:838, in pandas._libs.parsers.TextReader.read_low_memory()
File parsers.pyx:905, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:874, in pandas._libs.parsers.TextReader._tokenize_rows()
File parsers.pyx:891, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2061, in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 11, saw 5
Inspecting the CSV file through Excel or a text editor reveals that the data is separated by tabs. This inspection can also surface various anomalies such as inconsistent formatting, special HTML characters, and mixed data types within columns. To properly read the tab-delimited data, you can use the pandas library in Python, specifying the tab separator with sep='\t'. This ensures that the data is correctly parsed into columns.
Let's reconcile our code based on the feedback stemming from the errors:
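Specifying the delimiter is a one-argument change. A minimal sketch of the reconciled read (the same try-block as before, with the tab separator added):

import pandas as pd

try:
    # Tell pandas the columns are separated by tabs rather than commas
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t')
    print(df.head())
except Exception as e:
    print(e)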
Not only will GPT clean your code, but you can even circumvent much of the debugging by having it generate functions: predetermined sets of instructions that (ideally) each complete one task.
To rewind back to the idea of Python being aesthetically syntactical, we can observe some patterns in the script that support its readability. Instead of occupying space with comments, the code explains itself.
As if our yield wasn't lucrative enough, this entire script is written as a series of functions. This reinforces the philosophy of modularity & reusability, as our functions can now be called from other scripts.
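As a rough illustration of that modular style (the function and column names below are hypothetical stand-ins, not the exact ones from my script), a single-purpose cleaning function might look like this:

import pandas as pd

def clean_spend_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a tidied copy of the raw spend data."""
    df = df.drop_duplicates().copy()
    # Normalize the column headers
    df.columns = [col.strip() for col in df.columns]
    # Strip whitespace and leftover HTML entities from text columns
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.replace('&amp;', '&', regex=False).str.strip()
    # Coerce the (hypothetical) Amount column to numeric; anything unparseable becomes NaN
    if 'Amount' in df.columns:
        df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')
    return df

Because it's a plain function, the same cleaning step can later be imported by the database and Streamlit scripts instead of being rewritten.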
If you are handling transaction data like this, there is a valid concern of mishandling or disclosure when using modern tools such as ChatGPT. And for all intents & purposes, I agree that no one should be blindly uploading sensitive data to the chat interface. However, we can exploit the chat interface's value while insulating our sensitive data using Python.
By defining functions and variables that interact with our data, and then having ChatGPT engage with our code instead, we've effectively created a cipher that masks our privileged information from any data harvesting or model-training pipelines.
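To make that concrete with a toy example (the function and column names here are placeholders): the prompt you hand to ChatGPT contains only the code, while the dataframe it operates on never leaves your machine.

def top_vendors(df, vendor_col='Vendor', amount_col='Amount', n=10):
    """Total spend per vendor, largest first; only this structure is ever shared, never the rows."""
    return df.groupby(vendor_col)[amount_col].sum().sort_values(ascending=False).head(n)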
For our database, SQLite is a lightweight, disk-based database that doesn't require a server setup. This script consolidates all of the cleansing efforts from Jupyter, initializes our database, and loads our dataframe as a table.
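The gist of that step, sketched with a hypothetical database file and table name:

import sqlite3
import pandas as pd

# In practice this would be the cleaned dataframe from the Jupyter steps;
# re-reading the CSV here just keeps the sketch self-contained
df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t')

conn = sqlite3.connect('spend.db')                           # creates the file if it doesn't exist
df.to_sql('spend', conn, if_exists='replace', index=False)   # write the dataframe as a table
conn.close()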
Streamlit is an open-source app framework generally used by Machine Learning and Data Science teams. I don't think they should keep it to themselves because it's the personification of a chef's kiss: beautiful, custom web apps to interpret data with minimal effort. That said, while Streamlit is fantastic for simple, single-page applications, it may not be the best choice for complex, multi-page applications. For more intricate projects, other frameworks might be more suitable.
We're going to combine everything we've stepped through and create a dataframe using what is now stored in our database:
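A minimal sketch of that app, reusing the hypothetical file, table, and column names from above (save it as something like app.py and launch it with streamlit run app.py):

import sqlite3
import pandas as pd
import streamlit as st

st.title('Spend Dashboard')

# Pull the cleaned table back out of SQLite into a dataframe
conn = sqlite3.connect('spend.db')
df = pd.read_sql('SELECT * FROM spend', conn)
conn.close()

st.dataframe(df)  # interactive, sortable table

# One quick visualization, assuming the hypothetical Vendor and Amount columns exist
st.bar_chart(df.groupby('Vendor')['Amount'].sum())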
The script above connects to our SQLite database, loads the cleaned table back into a dataframe, and hands it to Streamlit to render as an interactive table and chart.
And voilà. This shouldn't take you more than 30 minutes to complete, at a $70 discount over Tableau (which still requires cleaning data!).
I’ve posted this entire walkthrough to GitHub, where you can copy the repository and navigate step by step through the exercises above.
Outside of an overarching GPT subscription, this cost absolutely nothing to reproduce.
Feel free to post questions in the comments, on GitHub, or email me directly at [email protected]