As an aside: I’m still trying to figure out an identity for this here corner of the internet. The feedback on the previous article was galvanizing and cemented a commitment to producing quality content. Feedback is welcomed and your voices are heard. Most of all, thanks for finding your way back to Things with Stuff.
Prerequisites
- Visual Studio Code
- Python ver. 3.8+
- AI Assistant (Claude 3.5 Sonnet > GPT-4o at the time of writing)
Introduction
I started learning Python a little over a year ago, after learning that children across the world were being taught programming in grade school. It spurred a belief that a conversational wherewithal with programming, even just in pseudo-code, was becoming a generational prerequisite.
Pseudo-Code: Pseudocode is an informal, high-level description of a computer program’s logic. The phrase “if this, then that” is a fairly accurate representation of an application’s work instructions. For example, IF weather = rain, THEN backpack contains umbrella, the umbrella being our THAT in this case.

And yet, I’m stuck between belief and doubt regarding the amount of savvy one needs to thrive in this world. While Nvidia CEO Jensen Huang thinks AI will make coding obsolete, there’s an abundance of evidence that suggests otherwise.
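Sticking with the umbrella for a second, that same IF/THEN reads like this as a few lines of Python; the values here are just stand-ins:

weather = "rain"  # stand-in for whatever our data source reports

# IF weather = rain, THEN the backpack contains an umbrella (our THAT)
if weather == "rain":
    backpack = ["umbrella"]
else:
    backpack = []

print(backpack)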
Nonetheless, I don’t regret anything. The incalculable wealth of knowledge and experience across the internet has brought me to the philosophy that “if you can conceive of it (within reason), you can build it”. The abundance of open-source software and public GitHub repositories facilitates a sort of contradiction where solutions to many problems are abundant, but none of them are your solution. The result becomes a gauntlet of critical thinking & problem solving triggering a satisfying dopamine rush at the finish line.
Your mileage may vary, and you may not find the same degree of near-masochistic mental stimulation. But at the very least, I can put myself through the wringer and struggle through the hard parts so you don’t have to.
Context
Unstructured or dirty data is as commonplace as the litter occupying our city streets and gutters. The pretty dashboards you see in the wild are often sourced from data characterized by a lack of organization, consistency, and cleanliness. Typical traits include mixed data types (which I’ve come to refer to as ‘Jellybeans’), duplicates, and missing or incorrectly formatted values. While mishandling is typically the prime enabler of dirty data, there are many scenarios where it’s unavoidable, such as:
- Customer Feedback: Free-form text entries from surveys, reviews, and social media comments. These can contain misspellings, slang, and varying levels of detail.
- Email Communications: Texts from emails which can be lengthy, contain nested threads, attachments, and varying formats that make automated parsing challenging.
- Scanned Documents: PDFs or images of documents that require OCR (Optical Character Recognition) to convert into text, often resulting in errors and inconsistencies.
- Web Scraping: Data extracted from websites that often comes in an unstructured format, with HTML tags, scripts, and styles that need to be cleaned and parsed.
Understandably, many people may not consider the janitorial effort behind manicured dashboards and reports. Below is a table illustrating data in rough shape. Today, we’ll share some tactics you can add to your arsenal to efficiently manage the invisible work that goes into priming data for a dashboard.
| ID | Name | Email | Age | Join Date | Department | Salary |
|---|---|---|---|---|---|---|
| 1 | John Doe | john.doe@example.com | 29 | 2021-06-15 | Sales | $50000 |
| 2 | Jane Smith | janesmithATexample.com | 34 | 2015-07-20 | Marketing | 62000 |
| 3 | Sam O’Neil | sam.oneil@example | N/A | 2020-11-03 | IT | 70,000 |
| 4 | Mark Brown | mark.brown@example.com | 41 | 2021-09-23 | HR | 68000 |
| 5 | Emily Johnson | emily.johnson@.com | 27 | 2022-02-30 | IT | -45000 |
| 6 | Michael Lee | | 25 | 2020-14-12 | Engineering | 72.000 |
| 7 | Robert Brown | robert.brown@example.com | 45 | 2021-06-15 | Sales | N/A |
Toolkit
Installing Visual Studio Code, Conda, and Python
Visual Studio Code is a powerful Integrated Development Environment (IDE) that makes it easy to write, test, and deploy code. To get started, simply download the latest version from the official Microsoft website and follow the on-screen instructions.
Anaconda, or Conda, is a popular open-source package manager and environment management system that simplifies the process of installing, running, and updating Python packages. You can download the Anaconda distribution, which includes Conda, Python, and a variety of scientific computing libraries, from the Anaconda website.
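If you go the Conda route, creating and activating an isolated environment for this project looks something like this (the environment name is arbitrary):

conda create -n spend-cleaning python=3.11
conda activate spend-cleaning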
We’re going to skip the installation of the two since they are fairly straightforward, and instead pick up inside our Integrated Development Environment, or IDE.
Cleaning & Staging Our Data
I’ll be the first to admit that playing the role of data custodian is not fun. It’s prone to mistakes while demanding a peculiar breed of patience & diligence. But that’s not the reason we’re here, right? As you progress through this exercise, keep a running tally of how long the same cleanup would take in Excel. No pain, no pleasure, and the juice is worth the squeeze.
Pandas & Jupyter
Pandas is a powerful data manipulation library in Python, and Jupyter is an interactive notebook environment that allows you to run code and combine your output with markdown to help interpret data. Humor me with this because I really love metaphors.
Imagine Python as the master chef in a hyper-efficient kitchen, and pandas as their skilled sous-chef. Together, they transform raw ingredients (data) into exquisite dishes (insights). Jupyter can be seen as the dining room where the master chef and sous-chef present their culinary creations to the guests (data scientists and analysts).
- Python: The master chef, overseeing the entire process, ensuring that every step is executed with precision. It handles the broad tasks, directs the flow of the kitchen, and integrates various techniques and tools to create a seamless culinary experience.
- pandas: The sous-chef, focusing on the meticulous preparation and initial processing of ingredients. It chops, dices, and preps the data, making it ready for the master chef to work their magic. Pandas ensures that the data is clean, organized, and in the perfect state for further analysis.
- Jupyter Notebook: The dining room, designed for showcasing the culinary (data) creations in an interactive and visually appealing manner. It’s where the master chef presents each dish step-by-step, allowing the guests to appreciate the process, the ingredients, and the final product.
The terminal is your machine’s command-line interface, enabling direct interaction with your computer’s operating system. Whether you’re using a Windows or Mac machine, the terminal provides a versatile platform for executing commands, including installing Python libraries such as the data analysis tools pandas and Jupyter Notebook.
pip install pandas jupyter
Pip, Libraries, and Virtual Environments
Pip is a powerful package manager for Python, enabling easy installation and management of libraries.
Libraries are collections of pre-written code that extend Python’s functionality.
Virtual Environments (venv) isolate project dependencies, ensuring consistent development and deployment.
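Put together, a minimal sketch of that workflow from the terminal, assuming you’re starting inside your project folder:

# Create and activate a virtual environment, then install the libraries into it
python -m venv .venv
.venv\Scripts\activate          # Windows
source .venv/bin/activate       # macOS / Linux (use one or the other)
pip install pandas jupyter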
If there were ever a good time to stretch, this is the moment any ritualistic urges should flourish. And if you’re still with me and have a Jupyter notebook open, we can import pandas and parse our CSV into a dataframe. We’re going to draw from a more real-world example for this:
import pandas as pd

try:
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
    df.head()
except Exception as e:
    print(e)
Error tokenizing data. C error: Expected 1 fields in line 11, saw 5
If I were to translate this into human-speak, it would resemble something like:
“Hi pandas, I’m going to call you pd.”
“pd, can you try to create a dataframe, df? I need it filled with everything contained in this CSV. Then show me the first 5 rows using df.head() to validate.”
“I’m not saying something is going to go wrong (except Exception as e:), but if it does, let me know what did: print(e).”
Lo’ and behold, we get an error. But we didn’t even look at our data did we?
Error Handling: Exceptions, as the connotation suggests, are a means to manage the routing of our application logic, establishing the equivalent of checks, stops, and bypasses for occurrences that would otherwise result in a failure.

Breakdown
Imports
import pandas as pd
Barring a few exceptions, the majority of the libraries and modules that Python brings to the table are imported into your script. Remember to maintain your environments if your script has dependencies such as pandas, to prevent any boo-boos.
The as pd part of our statement signals to our IDE that we will be abbreviating pandas to pd, simultaneously opening up all of pandas’ methods to us.
Try Blocks
try:
df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
except Exception as e:
print(e)
Try-blocks provide a structured approach to error handling, allowing you to anticipate and gracefully manage potential issues that may arise during program execution.
By encapsulating code within a try-block, you can proactively address and resolve errors, ensuring a seamless user experience and maintaining the integrity of your application.
df.head() & print(e)
df.head()
print(e)
head() is a method specific to pandas dataframes that becomes available after reading a CSV into the object we defined as df. Its output is a snapshot of the first 5 rows of the table.
Why this matters comes down to Python doing one thing at a time, for which the Global Interpreter Lock (GIL) is the very fancy name: df.head() only runs if the read succeeded, and print(e) only runs if it didn’t.
When dealing with CSV files, it’s crucial to understand the format and delimiters used in the data. In this scenario, the CSV file is tab-delimited, which means that tabs (\t) separate the columns instead of the usual commas. This can lead to issues when reading the file if the delimiter is not correctly specified.
The full error displayed in Terminal
---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
Cell In[6], line 3
1 import pandas as pd
----> 3 df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv')
4 df.head()
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
1013 kwds_defaults = _refine_defaults_read(
1014 dialect,
1015 delimiter,
(...)
1022 dtype_backend=dtype_backend,
1023 )
1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\readers.py:626, in _read(filepath_or_buffer, kwds)
623 return parser
625 with parser:
--> 626 return parser.read(nrows)
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\readers.py:1923, in TextFileReader.read(self, nrows)
1916 nrows = validate_integer("nrows", nrows)
1917 try:
1918 # error: "ParserBase" has no attribute "read"
1919 (
1920 index,
1921 columns,
1922 col_dict,
-> 1923 ) = self._engine.read( # type: ignore[attr-defined]
1924 nrows
1925 )
1926 except Exception:
1927 self.close()
File E:\FuzzyMatcher\.venv\spend_asseessment\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
232 try:
233 if self.low_memory:
--> 234 chunks = self._reader.read_low_memory(nrows)
235 # destructive to chunks
236 data = _concatenate_chunks(chunks)
File parsers.pyx:838, in pandas._libs.parsers.TextReader.read_low_memory()
File parsers.pyx:905, in pandas._libs.parsers.TextReader._read_rows()
File parsers.pyx:874, in pandas._libs.parsers.TextReader._tokenize_rows()
File parsers.pyx:891, in pandas._libs.parsers.TextReader._check_tokenize_status()
File parsers.pyx:2061, in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 11, saw 5
Inspecting the CSV file through Excel or a text editor reveals that the data is separated by tabs. This inspection can also surface anomalies such as inconsistent formatting, special HTML characters, and mixed data types within columns. To properly read the tab-delimited data, you can use the pandas library in Python and specify the tab separator, sep='\t'. This ensures that the data is correctly parsed into columns.
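You can also peek at the raw file from Python itself before handing it over to pandas. A minimal sketch, assuming the same file path used above (and that the file is UTF-8 encoded):

# Print the first few raw lines; repr() makes the tab characters (\t) visible
with open('../data/Spend_Intake_010123_123123_071024.csv', encoding='utf-8') as f:
    for _ in range(3):
        print(repr(f.readline()))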
Excerpt of Raw Data
Heat & Air Voucher -14.27 713.5100 USD 1.00 EA 0132139 Supplier Invoice Number 2023-01-03 0 fitting
Let’s reconcile our code based on the feedback stemming from the errors:
import pandas as pd

try:
    df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t')
    df.head()
except Exception as e:
    print(e)
C:\Users\Trav\AppData\Local\Temp\ipykernel_104604048537430.py:4: DtypeWarning: Columns (10,12,50) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t')
More errors! Strictly speaking, this one is a warning rather than an error, but either way we rinse & repeat, starting from diagnosis.
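The warning even spells out our options: declare the offending columns’ types up front, or let pandas read the whole file in a single pass. A quick sketch of both; the column names in the dtype mapping are hypothetical stand-ins for whatever columns 10, 12, and 50 turn out to be in your file:

# Option 1: read the file in one pass so pandas infers a single type per column
df = pd.read_csv('../data/Spend_Intake_010123_123123_071024.csv', sep='\t', low_memory=False)

# Option 2: pin the mixed columns to an explicit type (column names are hypothetical)
df = pd.read_csv(
    '../data/Spend_Intake_010123_123123_071024.csv',
    sep='\t',
    dtype={'gl_code': str, 'supplier_invoice_number': str, 'unit_of_measure': str},
)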
Work Smarter, Not Harder
If you are frustrated already, the good news is that coding might not be for you. The even better news is that coding has never been more accessible thanks to the prevalence of AI and large language models. For tasks on the low-complexity end, it’s entirely possible to copy and paste your error messages straight out of the console and into a chat prompt. Never has there been a better time to say what a time it is to be alive.
Generative AI for Error Handling & Code Writing
Not only will GPT clean your code, but you can often skip the debugging altogether by having it generate functions: a predetermined set of instructions that (ideally) completes one task.
To rewind back to the idea of Python being aesthetically syntactical, we can observe some patterns in the generated script that support its readability. Instead of cluttering the file with comments, the code largely explains itself.
As if our yield weren’t lucrative enough, this entire script is written as a series of functions. This perpetuates a philosophy of modularity & recyclability, since our functions can now be called from other scripts.
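The real cleaning functions live in the repo, but as a flavor of what “one function, one task” looks like, here’s a hypothetical helper for the kind of messy currency values we saw in the table earlier. The name and rules are illustrative, not the exact script:

import pandas as pd

def clean_amount(value):
    """Convert messy strings like '$50000', '70,000', or 'N/A' to floats (None if unparseable)."""
    if pd.isna(value):
        return None
    text = str(value).replace('$', '').replace(',', '').strip()
    try:
        return float(text)
    except ValueError:
        return None

# Example usage against the illustrative table from earlier
# df['Salary'] = df['Salary'].apply(clean_amount)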
Insulation of Sensitive Data
If you are handling transaction data like this, there is a valid concern of mishandling or disclosure when using modern tools such as ChatGPT. And for all intents & purposes, I agree that no one should be blindly uploading sensitive data to the chat interface. However, we can exploit the chat interface’s value while insulating our sensitive data using Python.
By defining functions and variables that interact with our data, and then having ChatGPT engage with our code instead, we’ve effectively created a cipher that masks our privileged information from whatever data harvesting or model-training retention protocols may apply.
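A rough sketch of the idea, where the column names and masking rule are made up purely for illustration: the skeleton below is what gets pasted into the chat, while the real dataframe never leaves your machine.

import pandas as pd

SENSITIVE_COLUMNS = ['vendor_name', 'supplier_invoice_number']  # hypothetical column names

def redact(df):
    """Return a copy of the dataframe with sensitive columns replaced by opaque tokens."""
    masked = df.copy()
    for col in SENSITIVE_COLUMNS:
        if col in masked.columns:
            masked[col] = 'REDACTED_' + masked[col].astype('category').cat.codes.astype(str)
    return masked

# Only masked samples or summaries ever make it into a prompt
# masked_preview = redact(df).head()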
Reviewing Our Cleansed Data in Jupyter
Scrubbed Confidential Data: There are a few steps intentionally excluded from the flow above. This is to encourage a little bit of connecting the dots. But to err on the side of caution, we’re going to manufacture an output using the library Faker. Instructions to replicate this entire exercise will be consolidated at the end of this article.
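For the curious, generating a stand-in transactions.csv with Faker looks roughly like this. The column names mirror the ones used below; the row count, amounts, and descriptions are arbitrary:

import csv
import random
from faker import Faker

fake = Faker()

with open('transactions.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'date', 'vendor_name', 'amount', 'currency',
                     'item_description', 'gl_code', 'gl_description'])
    for i in range(1, 101):
        writer.writerow([
            i,
            fake.date_between(start_date='-2y', end_date='today').isoformat(),
            fake.company(),
            round(random.uniform(10, 5000), 2),
            random.choice(['USD', 'GBP', 'AUD', 'JPY']),
            fake.sentence(nb_words=6),
            fake.random_number(digits=8, fix_len=True),
            random.choice(['Travel', 'Shopping', 'Office Supplies']),
        ])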
Jupyter Output

import pandas as pd

column_names = ['id', 'date', 'vendor_name', 'amount', 'currency',
                'item_description', 'gl_code', 'gl_description']

df = pd.read_csv('transactions.csv', names=column_names, header=0)
df.head()

Out[6]:
| | id | date | vendor_name | amount | currency | item_description | gl_code | gl_description |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2022-11-06 | Harding Inc | 4621.52 | AUD | Answer account which exist reality share. | 25202669 | Travel |
| 1 | 3 | 2024-01-22 | Meyer LLC | 4478.16 | USD | Year important account despite alone. | 44034427 | Shopping |
| 2 | 4 | 2023-07-12 | Sandoval, Gallagher and Kelley | 1349.55 | JPY | Find which boy try. | 92539337 | Travel |
| 3 | 5 | 2024-04-23 | Mooney PLC | 2785.77 | GBP | We author this compare small purpose. | 20214377 | Travel |
| 4 | 6 | 2024-07-11 | Duncan and Sons | 2759.27 | GBP | Wall eight crime daughter prove. | 83986396 | Travel |
Database Storage
We are going to complete the staging step by moving our dataframe into a database. But why a database? Databases are efficient for securely handling large datasets because they support fast queries and enforce data integrity. We also don’t have to manage input/output conflicts or the state of a CSV or XLSX file; god forbid multiple people edit it in parallel.
For our database, we’ll use SQLite, a lightweight, disk-based database that doesn’t require a server setup. This script consolidates all of the cleansing efforts from Jupyter, initializes our database, and loads our dataframe as a table.
import sqlite3
def store_data_in_db(df, db_file_path, table_name):
    # Connect to the SQLite database
    conn = sqlite3.connect(db_file_path)
    cursor = conn.cursor()

    # Store DataFrame to SQLite table
    df.to_sql(table_name, conn, if_exists='replace', index=False)

    # Commit changes and close connection
    conn.commit()
    conn.close()

    print(f"Cleaned data has been successfully stored in the database '{db_file_path}' in the table '{table_name}'.")

# Store the cleaned data
store_data_in_db(df_cleaned, 'cleaned_data.db', 'cleaned_transaction_data')
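To make the “fast queries” point tangible, once the table exists you can ask the database questions directly instead of re-reading any CSV. A small sketch; the aggregation itself is just an example, but the database and table names match the script above:

import sqlite3

conn = sqlite3.connect('cleaned_data.db')
cursor = conn.cursor()

# Total spend per currency, computed by the database rather than in pandas
cursor.execute("""
    SELECT currency, ROUND(SUM(amount), 2) AS total_spend
    FROM cleaned_transaction_data
    GROUP BY currency
    ORDER BY total_spend DESC
""")
for currency, total_spend in cursor.fetchall():
    print(currency, total_spend)

conn.close()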
Empowering Data with Purpose
Streamlit
Streamlit is an open-source app framework generally used by machine learning and data science teams. I don’t think they should keep it to themselves, because it’s the embodiment of a chef’s kiss: beautiful, custom web apps to interpret data with minimal effort. That said, while Streamlit is fantastic for simple, single-page applications, it may not be the best choice for complex, multi-page projects; for more intricate work, other frameworks might be more suitable. We’ll pair it with Pygwalker, a library that turns a pandas DataFrame into a drag-and-drop, Tableau-style exploration interface.
Key Features
- Interactive Visualizations: Pygwalker provides a wide range of interactive charts and graphs that can be customized to fit your data exploration needs.
- Ease of Use: The library is designed to be intuitive and easy to use, allowing users to create visualizations with minimal code (see the short sketch after this list).
- Integration with Pandas: Pygwalker integrates seamlessly with Pandas, the popular data manipulation library in Python, making it easy to visualize data directly from Pandas DataFrames.
- Customizability: It offers extensive customization options, allowing users to tweak the appearance and behavior of their charts to suit their preferences.
- Web-Based Visualization: The visualizations created with Pygwalker are rendered as web components, which means they can be easily embedded in web applications and shared with others.
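That “minimal code” claim is easy to verify. In a notebook, with the dataframe from earlier in hand, the most basic usage is essentially two lines:

import pygwalker as pyg

# Opens the drag-and-drop exploration UI inline in the notebook
pyg.walk(df)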
Use Cases
- Data Exploration: Quickly create visual representations of your data to identify trends, patterns, and outliers.
- Reporting: Generate interactive reports that can be shared with stakeholders.
- Presentations: Enhance your presentations with dynamic and engaging visualizations.
- Dashboards: Build interactive dashboards to monitor key metrics and KPIs.
How do we use it?
We’re going to combine everything we’ve stepped through and create a dataframe using what is now stored in our database:
import pandas as pd
import sqlite3
import streamlit as st
import pygwalker as pyg
def load_data_from_db(db_file_path, table_name):
    # Connect to the SQLite database
    conn = sqlite3.connect(db_file_path)

    # Load data from the specified table
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)

    # Close the database connection
    conn.close()

    return df
# Load cleaned data from the database
df_from_db = load_data_from_db('cleaned_data.db', 'cleaned_transaction_data')
# Create a Pygwalker dashboard with Streamlit
st.title("Transaction Data Dashboard")
# Display the dataframe
st.write(df_from_db)
# Create a Pygwalker dashboard
# (pyg.walk() targets notebooks; inside Streamlit we render Pygwalker's HTML instead.
#  Newer pygwalker versions also offer a dedicated Streamlit renderer.)
import streamlit.components.v1 as components

pyg_html = pyg.to_html(df_from_db)
components.html(pyg_html, height=900, scrolling=True)

# No __main__ hook is needed; the app is launched from the terminal with streamlit run app.py
The script above does the following:
- Loads Data: It defines a function to load cleaned transaction data from an SQLite database into a pandas DataFrame.
- Displays Data: It uses Streamlit to create a web app that displays the DataFrame.
- Creates Dashboard: It leverages Pygwalker to generate an interactive dashboard from the data.
- Runs App: You launch the Streamlit app from the terminal, which serves the interactive interface for data visualization.
streamlit run app.py
And voilà. This shouldn’t take you more than 30 minutes to complete, at a $70 discount over Tableau (which still requires cleaning your data!).
FAQ
How can I replicate this step by step without the commentary?
I’ve posted this entire walkthrough to GitHub, where you can copy the repository and navigate step by step through the exercises above.
Did this cost anything?
Outside of an overarching GPT subscription, this cost absolutely nothing to reproduce.
What if I have questions?
Feel free to post questions in the comments, on GitHub, or email me directly at travis@vasceannie.com.