Quick summary of data cleaning:
Ask yourself these questions next time you clean data
What is in your data and what's not?
Is it reliable and does it make sense?
Is it consistent in categories, measurements and standards?
Any typo?
While you go through the checklist, remember to always document your data cleaning steps for reproducibility and auditing purposes. There may be preferred methods for cleaning data in certain conditions, planning ahead will save you time and energy when specific issues present. Testing and validation should be conducted after the cleaning is completed.
Data Mapping:
Data mapping is a critical process in a variety of data-related activities. It is the process of linking data from one information system to the equivalent data in another system, such as data migration, data integration, and data warehousing. This implies, however, your data size is getting bigger as you map more datasets together. It's important to always be prepared for human errors in the process and practice back up and using a change log to track your action. In general, these are the main steps you will want to follow when performing data mapping:
Identifying the data to be mapped
Understanding the source and target data schemas
Creating the mapping
Testing the mapping
Executing the mapping
Let's walk through them together, say there's a old_customers
table that you want to map to a new_customers
table:
Select all fields from
old_customers
table.SELECT * FROM old_customers;
Review schema of
old_customers
andnew_customers
via Database Management System's schema management toolsCreating the mapping: Create SQL query to map and transform the data
SELECT old_id AS new_id, UPPER(old_name) AS new_name FROM old_customers;
Test the mapping query with a limited set of data.
/* The query is same as above but test by only returning a small subset */ SELECT old_id AS new_id, UPPER(old_name) AS new_name FROM old_customers LIMIT 10;
Execute the mapping and insert data into
new_customers
.INSERT INTO new_customers (new_id, new_name) SELECT old_id, UPPER(old_name) FROM old_customers;
This can be confusing since the code doesn't show any direct link between the old and new data, but pay attention to these key concepts:
When you use
INSERT INTO table (column1, column2) SELECT ...
syntax, thecolumn1
,column2
part should use the actual column names from the destination table. This means inside yournew_customers
table there will be a new_id and new_name columns after you run the code.Step 5, the execution of the mapping itself is isolated from steps 1-4. The first 4 steps are used for planning and verification to maintain data integrity. Step 5 is the culmination of these previous steps.
The code essentially says:
'Take theold_id
andold_name
(which will be transformed to upper case in the process) values from theold_customers
table, and insert them into thenew_id
column in thenew_customers
table.'
SQL: ISNULL( ) vs COALESCE( ):
I want to share one or two SQL functions every week going forward, this week we'll start with ISNULL() and COALESECE(). Both are SQL functions that serve similar purposes, both allow you to handle NULL values in SQL. However, there are some differences between them:
Arguments:
ISNULL()
takes only two arguments, the value to check and the value to return if the first is NULL.COALESCE()
, on the other hand, can take two or more arguments, and returns the first non-NULL argument in the list.Example:
SELECT ISNULL(age, 'No data') FROM customer; ------------------------------------------------------------------- SELECT COALESCE(first_name, middle_name, last_name, 'No data') FROM customer;
Explanation:
The first statement will return whatever that "is NULL" from theage
column and will instead return 'No data'. However, if the original value is notNULL
, it returns the original value.The second statement will return the first non-NULL value it encounters when it checks
first_name
,middle_name
andlast_name
in that order. If all three columns areNULL
, it will return'No data'
.Standardization:
ISNULL()
is specific to certain SQL dialects like Microsoft SQL Server and MySQL.COALESCE()
is a part of the SQL standard and is more widely supportedReturn Type:
ISNULL()
returns the same type as the first argument. In contrast,COALESCE()
returns the type with the highest precedence from the list of arguments. The precedence of data types is defined by the database management system (DBMS). Each DBMS has a specific data type precedence hierarchy. So without going too deep and complicated, just understand that it's a list from top to bottom, some data type has higher precedence than others.
ML Project - Personal Assistant:
This week I created a 2.0 version of my personal assistant with the LangChain tool in Python, the outcome is still not that great, but it's an improvement from my first iteration: Personal Assistant 1.0 (link to GitHub). The snippet below shows the very basics of the LangChain implementation and how it essentially takes care of the overhead, all you worry about is a literal "chain", a chain of actions. I will discuss more during my 3rd monthly update, finger-crossed version 3.0 would be completed by then.
template = """
You are my writing assistant.
Write about {c1}, {c2}, {c3}, {c4} and {c5} with explanations and examples, in a {project} style."
"""
llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
prompt = PromptTemplate(input_variables=["project", "c1", "c2", "c3", "c4", "c5"],
template=template
)
chain = LLMChain(llm=llm, prompt=prompt)
Speaking of GitHub, I dedicated some time this week to reorganizing my cluttered repository. What I discovered is that I had two different main branches, one in VS Code which is connected to my GitHub and a different one, master, in my remote GitHub repository. This explains a lot of the complications I'd been encountering. So, I opted for a straightforward solution and merged the branches. Moving forward, I plan on learning more about GitHub. This exploration will likely be summarized in a detailed blog post, sharing my experiences and newfound skills.
CS50:
I've been learning HTML, CSS and Javascript in week 8 for a week now. While I agree its indispensable role in web development, Javascript is not my cup of tea and I'll leave it at that. But I'm glad I'm close to the end of this course.
That's everything I have for this week, thanks for reading!