Week 12: Data Mapping & the NULL value in SQL

Week 12: Data Mapping & the NULL value in SQL

Quick summary of data cleaning:

Ask yourself these questions next time you clean data

  1. What is in your data and what's not?

  2. Is it reliable and does it make sense?

  3. Is it consistent in categories, measurements and standards?

  4. Any typo?

While you go through the checklist, remember to always document your data cleaning steps for reproducibility and auditing purposes. There may be preferred methods for cleaning data in certain conditions, planning ahead will save you time and energy when specific issues present. Testing and validation should be conducted after the cleaning is completed.

Data Mapping:

Data mapping is a critical process in a variety of data-related activities. It is the process of linking data from one information system to the equivalent data in another system, such as data migration, data integration, and data warehousing. This implies, however, your data size is getting bigger as you map more datasets together. It's important to always be prepared for human errors in the process and practice back up and using a change log to track your action. In general, these are the main steps you will want to follow when performing data mapping:

  1. Identifying the data to be mapped

  2. Understanding the source and target data schemas

  3. Creating the mapping

  4. Testing the mapping

  5. Executing the mapping

Let's walk through them together, say there's a old_customers table that you want to map to a new_customers table:

  1. Select all fields from old_customers table.

     SELECT * FROM old_customers;
    
  2. Review schema of old_customers and new_customers via Database Management System's schema management tools

  3. Creating the mapping: Create SQL query to map and transform the data

     SELECT old_id AS new_id, UPPER(old_name) AS new_name
     FROM old_customers;
    
  4. Test the mapping query with a limited set of data.

     /* The query is same as above but test by only returning a small subset */
     SELECT old_id AS new_id, UPPER(old_name) AS new_name
     FROM old_customers
     LIMIT 10;
    
  5. Execute the mapping and insert data into new_customers.

     INSERT INTO new_customers (new_id, new_name)
     SELECT old_id, UPPER(old_name) FROM old_customers;
    

    This can be confusing since the code doesn't show any direct link between the old and new data, but pay attention to these key concepts:

    1. When you use INSERT INTO table (column1, column2) SELECT ... syntax, the column1, column2 part should use the actual column names from the destination table. This means inside your new_customers table there will be a new_id and new_name columns after you run the code.

    2. Step 5, the execution of the mapping itself is isolated from steps 1-4. The first 4 steps are used for planning and verification to maintain data integrity. Step 5 is the culmination of these previous steps.

    3. The code essentially says:
      'Take the old_id and old_name (which will be transformed to upper case in the process) values from the old_customers table, and insert them into the new_id column in the new_customers table.'

SQL: ISNULL( ) vs COALESCE( ):

I want to share one or two SQL functions every week going forward, this week we'll start with ISNULL() and COALESECE(). Both are SQL functions that serve similar purposes, both allow you to handle NULL values in SQL. However, there are some differences between them:

  1. Arguments: ISNULL() takes only two arguments, the value to check and the value to return if the first is NULL. COALESCE(), on the other hand, can take two or more arguments, and returns the first non-NULL argument in the list.

    Example:

     SELECT ISNULL(age, 'No data')
     FROM customer; 
     -------------------------------------------------------------------
     SELECT COALESCE(first_name, middle_name, last_name, 'No data') 
     FROM customer;
    

    Explanation:
    The first statement will return whatever that "is NULL" from the age column and will instead return 'No data'. However, if the original value is not NULL, it returns the original value.

    The second statement will return the first non-NULL value it encounters when it checks first_name, middle_name and last_name in that order. If all three columns are NULL, it will return 'No data'.

  2. Standardization: ISNULL() is specific to certain SQL dialects like Microsoft SQL Server and MySQL. COALESCE() is a part of the SQL standard and is more widely supported

  3. Return Type: ISNULL() returns the same type as the first argument. In contrast, COALESCE() returns the type with the highest precedence from the list of arguments. The precedence of data types is defined by the database management system (DBMS). Each DBMS has a specific data type precedence hierarchy. So without going too deep and complicated, just understand that it's a list from top to bottom, some data type has higher precedence than others.

ML Project - Personal Assistant:

This week I created a 2.0 version of my personal assistant with the LangChain tool in Python, the outcome is still not that great, but it's an improvement from my first iteration: Personal Assistant 1.0 (link to GitHub). The snippet below shows the very basics of the LangChain implementation and how it essentially takes care of the overhead, all you worry about is a literal "chain", a chain of actions. I will discuss more during my 3rd monthly update, finger-crossed version 3.0 would be completed by then.

template = """
You are my writing assistant. 
Write about {c1}, {c2}, {c3}, {c4} and {c5} with explanations and examples, in a {project} style."
"""
llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
prompt = PromptTemplate(input_variables=["project", "c1", "c2", "c3", "c4", "c5"],
                        template=template
                        )
chain = LLMChain(llm=llm, prompt=prompt)

Speaking of GitHub, I dedicated some time this week to reorganizing my cluttered repository. What I discovered is that I had two different main branches, one in VS Code which is connected to my GitHub and a different one, master, in my remote GitHub repository. This explains a lot of the complications I'd been encountering. So, I opted for a straightforward solution and merged the branches. Moving forward, I plan on learning more about GitHub. This exploration will likely be summarized in a detailed blog post, sharing my experiences and newfound skills.

CS50:

I've been learning HTML, CSS and Javascript in week 8 for a week now. While I agree its indispensable role in web development, Javascript is not my cup of tea and I'll leave it at that. But I'm glad I'm close to the end of this course.

That's everything I have for this week, thanks for reading!