Day 11 - Machine Learning Engineer Journey

Today’s Tasks:
PowerBI Course, Intro to Power Query: Beginning to 'Column Profiling'
Data Manipulation in SQL: Chapter One, Video Two Review
Data Science for Business: Chapter Two, 'Business Understanding', pg 28
Practical SQL, Chapter Four: Beginning to 'Understanding Characters' pg 48
PowerBI: Intro to Power Query
From the assigned section on PowerBI, I was able to get an understanding of what the interface of Power Query looks like. Here are some notable features takeaways:
Editing datafile on Power Query does not affect the original CSV file
Easily remove applied changes on columns/dataset by deleting it off ‘Applied Steps’
Column Profiling is useful for examining data for errors & inconsistencies
Column Statistics: Summary of values in column
Distinct Values: Counts total number of different values in column
Unique Values: Counts values that appear only once in column
Value Distribution: Visualizes frequency of each value
Data Manipulation in SQL
Today was a review day on video two, chapter one of DataCamp’s Data Manipulation in SQL. Here is what I learned:
Objective:
“ Barcelona and Real Madrid have been rival teams for more than 80 years. Matches between these two teams are given the name El Clásico (The Classic). In this exercise, you will query a list of matches played between these two rivals. You will notice in Step 2 that when you have multiple logical conditions in a CASE statement, you may quickly end up with a large number of WHEN clauses to logically test every outcome you are interested in. It's important to make sure you don't accidentally exclude key information in your ELSE clause. In this exercise, you will retrieve information about matches played between Barcelona (id = 8634) and Real Madrid (id = 8633 ). Note that the query you are provided with already identifies the Clásico matches using a filter in the WHERE clause “
How I solved it:
I reorganized the objective to know what it’s really asking me to do:
Need to return list of El Classico matches between Barcelona, id = 8634, & Madrid, id = 8633. Use CASE to label ‘home_team’ either ‘Barcelona’ or ‘Madrid’. Then, do the same for ‘away_team’
Here is the SELECT statement using CASE:
SELECT date,
CASE WHEN hometeam_id = 8634 THEN ‘FC Barcelona’
WHEN hometeam_id = 8633 THEN ‘Real Madrid CF’
ELSE ‘Other’ END AS home_team,
CASE WHEN awayteam_id = 8634 THEN ‘FC Barcelona’
WHEN awayteam_id = 8633 THEN ‘Real Madrid CF’
ELSE ‘Other’ END AS away_team,
FROM matches_spain
WHERE (hometeam_id = 8634 OR hometeam_id = 8633) AND
(awayteam_id = 8634 OR awayteam_id = 8633)
Breakdown:
SELECT clause: Select clause creates the columns. In this instance, we were asked to return three columns: Date, home_team, & away_team. It’s important to realize that the CASE statement in this case is creating different columns. That was something I did not catch the first time. I did not initially notice that the CASE statement was creating the columns ‘home_team’ & ‘away_team’. Basically, what’s happening here is given either id 8634 or 8633, it will define its appropriate team as the home team depending on the match. So if the match says that home team was '8634, then now we know that it’s referring to ‘FC Barcelona’, so when we run the query, instead of showing 8634 as the ‘name’ of the team, now it will display the actual team name. Why do I say this? Because if you were to query the matches_spain table without the CASE statement, then it will return the ID of the team, not the actual name since the actual name is stored in a different table (we do not need to call the other table in this instance).
FROM clause: We are telling SQL: ‘Pull the data from the matches_spain table’
WHERE clause: In the WHERE filter, we are telling SQL to only include/return the matches where the home team is EITHER (conditional OR— only one can be true) id 8634 OR id 8633 & then the AND conditional statement says that the same conditions are are required for the away team.
Data Science for Business, Chapter 2: 'Business Understanding'
The ‘Business Understanding’ section in Chapter two, in essence, covered how to understand what the business problem needs.
So, what is the business understanding phase? Simply put: Given the business problem, how can we transform this business problem into a data science problem? In this first stage, it is important for the designers to think carefully about the problem being solved. In this early stage of the business problem, we design a solution that integrates data mining solutions (Machine Learning algorithms & models).
Designers should ask the following questions:
‘What exactly do we want to do?’
‘How exactly do we want to do it?’
'What parts of this use scenario constitute possible data mining models?'
Note: This book uses ‘Data Mining Models’ to refer to Machine Learning Models/Algorithms.
So, why should we frame this business problem into a data science problem?
Because it allows us to systemically decompose the business problem into data mining tasks.
Practical SQL, Chapter 4: Beginning to 'Understanding Characters'
Understanding data types is important to maintain data base consistency & integrity. In PostgreSQL, data types can fall into either of these three categories: Characters, Numbers, & Dates & Times. In this chapter, we focused only on the ‘Characters’ category.
Character string datatypes are flexible because they can be numbers (as strings), symbols, & characters. Let’s begin by explaining the different types of character formats:
char(n): Where the n is a fixed number (defined by developer), char(n) stores ‘n’ amount of characters even if the inputed amount of characters does not amount to ‘n’.
Here’s an example:
Suppose I create a column with the data type char(10). This means 10 characters is allocated to the input. If I were to insert the text ‘hello’, since it’s a char data type, the system will automatically fill the remaining 5 characters with blank space. So, really what we have is: ‘hello ‘ (Because 10 - 5 = 5 remaining character spaces). This could potentially use more storage specially if the database has thousands, or millions of rows.varchar(n): Similar to char(n), the user defines the limit of how long the character can be in the column. However, varchar differs in that, for example, given varchar(20), PostgreSQL will not fill the left over character allocation with empty spaces.
So, back to our ‘hello’ example, if we had varchar(10) & we insert ‘hello’ into the data table, the remaining 5 character spaces will not be filled in with empty spaces so we have: ‘hello’ NOT ‘hello ‘text: Data type has unlimited length, Could be practical if unsure of how long you want a character to be.
Here is an example of the three character types:
Notice how ‘char_column’ has ‘abc ‘ unlike varchar_column that has ‘abc’.
Subscribe to my newsletter
Read articles from Sebastian directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
