Analytical April: Learning Python for Data Science
Python is a powerful programming language that has become increasingly popular over the years. It's a great language for beginners and experienced programmers alike. Its vast array of libraries and frameworks makes it an excellent choice for data science and data analysis. That's why I decided to try it out in this month's #12in23 challenge for Analytical April.
If you're not familiar with the #12in23 Challenge, it's a programming-language learning challenge that tasks you with trying out 12 different languages in the year of 2023. It's coordinated by Exercism.org.
The idea is simple: each month, you'll focus on learning a new programming language. We'll provide resources and support to help you along the way, and you can track your progress and connect with other participants on our online community.
I'm documenting my experience with this challenge throughout the year.
Each month has a specific theme in mind. For April, the theme is Analytical April. The focus is on languages that excel at data science. You can work toward the goal by completing exercises from Julia, Python, or R.
I have used a bit of Python and R, so the natural choice would've been to go with Julia. However, due to the fact that I haven't used Python in a long time or very extensively, and that it is one of the most popular programming languages out there, especially for data science...
I chose Python!
Python is popular. It is consistently one of the most used languages in the world. Plus, the people who use Python love it. It has a simple syntax which makes it easy to learn and use. Features like high-level data structures, garbage collection, and dynamic typing make it even easier. It is a general programming language with support for many systems. Couple that with its impressive standard library, and you can see why Python can be found everywhere: it is extremely versatile. Its large and friendly community helps Python stand out amongst other programming languages. If you decide to learn Python, you'll be embarking on an endeavor with a delightful language backed by a very active community for support.
Of course, there are other reasons to learn Python, like its interoperability with other languages (like C and Java) and its scalability (as proven by its use in big tech firms like Google and Meta). I could go on and on about its virtues, but I think its best to continue with the point of this article.
This time again, I'll be installing and setting up a local Python development environment for this month. The main reason for doing this is so I can have access to more powerful developer tooling. I want it to be as easy as possible to write code, debug it, run tests faster, and just get a better sense of the real-world experience developers might have with the language in general.
Hello, World!
Exercism tracks always start with a required "Hello, World!" exercise just to make sure you can complete and submit everything okay. They gave me this code to start with.
def hello():
return 'Goodbye, Mars!'
This one was incredibly simple. I just had to make the program return "Hello, World!" instead of "Goodbye, Mars!" so I just had to change that one string.
def hello():
return 'Hello, World!'
With Hello, World! out of the way, let's get started solving 5 exercises so we can complete this month's challenge.
I'll be tackling each of the 5 "featured exercises." These exercises are chosen by Exercism because they particularly highlight a scenario where this month's language theme is beneficial. While you can complete the monthly challenge by completing any 5 exercises, you will get an extra-special "year-long" #12in23 badge for completing the featured exercises.
The 5 exercises that exemplify Analytical languages are ETL, Largest Series Product, Saddle Points, Sum Of Multiples, and Word Count. Let's start working on those.
ETL
ETL stands for Extract-Transform-Load. It's a common pattern when working with data. Say you have some data in an old system and you need to move it to a new system. First, you extract the data from the old system, then you transform it in some useful way - at least to make it compatible with the new system - and then you load it into the new system.
In this case, we are given a set of data that stores a list of scores and the letters that correspond to them (think "Scrabble").
1 point: "A", "E", "I", etc.
2 points: "D", "G"
3 points: "B", "C", "M", etc.
and so on
We need to transform this data into a new format. One that uses a list of letters with their corresponding score.
A: 1 point
B: 3 points
C: 3 points
and so on
To get us started, the exercise gives us a function that we have to fill in.
def transform(legacy_data):
pass
My solution uses nested for
loops to pull the letters from each score. It places the letters in an accumulating dictionary that we eventually return.
def transform(legacy_data):
# Create an empty dictionary
result = {}
# Loop through all of the keys in `legacy_data`. Each key is a point
# value
for points in legacy_data:
# Each key (point value) corresponds to an array of letters. Loop
# through the array of letters
for letter in legacy_data[points]:
# Assign the point value we're current in to the key
# `letter.lower()` in our `result` dictionary. `.lower()` ensures
# that all letters in the result are lowercase
result[letter.lower()] = points
# Return the dictionary filled with letters and corresponding point
# values
return result;
Largest Series Product
For this exercise, you must write a function that takes two arguments:
A string of numbers such as
1027839564
called theseries
A number such as
3
called thesize
The function should return the largest product for a contiguous substring of series
of length size
.
Using the same numbers as above for example, the substring of size 3
which produces the largest product is 956
. 9 x 5 x 6 = 270
, which is the number that would be returned in this case.
In my solution, I first check for a few error conditions and raise
exceptions for each one.
Then I start at index 0
and find the product for all digits after it up to the given size
. If that product is larger than the currentlargest_product
, I save it as the new largest_product
. Then I start another iteration starting at index 1
. I keep looping through until all substrings in the series
are checked. At the end, I return the largest_product
.
def largest_product(series, size):
# If the size is negative, raise an error
if size < 0:
raise ValueError("span must not be negative")
# If the series is empty and the size is positive, raise an error
if series == "" and size > 0:
raise ValueError("span must be smaller than string length")
# If the series is not empty but cannot be converted to an integer, raise
# an error
try:
series != "" and int(series)
except ValueError:
raise ValueError("digits input must only contain digits")
# If the size is bigger than the series, raise an error
if size > len(series):
raise ValueError("span must be smaller than string length")
# Special case: If the series is empty and the size is zero, the largest
# product is 1
if series == "" and size == 0:
return 1
# Before any series are calculated, 0 is the largest product
largest_product = 0
# Loop through a range starting at 0 and ending at the end of the series,
# minus the size
for i in range(len(series) - size + 1):
# Get a slice of the series that starts at the current position and
# is the length of the given size
subseries = series[i:i + size]
# The product starts at 1
product = 1
# For each digit in the slice...
for digit in subseries:
# Multiply the current product by the digit
product *= int(digit)
# If the calculation resulted in a product larger than the current
# largest, set the new product as the largest
if product > largest_product:
largest_product = product
# Return the largest product
return largest_product
Saddle Points
I had never heard of a "saddle" point before, but this exercise gave a strict definition of how to determine one in this case.
Given a matrix like this:
9 8 7
5 3 2
6 6 7
Find any position where the number in that position is the largest of any other in its row and it is the smallest of any other in its column. In the example matrix above, the 5
that is at row 2, column 1 meets this criteria, so it is a saddle point.
For my solution, I first saved the largest number for each row in a list. Then I stored the smallest number for each column in a list. With that information ready, I used nested for
loops to access each number, checking if it was the largest in its row and smallest in its column. If so, I appended the row
and column
indexes to a list called saddle_points
. When the loops finish, the saddle_points
list is complete and I return it.
def saddle_points(matrix):
# If the given matrix is empty, return an empty list
if len(matrix) == 0:
return [];
# Use `*` to unpack the list of lists, then `zip` them together. This
# transposes (or "rotates") the matrix. Now each "row" is a "column".
cols = zip(*matrix)
# Map over each row and get the largest number
row_maxs = list(map(max, matrix))
# Map over each col and get the smallest number
col_mins = list(map(min, cols))
# Create an empty list that we can append to
saddle_points = []
# Loop over each row
for row_index, row in enumerate(matrix):
# If the current row isn't the same length as the first row, the
# matrix doesn't have an equal length in each row. It is irregular;
# raise an exception.
if len(row) != len(matrix[0]):
raise ValueError("irregular matrix")
# Loop over each number in the current row
for col_index, n in enumerate(row):
# If the current number is greater than or equal to the largest
# number in the row AND it is less than or equal to the smallest
# number in the column...
if n >= row_maxs[row_index] and n <= col_mins[col_index]:
# ...a saddle point was found. Append the row and column
# index to the list
saddle_points.append({ "row": row_index + 1, "column": col_index + 1})
# Return the list of saddle points
return saddle_points;
Sum Of Multiples
In this exercise, you are given a list of factors
and a limit
. You must add up all the unique multiples of the factors
, up to the limit
. So if the factors are [3, 5]
and the limit is 20
...
We find the multiples of 3 that are less than 20: 3, 6, 9, 12, 15, 18
Find the multiples of 5 that are less than 20: 5, 10, 15
Compile a list of unique multiples: 3, 5, 6, 9, 10, 12, 15, 18
Sum them: 78
My solution uses a for
loop to work over all given factors
, it then starts a while
loop that runs while the current multiple
is under the given limit
. It appends that multiple
to a list, then moves on to the next multiple.
After the loops are complete, I remove duplicates by converting the list to a set
and then back to a list
. Finally, I return the sum
of the list.
def sum_of_multiples(limit, factors):
# Create an emply list that will be used to contain multiples
multiples = []
# For each factor given...
for factor in factors:
# If the factor is 0, go to the next factor
if factor == 0:
continue
# Set/reset `multiples` to 0
multiple = 0
# While the current multiple is less than the given limit...
while multiple < limit:
# Append that multiple to the list of multiples
multiples.append(multiple)
# Increase the multiple by the current factor and then loop again
multiple += factor
# Remove duplicates by converting the list to a set and then back to a
# list
unique_multiples = list(set(multiples))
# Return the sum of the list of unique multiples
return sum(unique_multiples)
Word Count
The last exercise was to take a sentence and return a list of all the words in that sentence, along with how many times the word appears.
For this one, I...
Converted the sentence to all lowercase letters
Used a regular expression (regex) to remove any non-alphanumeric or apostrophe characters
Split the sentence on whitespace to get a list of words
Create a new empty dictionary
Use a
for
loop over each wordIf the word already exists in the dictionary, increment its number by 1
If it doesn't, add it to the dictionary with a value of 1
Return the dictionary
# Import `re` to enable the use of `re.sub()`
import re
def count_words(sentence):
# Convert the entire sentence to lowercase
lower_sentence = sentence.lower()
# Use `re.sub()` and a regular expression to remove apostrophes that
# appear at the beginning or end of any word, and also remove any
# character that isn't a lowercase letter, a number, whitespace, or
# apostrophe. The removed characters are replace by a space
clean_sentence = re.sub(r"(?<!\w)'|'(?!\w)|[^a-z0-9\s']+", " ", lower_sentence)
# Split the sentence on whitespace
split_sentence = clean_sentence.split()
# Create an empty dictionary of words
words = {}
# Loop through each word in the sentence
for word in split_sentence:
# If the current word is already in the dictionary...
if word in words:
# Increment that word's count by 1
words[word] += 1
# If it's not in the dictionary yet...
else:
# Add it to the dictionary with a count of 1
words[word] = 1
# Return the dictionary of words
return words
Four months down
With those five exercises in Python complete, I'm finished with Analytical April! Eight more months and eight more languages to go.
In addition, I've published solutions to all 15 of the featured exercises so far. This progress brings me closer to the exclusive year-long #12in23 badge.
Overall, Python was a great experience. What strikes me most about the language is that the syntax is easy to use. While reading the problems for each exercise, solutions in Python seemed to naturally write themselves in my mind. A little bit of tweaking and debugging later, and I've got a working solution that passes all the tests. I can see myself using this language again when the need arises. There are many times I reach for JavaScript when working with data, if only for the reason that my data is already accessible by whatever JavaScript-based system I'm using. However, there are times when bringing in Python could be very beneficial. I'll have to keep this language in mind for times like those.
I've posted the solutions in this article on GitHub with much more detailed comments in the code.
But I highly suggest you try to solve some of the exercises on Exercism.org on your own. Won't you join me in trying 12 new programming languages in 2023? I'm curious to hear about your experience with this challenge. Leave a comment and let me know what strategies have worked for you or which languages you're most excited to learn.
Subscribe to my newsletter
Read articles from Travis Horn directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Travis Horn
Travis Horn
I have a passion for discovering and working with cutting-edge technology. I am a constant and quick learner. I enjoy collaborating with others to solve problems. I believe helping people achieve their goals helps me achieve mine.