A "Really Basic" Encoder and Decoder


We often encounter encoders and decoders while learning about machine learning. Here is a basic encoder and decoder, written in Python, to help build intuition for these concepts.
A basic layout and quick understanding
The encoder converts an input sequence into a vector representation, while the decoder converts a vector back into an output sequence.
Implementation
Encoder
def encode(self, input):
    # Collect one Unicode code point per character.
    vector = []
    for ch in input:
        vector.append(ord(ch))
    return vector
Here, the encode function takes a parameter named input, which is a string. An empty list named vector is created, and the Unicode code point of each character in the input string, obtained with ord(), is appended to it. Finally, the function returns the list of code points.
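As a minimal sketch of the same idea, the loop can also be written as a list comprehension. The standalone function and the parameter name text are assumptions made for illustration; they are not part of the original class.

```python
def encode(text):
    """Return the Unicode code point of each character in text."""
    # ord() maps a single character to its integer code point.
    return [ord(ch) for ch in text]


print(encode("Hi"))  # [72, 105]
```

Naming the parameter text rather than input also avoids shadowing Python's built-in input() function.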
Decoder
def decode(self, vector):
    # Rebuild the string one character at a time.
    res = ""
    for num in vector:
        res += chr(num)
    return res
The decode function takes a parameter named vector, which is the encoded list. Decoding is the reverse of encoding: each Unicode code point is converted back into a character using the chr() function and appended to a string named res, which is returned at the end.
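A common alternative to repeated string concatenation is str.join, which builds the result in one step. The sketch below assumes a standalone function rather than the method shown above.

```python
def decode(vector):
    """Turn a list of Unicode code points back into a string."""
    # chr() maps an integer code point to its character;
    # "".join() concatenates the characters into one string.
    return "".join(chr(num) for num in vector)


print(decode([72, 105]))  # Hi
```

Because Python strings are immutable, each += creates a new string, so join is usually preferred for longer inputs.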
Entire code
class Tokenizer:
    def encode(self, input):
        # Convert each character to its Unicode code point.
        vector = []
        for ch in input:
            vector.append(ord(ch))
        return vector

    def decode(self, vector):
        # Convert each code point back to its character.
        res = ""
        for num in vector:
            res += chr(num)
        return res


tokenizer = Tokenizer()
encodedData = tokenizer.encode("Hello, World!")
print(encodedData)
decodedData = tokenizer.decode(encodedData)
print(decodedData)
The encoder and decoder functions are encapsulated inside a class named Tokenizer. After creating an object of the class, we call its methods (encode and decode) and print the encoded and decoded values.
Dry Run and Output
“Hello, World!” is passed as an argument to the encode method. A list named vector is created, and each character in “Hello, World!” is converted to its Unicode code point (H → 72, e → 101, l → 108, …) and appended to vector.
The value returned by the encode method is stored in the variable encodedData and printed.
encodedData is then passed as an argument to the decode method. An empty string named res is created, and each Unicode code point is converted to a character and added to res. Finally, res is returned.
The value returned by the decode method is stored in the variable decodedData and printed.
Here is the output of the code:
[72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33]
Hello, World!
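As a quick check on the dry run above, the round trip can be verified with assertions: decoding an encoded string should give back the original. This is a sketch using hypothetical standalone helpers that mirror the two methods, and the sample strings are chosen only for illustration.

```python
def encode(text):
    # Map each character to its Unicode code point.
    return [ord(ch) for ch in text]


def decode(vector):
    # Map each code point back to its character and join the result.
    return "".join(chr(num) for num in vector)


samples = ["Hello, World!", "", "héllo"]
for sample in samples:
    # Decoding the encoded list should reproduce the original string.
    assert decode(encode(sample)) == sample

print(encode("é"))  # [233]
```

The check also passes for non-ASCII characters, since ord() and chr() work on any Unicode code point.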
Conclusion
Encoding and decoding are fundamental processes in programming. In the code above, we performed character-level encoding and decoding in Python to get a basic understanding of both. Many other encoding and decoding methods exist for different kinds of data and use cases, and they are worth exploring further.
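As one example of another encoding method, Python's built-in str.encode and bytes.decode convert between text and UTF-8 bytes. The sketch below is only an illustration; the sample strings are assumptions and are not taken from the code above.

```python
text = "Hello, World!"

# str.encode turns text into bytes; bytes.decode reverses it.
utf8_bytes = text.encode("utf-8")
print(list(utf8_bytes))            # byte values, one per ASCII character
print(utf8_bytes.decode("utf-8"))  # Hello, World!

# Outside ASCII, one character can take several bytes in UTF-8.
accented = "é"
print([ord(ch) for ch in accented])    # [233]
print(list(accented.encode("utf-8")))  # [195, 169]
```

For plain ASCII text the UTF-8 byte values match the code points returned by ord(), which is why the first list looks the same as the tokenizer's output.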
Written by Nitesh Singh
