Learning Python: Week 1 - Core Concepts for Automation

MRIDUL TIWARI
23 min read

I started learning Python a while ago, but I wanted to learn more about what I can do with it to automate my daily tasks and gain a better understanding of Python's potential. So, I started writing scripts to solve daily-life problems.

This article covers the concepts I learned and felt were important before starting my journey to “Automation with Python.”

I covered the basics with the “Chai aur code” playlist, which gave me the foundation to build on (reference: Chai aur Python playlist).

Internal Working of Python

  • Python needs an interpreter to compile and run the code

    How does the Python interpreter work?

  1. Compile to bytecode ⇒ an intermediate step

    • Bytecode ⇒ low-level and platform-independent

  • Bytecode does not make the program itself run faster; a cached .pyc file makes startup faster, because the compile step (parsing and syntax checks) can be skipped on the next import

  • .pyc → files that contain compiled Python bytecode (not the same thing as "frozen binaries", which bundle a program together with an interpreter into an executable)

    What is pycache? (__pycache__)

    • a folder in which Python caches the compiled bytecode (.pyc files) of imported modules

    • The double underscores before and after pycache signal that this is for Python’s internal use.

    • hello_chai.cpython-312.pyc: what does the name signify?

      • the module it was compiled from, plus the Python implementation and version that compiled it

      • Python does not diff the code; each .pyc records the source file's last-modified time and size (or, optionally, a hash of the source), and the module is recompiled when these no longer match

      • cpython is the interpreter implementation we usually use for standard Python, and 312 is the version installed on your system (Python 3.12)

      • A .pyc file is written only for imported files

      • not for top-level files or when you have only one file

  2. Python Virtual Machine (PVM)

  • It is software that runs a loop, executing the bytecode instructions one by one

  • When you run a script directly, it is compiled to bytecode in memory and then executed by the PVM

  • It is the run-time engine of the Python interpreter

  • Bytecode is not machine code; it is a Python-specific instruction format

  • Implementations: CPython (the standard implementation), Jython (runs on the JVM), IronPython (runs on .NET), Stackless, PyPy
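
The compile step and the cache file name described above can be inspected from Python itself. This is a minimal sketch using only the standard library; `hello_chai.py` is the file name from the example above, the `add` function is made up for illustration, and the exact tag (for example `cpython-312`) depends on the interpreter that runs the code.

```python
import dis
import importlib.util
import sys


def add(a, b):
    return a + b


# The function body has already been compiled to bytecode.
bytecode = add.__code__.co_code
print(type(bytecode), len(bytecode))

# Human-readable listing of the bytecode instructions.
dis.dis(add)

# Where the import system would cache the compiled module.
cache_path = importlib.util.cache_from_source("hello_chai.py")
print(cache_path)

# The tag embedded in the file name: implementation plus version.
print(sys.implementation.cache_tag)
```

The instruction listing printed by `dis.dis` differs between Python versions, which is one reason the version is part of the `.pyc` file name.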

Mutable vs Immutable

| Mutable | Immutable |
| --- | --- |
| List | Integer |
| Set | Floating-point numbers |
| Dictionary | Boolean |
| ByteArray | Strings |
| Array | Tuples |
| | Frozen set |
| | Bytes |

  • Every value in Python is an object, e.g. a string object or a float object

  • With a string, it is the stored value that is immutable, not the variable referring to it; the variable can be rebound to a different string
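
The difference between changing an object and rebinding a variable is easier to see with `id()` and `is`. A small sketch; the variable names are made up for illustration.

```python
# Lists are mutable: the object changes in place, so its identity stays the same.
numbers = [1, 2, 3]
list_id_before = id(numbers)
numbers.append(4)
list_same_object = id(numbers) == list_id_before

# Strings are immutable: "changing" one builds a new object and rebinds the name.
greeting = "chai"
alias = greeting
greeting += " aur code"
string_rebound = greeting is not alias

# Trying to modify a string in place fails.
try:
    alias[0] = "C"
except TypeError as exc:
    error_message = str(exc)

print(list_same_object)   # True
print(string_rebound)     # True
print(alias)              # chai (the original string object is untouched)
print(error_message)
```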

OOPS

a way of organizing code that uses objects and classes to represent real-world entities and their behavior

Basic Class and Object

  • class → a blueprint for creating objects

  • defines a set of attributes (properties) and methods that the created object can have

  • Imp Points

    • Created via keyword class

    • Attributes = variables that belong to a class

    • Attributes are public by default and can be accessed via the dot (.) operator

    • By convention, class names are capitalized (CapWords style)

  • Object → instance of class

    • represents a specific implementation of a class and holds its data

    • state → represented by an attribute and reflects properties of an object

    • behavior → represented by methods of an object and reflects the response of the object

    • identity → distinguishes an object from every other object and enables one object to interact with other objects

  • self parameter is a reference to the current instance of the class. It allows us to access the attributes and methods of the object.

  • The __init__ method acts as the constructor in Python and is called automatically when a new object is created. It initializes the attributes of the object.

  • Class variable → variables that are shared across all the instances of a class

    • defined outside any method in a class
  • Instance variable → variable unique to each instance (object) of a class

    • defined with __init__ method or other instance methods
class Car:
    # can't write like this otherwise we won't be able to change or set values
    # brand=None
    # model=None

    # this Init method is a constructor
    engine="TOP" # class variable
    def __init__(self,brand,model): # via self we are giving context to class and variables to be accessed 
        # self is nothing but 'this' in JS
        self.brand=brand # instance variable
        self.model=model

my_car=Car("Toyota","Corolla") # Object created
print(my_car) # Output: <__main__.Car object at 0x000001E8F7F4
print(my_car.brand) # Output: Toyota

Inheritance

  • a child class inherits attributes and methods from its parent class

  • promotes code reuse

  • Types

    • Single Inheritance→ child class inherits from a single parent class

    • Multiple Inheritance→ child class inherits from more than one parent class

    • Multilevel Inheritance → child class inherits from parent class, which inherits from another class

    • Hierarchical Inheritance → multiple child classes inherit from a single parent class

    • Hybrid Inheritance → combination of two or more types

# Single Inheritance
class Dog:
    def __init__(self, name):
        self.name = name

    def display_name(self):
        print(f"Dog's Name: {self.name}")

class Labrador(Dog):  # Single Inheritance
    def sound(self):
        print("Labrador woofs")

# Multilevel Inheritance
class GuideDog(Labrador):  # Multilevel Inheritance
    def guide(self):
        print(f"{self.name} guides the way!")

# Multiple Inheritance
class Friendly:
    def greet(self):
        print("Friendly!")

class GoldenRetriever(Dog, Friendly):  # Multiple Inheritance
    def sound(self):
        print("Golden Retriever Barks")

# Example Usage
lab = Labrador("Buddy")
lab.display_name()
lab.sound()

guide_dog = GuideDog("Max")
guide_dog.display_name()
guide_dog.guide()

retriever = GoldenRetriever("Charlie")
retriever.display_name()
retriever.greet()
retriever.sound()

Encapsulation

  • When you add two underscores before an attribute name, Python treats it as private by mangling the name to _ClassName__attribute

  • private here means it is meant to be used only inside the class; accessing it by its plain name from outside raises an AttributeError

  • Types of Encapsulation:

    1. Public Members: Accessible from anywhere.

    2. Protected Members: Meant to be used within the class and its subclasses (single leading underscore, _). This is a convention only; Python does not enforce it.

    3. Private Members: Meant to be used only within the class (double leading underscore, __), backed by name mangling.

class Car:

    def __init__(self,brand,model):
        self.__brand=brand
        self.model=model

    def full_name(self):
        return self.__brand +" "+ self.model

    def get_brand(self):
        return self.__brand+ " !"

my_car=Car("Tesla", "Model S")
# print(my_car.__brand) # can't be accessed directly: raises AttributeError because __brand is a private (name-mangled) attribute
print(my_car.get_brand()) # Output : Tesla !
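
Private attributes in Python rely on name mangling rather than on real access control. The sketch below shows what happens to the three kinds of members; the `Account` class and its attributes are hypothetical and only serve as an illustration.

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner          # public
        self._bank = "Demo Bank"    # "protected" by convention only
        self.__balance = balance    # name-mangled to _Account__balance

    def get_balance(self):
        return self.__balance


acct = Account("Asha", 100)

# Direct access fails because no attribute named __balance exists on the object.
try:
    acct.__balance
    direct_access_worked = True
except AttributeError:
    direct_access_worked = False

print(direct_access_worked)        # False
print(acct.get_balance())          # 100
print(acct._Account__balance)      # 100 (mangled name, still reachable)
print(acct._bank)                  # Demo Bank (nothing enforces the convention)
```

Mangling mainly prevents accidental name clashes in subclasses; it is not a security boundary.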

Polymorphism

  • overriding or overloading the method

    allows methods to have the same name but behave differently based on the object's context

    Types

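    • Compile-Time → determined during compilation of the program (in statically typed languages such as Java or C++)

      • allows methods or operators with the same name to behave differently based on their input parameters

      • method overloading; Python does not support it directly, so it is usually mimicked with default or variable arguments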

    • Run-Time → determined during execution of the program

      • occurs when a subclass provides a specific implementation for a method already defined in its parent class

      • method overriding

    # Parent Class
    class Dog:
        def sound(self):
            print("dog sound")  # Default implementation

    # Run-Time Polymorphism: Method Overriding
    class Labrador(Dog):
        def sound(self):
            print("Labrador woofs")  # Overriding parent method

    class Beagle(Dog):
        def sound(self):
            print("Beagle Barks")  # Overriding parent method

    # Compile-Time Polymorphism: Method Overloading Mimic
    class Calculator:
        def add(self, a, b=0, c=0):
            return a + b + c  # Supports multiple ways to call add()

    # Run-Time Polymorphism
    dogs = [Dog(), Labrador(), Beagle()]
    for dog in dogs:
        dog.sound()  # Calls the appropriate method based on the object type


    # Compile-Time Polymorphism (Mimicked using default arguments)
    calc = Calculator()
    print(calc.add(5, 10))  # Two arguments
    print(calc.add(5, 10, 15))  # Three arguments

Abstraction

  • Hides internal implementation details while exposing only necessary functionality

  • Types of Abstraction:

    • Partial Abstraction: Abstract class contains both abstract and concrete methods.

    • Full Abstraction: Abstract class contains only abstract methods (like interfaces).

from abc import ABC, abstractmethod

class Dog(ABC):  # Abstract Class
    def __init__(self, name):
        self.name = name

    @abstractmethod
    def sound(self):  # Abstract Method
        pass

    def display_name(self):  # Concrete Method
        print(f"Dog's Name: {self.name}")

class Labrador(Dog):  # Partial Abstraction
    def sound(self):
        print("Labrador Woof!")

class Beagle(Dog):  # Partial Abstraction
    def sound(self):
        print("Beagle Bark!")

# Example Usage
dogs = [Labrador("Buddy"), Beagle("Charlie")]
for dog in dogs:
    dog.display_name()  # Calls concrete method
    dog.sound()  # Calls implemented abstract method

Static methods (@staticmethod)

  • There can be some functionality that relates to the class, but does not require any instance(s) to do some work; static methods can be used in such cases.

    • A static method can be called on the class directly, without creating an object (calling it on an object also works)

    • a static method lives in the class's namespace and receives neither the instance (self) nor the class (cls)

    • uses the @staticmethod decorator

class Car:
    total_car=0
    def __init__(self,brand,model):
        self.__brand=brand
        self.model=model
        Car.total_car+=1
        # self.total_car+=1

    def full_name(self):
        return self.__brand +" "+ self.model

    def get_brand(self):
        return self.__brand+ " !"

    def fuel_type(self):
        return "Petrol or Diesel"

    @staticmethod
    def general_description():
        return "This is a car"

my_car=Car("Toyota","Corolla") # Object created
print(my_car.general_description()) # Also works: a static method can be called on an instance too
print(Car.general_description())

Make an attribute read-only

  • Using the property decorator, we can make an attribute read-only and access it just like a property

  • a property without a setter cannot be assigned to, so the attribute cannot be overwritten from outside

      class Car:
          total_car=0
          def __init__(self,brand,model):
              self.__brand=brand
              self.__model=model
              Car.total_car+=1
              # self.total_car+=1
    
          def full_name(self):
              return self.__brand +" "+ self.__model
    
          def get_brand(self):
              return self.__brand+ " !"
    
          def fuel_type(self):
              return "Petrol or Diesel"
    
          @staticmethod
          def general_description():
              return "This is a car"
    
          @property
          def model(self):
              return self.__model
    
      my_car=Car("Toyota","Corolla") # Object created
    
      # my_car.model="City" # raises AttributeError: the property has no setter
      print(my_car.model)
    

class inheritance and isinstance() function

using isinstance to check if the said object is an instance of a particular class or not


class ElectricCar(Car):
    def __init__(self,brand,model,battery_size):
        super().__init__(brand,model) # calling the parent class constructor (the class above us)
        self.battery_size=battery_size

    def fuel_type(self):
        return "Electric Charge"

my_electric_car=ElectricCar("Tesla","Model S","85kWH")

print(f"{isinstance(my_electric_car,Car)} {isinstance(my_electric_car,ElectricCar)}") # True True

Multiple Inheritance

  • Multiple inheritance is possible in Python

class Battery:
    def battery_info(self):
        return "This is a battery"

class Engine:
    def engine_info(self):
        return "This is an engine"

class ElectricCar2(Battery,Engine,Car):
    pass

my_new_tesla=ElectricCar2("Tesla","Model R")
print(my_new_tesla.battery_info()) # This is a battery
print(my_new_tesla.engine_info()) # This is an engine

Special methods

Special methods in Python (also known as dunder methods, for “double underscore”) are methods with names like __init__, __str__, etc. They allow custom classes to integrate naturally with Python syntax and built-in functions.

  • You rarely call them yourself; Python calls them implicitly when the matching syntax or built-in function is used

They’re the reason you can do:

  • len(obj)

  • obj + other

  • print(obj)

    and have them work your way.

    | Purpose | Method | Explanation |
    | --- | --- | --- |
    | Returns a string with a printable representation of the object | __repr__(self) | repr(x) invokes x.__repr__(); also used when the interactive console echoes an object |
    | Returns a string representation of the object | __str__(self) | str(x) invokes x.__str__() |

  • Mathematical Operator

    | Operation | Method | Explanation |
    | --- | --- | --- |
    | Add | __add__(self, other) | x + y invokes x.__add__(y) |
    | Subtract | __sub__(self, other) | x - y invokes x.__sub__(y) |
    | Multiply | __mul__(self, other) | x * y invokes x.__mul__(y) |
    | Divide | __truediv__(self, other) | x / y invokes x.__truediv__(y) |
    | Power | __pow__(self, other) | x ** y invokes x.__pow__(y) |

  • Container-like class

    | Operation | Method | Explanation |
    | --- | --- | --- |
    | Length | __len__(self) | len(x) invokes x.__len__() |
    | Get Item | __getitem__(self, key) | x[key] invokes x.__getitem__(key) |
    | Set Item | __setitem__(self, key, item) | x[key] = item invokes x.__setitem__(key, item) |
    | Contains | __contains__(self, item) | item in x invokes x.__contains__(item) |
    | Iterator | __iter__(self) | iter(x) invokes x.__iter__() |
    | Next | __next__(self) | next(x) invokes x.__next__() |
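
To see several of these special methods working together, here is a small sketch of a container class. `Playlist` and its song names are hypothetical; only the dunder method names come from the tables above.

```python
class Playlist:
    """A tiny container that plugs into len(), indexing, `in`, + and print()."""

    def __init__(self, songs):
        self.songs = list(songs)

    def __repr__(self):
        return f"Playlist({self.songs!r})"

    def __str__(self):
        return ", ".join(self.songs)

    def __len__(self):
        return len(self.songs)

    def __getitem__(self, index):
        return self.songs[index]

    def __contains__(self, song):
        return song in self.songs

    def __add__(self, other):
        return Playlist(self.songs + other.songs)


morning = Playlist(["Intro", "Chai"])
evening = Playlist(["Code"])
combined = morning + evening

print(len(combined))        # 3
print(combined[0])          # Intro
print("Chai" in combined)   # True
print(combined)             # Intro, Chai, Code
print(repr(combined))       # Playlist(['Intro', 'Chai', 'Code'])
```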

    __name__ == "__main__"

    • The idea: a module's code can be used in two ways, imported by another module or run directly as the original .py file, and you often want to know which of the two is happening

    • When the interpreter runs a module as the main program, it sets that module's __name__ variable to "__main__"

    • When the module is imported from another module, its __name__ variable is set to the module’s own name

    # file_two.py: Python module to import

    print("File two __name__ is set to: {}".format(__name__))  # prints __main__ only when file_two.py is run directly

    -------------------------------------------------------
    # file_one.py: Python module to execute
    import file_two

    print("File one __name__ is set to: {}".format(__name__))  # prints __main__ for this file; the import above prints file_two
  • The __name__ variable of the file/module that is run is always "__main__", while the __name__ variable of every imported module is set to that module's name.

  • If you don’t use any __name__ condition, all top-level code in the file is executed, whether the file is run directly or imported

  • When you put code inside an if __name__ == "__main__" block, the functions, classes, etc. defined in the file are still loaded on import, but the code inside the block runs only when the file is executed directly.

  • We can use an if __name__ == "__main__" block to allow or prevent parts of code from being run when the modules are imported.
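
A common way to apply this is to keep the reusable functions at the top level and put the "run as a script" part behind the guard. A minimal sketch; the file name `greeter.py` and both function names are hypothetical.

```python
# greeter.py (hypothetical module name)

def greet(name):
    return f"Hello, {name}!"


def main():
    # Code that should run only when this file is executed directly.
    message = greet("World")
    print(message)
    return message


if __name__ == "__main__":
    # Skipped when another module does `import greeter`.
    main()
```

Another module can then do `from greeter import greet` without triggering the print.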

Errors and Exception Handling

  • We can use error handling to attempt to plan for possible errors

  • If nothing is used, then when an error comes, the entire script will stop, and the error will be displayed to us

  • We can use Error Handling to let the script continue with other code, even if there is an error

  • We use three keywords for this

    • try → This block of code is to be attempted ( may lead to an error )

    • except → block of code executed in case there is an error in the try block

    • finally → block to be executed, regardless of an error
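
The three keywords fit together as in the sketch below. The `safe_divide` function is a made-up example; the exception types are standard Python built-ins.

```python
def safe_divide(a, b):
    """Return a / b, or None when the division is not possible."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("Cannot divide by zero")
        result = None
    except TypeError as exc:
        print(f"Bad operand types: {exc}")
        result = None
    finally:
        # Runs whether or not an exception was raised.
        print("Division attempted")
    return result


print(safe_divide(10, 2))    # 5.0
print(safe_divide(10, 0))    # None
print(safe_divide(10, "x"))  # None
print("Script keeps running")
```

Catching specific exception types, rather than a bare `except:`, keeps unrelated bugs from being silently swallowed.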

Pylint

Pylint is a tool that

  • Lists errors it finds by analyzing the source code statically, without executing it

  • Enforces a coding standard and looks for code smells

  • Suggests how particular blocks can be updated

  • Offers details about the code's complexity

  • Pylint is similar to pychecker, pyflakes, flake8, and mypy.

  • There are several testing tools, and we will focus on two

    • pylint → a library that looks at your code and reports back possible issues

      • pylint <file> → gives statistics and reports for the file

      • Check the documentation as to what the standard is

Decorators

  • These are essentially a function that takes another function as an argument and returns a new function

  • often used for logging, authentication, and memoization, allowing us to add additional functionality to existing functions or methods in a clean, reusable way

  • Syntax

    • The wrapper function allows the decorator to handle functions with any number and type of arguments.
    def decorator_name(func):
        def wrapper(*args, **kwargs):
            # Add functionality before the original function call
            result = func(*args, **kwargs)
            # Add functionality after the original function call
            return result
        return wrapper

    @decorator_name
    def function_to_decorate():
        # Original function code
        pass
  • Higher-order functions

    • take one or more functions as arguments, return a function as a result, or both

    • Properties

      • Taking functions as arguments: a higher-order function can accept other functions as parameters

      • Returning functions: can return a new function that can be called later

        # A higher-order function that takes another function as an argument
        def fun(f, x):
            return f(x)

        # A simple function to pass
        def square(x):
            return x * x

        # Using fun to apply the square function
        res = fun(square, 5)
        print(res)
  • Decorators are higher-order functions because they take a function as input, modify it, and return a new function

  • Functions as First-class Objects

    meaning they can be treated like any other object, like integer, string, list

    • This gives functions a unique level of flexibility and allows them to be passed around and manipulated in ways that are not possible in many other programming languages.

    • Meaning

      • Functions can be assigned to variables

      • Functions can be passed as arguments

      • Functions can be returned from other functions

      • Functions can be stored in data structures (lists, dict, etc.)
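
Each of the four points above can be shown in a few lines. A small sketch; all function and variable names here are made up for illustration.

```python
def shout(text):
    return text.upper()


def whisper(text):
    return text.lower()


# 1. Assigned to a variable
speak = shout


# 2. Passed as an argument
def apply(func, text):
    return func(text)


# 3. Returned from another function
def make_repeater(times):
    def repeat(text):
        return text * times
    return repeat


# 4. Stored in a data structure
styles = {"loud": shout, "quiet": whisper}

print(speak("hi"))                 # HI
print(apply(whisper, "HI"))        # hi
print(make_repeater(3)("ab"))      # ababab
print(styles["loud"]("chai"))      # CHAI
```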

  • Type of Decorators

    • Function Decorator

      • Most common type

      • takes a function as input and returns a new function

          # Eg:
          def simple_decorator(func):
              def wrapper():
                  print("Before calling the function.")
                  func()
                  print("After calling the function.")
              return wrapper
        
          @simple_decorator
          def greet():
              print("Hello, World!")
        
          greet()
        
    • Method Decorator

      • often handle special cases such as self arguments for instance methods
        # Eg:
        def method_decorator(func):
            def wrapper(self, *args, **kwargs):
                print("Before method execution")
                res = func(self, *args, **kwargs)
                print("After method execution")
                return res
            return wrapper

        class MyClass:
            @method_decorator
            def say_hello(self):
                print("Hello!")

        obj = MyClass()
        obj.say_hello()
  • Class Decorator

    • used to modify or enhance the behavior of a class

    • Applied to the class definition

    • work by taking a class as an argument and returning a modified version of the class

        def fun(cls):
            cls.class_name = cls.__name__
            return cls

        @fun
        class Person:
            pass

        print(Person.class_name)
  • Built-in decorators

    • Python provides built-in decorators that are commonly used in class definitions

    • Modify the behavior of the method and attributes in the class

    • most common

      • @staticmethod → used to define a method that doesn’t use self (it doesn’t operate on an instance of the class)

      • can be called directly on the class, without creating an object

          #Eg:
          class MathOperations:
              @staticmethod
              def add(x, y):
                  return x + y
        
          # Using the static method
          res = MathOperations.add(5, 3)
          print(res)
        
      • @classmethod → used to define a method that operates on the class itself (uses cls)

        • can access and modify class state that applies across all instances of class
                class Employee:
                    raise_amount = 1.05

                    def __init__(self, name, salary):
                        self.name = name
                        self.salary = salary

                    @classmethod
                    def set_raise_amount(cls, amount):
                        cls.raise_amount = amount

                # Using the class method
                Employee.set_raise_amount(1.10)
                print(Employee.raise_amount)
  • @property → used to define method as property, allows you to access it like attribute

    • useful for encapsulating the implementation of a method while still providing a simple interface.
                class Circle:
                    def __init__(self, radius):
                        self._radius = radius

                    @property
                    def radius(self):
                        return self._radius

                    @radius.setter
                    def radius(self, value):
                        if value >= 0:
                            self._radius = value
                        else:
                            raise ValueError("Radius cannot be negative")

                    @property
                    def area(self):
                        return 3.14159 * (self._radius ** 2)

                # Using the property
                c = Circle(5)
                print(c.radius) 
                print(c.area)    
                c.radius = 10
                print(c.area)
  • Chaining Decorators

    • decorating a function with multiple decorators; they are applied bottom-up, so the decorator closest to the function wraps it first
        # code for testing decorator chaining 
        def decor1(func): 
            def inner(): 
                x = func() 
                return x * x 
            return inner 

        def decor(func): 
            def inner(): 
                x = func() 
                return 2 * x 
            return inner 

        @decor1
        @decor
        def num(): 
            return 10

        @decor
        @decor1
        def num2():
            return 10

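        print(num())   # 400: decor doubles 10 to 20, then decor1 squares it
        print(num2())  # 200: decor1 squares 10 to 100, then decor doubles it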
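
As an example of one of the uses listed earlier (memoization), here is a sketch of a caching decorator. The names `memoize`, `slow_square`, and `calls` are made up; `functools.wraps` is the standard-library helper that copies the wrapped function's name and docstring onto the wrapper.

```python
import functools


def memoize(func):
    """Cache results of func, keyed by its positional arguments."""
    cache = {}

    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls = []  # records every real (non-cached) call


@memoize
def slow_square(n):
    """Return n squared (pretend this is expensive)."""
    calls.append(n)
    return n * n


print(slow_square(4))        # 16, computed
print(slow_square(4))        # 16, served from the cache
print(calls)                 # [4]
print(slow_square.__name__)  # slow_square
```

For real code, the standard library already ships `functools.lru_cache`, which does the same job with size limits and keyword-argument support.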

Generator

  • Allows us to write a function that can send back a value and then later resume to pick up where it left

  • A special type of function that returns an iterator object

  • Instead of using return to send back a single value, it uses yield to produce a series of results over time

  • This allows the function to generate values and pause its execution after each yield, maintaining its state between iterations.

      #Eg:
      def fun(max):
          cnt = 1
          while cnt <= max:
              yield cnt
              cnt += 1
    
      ctr = fun(5)
      for n in ctr:
          print(n)
    
  • Why needed?

    • Handle large or infinite data without loading everything into memory

    • yield items one by one, avoiding full list creation

    • generating value only when needed → improve performance

    • Ideal for generating unbounded data like the Fibonacci series

    • Chain generators to process data in stages efficiently

  • Creating generators

      def generator_function_name(parameters):
          # Your code here
          yield expression
          # Additional code can follow
    
| Yield | Return |
| --- | --- |
| used in a generator function to provide a sequence of values over time | used to exit a function and return a final value |
| when yield executes, it pauses the function, returns the current value, and retains the state of the function | once return executes, the function terminates immediately and no state is retained |
| useful for generating large or complex sequences efficiently | useful when a single result is needed |

  • Generator Expression

    • Concise way to create generators

    • similar to a list comprehension, except that it is written in parentheses ( ) instead of square brackets [ ]

    • more memory efficient

    # Syntax:
    (expression for item in iterable)

    # Eg:
    sq = (x*x for x in range(1, 6))
    for i in sq:
        print(i)
  • Usecases

    • processing large data files, like logs

    • consuming an unbounded stream of numbers: with a generator you just call next() to get the next value, without worrying about where the stream ends
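
Both use cases can be sketched with plain generator functions. The function names and the sample log lines are made up; `itertools.islice` is the standard-library way to take a fixed number of items from an endless generator.

```python
from itertools import islice


def fibonacci():
    """Yield Fibonacci numbers forever."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


def error_lines(lines):
    """Stage 1: keep only lines that mention ERROR."""
    for line in lines:
        if "ERROR" in line:
            yield line


def messages(lines):
    """Stage 2: strip the level prefix from each line."""
    for line in lines:
        yield line.split(":", 1)[1].strip()


fib = fibonacci()
print(next(fib), next(fib), next(fib))   # 0 1 1

first_ten = list(islice(fibonacci(), 10))
print(first_ten)                         # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# A list stands in for a large log file; an open file object could be passed the same way.
log = ["INFO: started", "ERROR: disk full", "INFO: retry", "ERROR: timeout"]
for msg in messages(error_lines(log)):
    print(msg)
```

Because each stage pulls one line at a time from the previous one, the whole log never has to sit in memory.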

Collections Module

  • built-in module of Python

  • Implements specialized container data type → alternative to Python’s built-in containers that are general-purpose

  • Why needed?

    1. provides specialized container data types beyond built-in types like dict, list, and tuple

    2. include efficient alternative → deque, Counter, OrderedDict, defaultdict, and namedtuple

    3. simplifies complex data structure → cleaner and faster implementation

    4. ideal for improving performance and code readability in data-heavy applications

  • Counters

    • subclass of dictionary

    • It keeps the count of the elements in an iterable as a dictionary, where each key is an element of the iterable and each value is the number of times that element occurs. Like a regular dict, it remembers insertion order in Python 3.7+.

    • *class collections.Counter([iterable-or-mapping])*
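
A short sketch of typical Counter usage; the word list is made up for illustration.

```python
from collections import Counter

words = ["chai", "code", "chai", "python", "code", "chai"]
counts = Counter(words)

print(counts)                  # Counter({'chai': 3, 'code': 2, 'python': 1})
print(counts["chai"])          # 3
print(counts["tea"])           # 0 (missing keys count as zero, no KeyError)
print(counts.most_common(2))   # [('chai', 3), ('code', 2)]

# A Counter can be updated later with more items.
counts.update(["python", "python"])
print(counts["python"])        # 3
```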

  • OrderedDict

    • Dictionary that preserves the order in which keys are inserted

    • While regular dictionaries do this from Python 3.7+, OrderedDict also offers extra features like moving re-inserted keys to the end, making it useful for order-sensitive operations.

    • Syntax: *class collections.OrderedDict()*

    from collections import OrderedDict 
    print("This is a Dict:\n") 
    d = {} 
    d['a'] = 1
    d['b'] = 2
    d['c'] = 3
    d['d'] = 4

    for key, value in d.items(): 
        print(key, value) 

    print("\nThis is an Ordered Dict:\n") 
    od = OrderedDict() 
    od['a'] = 1
    od['b'] = 2
    od['c'] = 3
    od['d'] = 4

    for key, value in od.items(): 
        print(key, value)
  • DefaultDict

    • subclass of dictionary

    • provides a default value for keys that don’t exist, so looking up a missing key with d[key] creates the entry instead of raising a KeyError (as long as a default_factory is set)

    • Syntax: *class collections.defaultdict(default_factory)*

    from collections import defaultdict 

    # Creating a defaultdict with default value of 0 (int)
    d = defaultdict(int) 
    L = [1, 2, 3, 4, 2, 4, 1, 2] 

    # Counting occurrences of each element in the list
    for i in L: 
        d[i] += 1  # No need to check key existence; default is 0

    print(d)
  • ChainMap

    • groups several dictionaries into a single view; lookups search the underlying dictionaries in order, and the list of them is available through the maps attribute

    • *class collections.ChainMap(dict1, dict2)*

    from collections import ChainMap 

    d1 = {'a': 1, 'b': 2}
    d2 = {'c': 3, 'd': 4}
    d3 = {'e': 5, 'f': 6}

    # Defining the chainmap 
    c = ChainMap(d1, d2, d3) 
    print(c)

    # OUTPUT: ChainMap({'a': 1, 'b': 2}, {'c': 3, 'd': 4}, {'e': 5, 'f': 6})
  • NamedTuple

    • a regular tuple but with named fields, making data more readable and accessible

    • Instead of using indexes, you can access elements by name

    • Syntax: *class collections.namedtuple(typename, field_names)*

    from collections import namedtuple

    # Declaring namedtuple() 
    Student = namedtuple('Student',['name','age','DOB']) 

    # Adding values 
    S = Student('Nandini','19','2541997') 

    # Access using index 
    print ("The Student age using index is : ",end ="") 
    print (S[1]) 

    # Access using name  
    print ("The Student name using keyname is : ",end ="") 
    print (S.name)
  • Conversion operations

    • _make() → returns a namedtuple() built from the iterable passed as argument

    • _asdict() → returns a dict (an OrderedDict before Python 3.8) that maps the field names of the namedtuple() to their values
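
A brief sketch of the conversion helpers, reusing the Student fields from the example above. `_replace()` is a further standard namedtuple method not mentioned in the list; it is shown here as an extra.

```python
from collections import namedtuple

Student = namedtuple("Student", ["name", "age", "DOB"])

# _make() builds a namedtuple from any iterable of the right length.
row = ["Nandini", "19", "2541997"]
s = Student._make(row)
print(s)            # Student(name='Nandini', age='19', DOB='2541997')

# _asdict() maps field names to values.
as_dict = s._asdict()
print(as_dict["name"])   # Nandini

# _replace() returns a new tuple; the original is left unchanged.
older = s._replace(age="20")
print(older.age, s.age)  # 20 19
```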

  • Deque

    • Doubly Ended Queue

    • for quicker append and pop operations from both sides of the container

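    • Time complexity for append and pop at either end → O(1)

      • a list is O(1) (amortized) only at the right end; inserting or popping at the left end is O(n)
    • Syntax: *class collections.deque([iterable[, maxlen]])*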

    from collections import deque

    # Declaring deque
    queue = deque(['name','age','DOB']) 
    print(queue)
  • Inserting an element on the left → appendleft(<element>)

  • removing element → popleft()
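
The operations on both ends can be tried out as follows. A minimal sketch with made-up values; `maxlen` is an optional deque parameter that is not covered above.

```python
from collections import deque

queue = deque(["b", "c"])

queue.append("d")        # add on the right
queue.appendleft("a")    # add on the left
print(queue)             # deque(['a', 'b', 'c', 'd'])

right = queue.pop()      # remove from the right
left = queue.popleft()   # remove from the left
print(left, right)       # a d
print(queue)             # deque(['b', 'c'])

# maxlen turns a deque into a fixed-size window; old items fall off the other end.
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)
print(list(recent))      # [2, 3, 4]
```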

  • UserDict

    • Dictionary-like container that acts as a wrapper around dictionary objects

    • Container used when someone wants to create their own dictionary with some modified or new functionality

    • *class collections.UserDict([initialdata])*

    from collections import UserDict 

    # Creating a dictionary where deletion is not allowed
    class MyDict(UserDict): 

        # Prevents using 'del d[key]' on the dictionary
        def __delitem__(self, key): 
            raise RuntimeError("Deletion not allowed") 

        # Prevents using pop() on dictionary
        def pop(self, s=None): 
            raise RuntimeError("Deletion not allowed") 

        # Prevents using popitem() on dictionary
        def popitem(self, s=None): 
            raise RuntimeError("Deletion not allowed") 

    # Create an instance of MyDict
    d = MyDict({'a': 1, 'b': 2, 'c': 3})
    d.pop(1)  # raises RuntimeError: Deletion not allowed
  • UserList

    • list-like container that acts as a wrapper around list objects

    • Useful when someone wants to create their own list with some modified or additional functionality
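
In the same spirit as the UserDict example, a UserList subclass might look like the sketch below. The class name `NoRemoveList` and its behavior are hypothetical; `data` is the real attribute in which UserList keeps the wrapped list.

```python
from collections import UserList


class NoRemoveList(UserList):
    """A list that allows appends but refuses removals."""

    def remove(self, item=None):
        raise RuntimeError("Deletion not allowed")

    def pop(self, index=-1):
        raise RuntimeError("Deletion not allowed")


items = NoRemoveList([1, 2, 3])
items.append(4)
print(items)             # [1, 2, 3, 4]

try:
    items.pop()
except RuntimeError as exc:
    error = str(exc)
    print(error)         # Deletion not allowed

print(items.data)        # the wrapped plain list: [1, 2, 3, 4]
```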

  • UserString

    • string-like container, and just like UserDict and UserList it acts as a wrapper around string objects

    • used when someone wants to create their own string with some modified or additional functionality

    • Syntax: *class collections.UserString(seq)*

    from collections import UserString 

    # Creating a Mutable String 
    class Mystring(UserString): 

        # Function to append to string
        def append(self, s): 
            self.data += s 

        # Function to remove from string 
        def remove(self, s): 
            self.data = self.data.replace(s, "") 

    # Driver's code 
    s1 = Mystring("Geeks") 
    print("Original String:", s1.data) 

    # Appending to string 
    s1.append("s") 
    print("String After Appending:", s1.data) 

    # Removing from string 
    s1.remove("e") 
    print("String after Removing:", s1.data)

Web scraping

  • A general term for automating the gathering of data from a website

  • When a browser loads a website, the user sees what is known as the “front-end” of the website

  • A scraper grabs data from the HTML of the page and returns it

  • Rules of Web Scraping

    • Always try to get permission before scraping

    • If you make too many scraping attempts or requests, your IP address could be blocked

    • Some sites block scraping software
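
One way to respect a site's wishes is to check its robots.txt before requesting a page. The sketch below uses the standard library's `urllib.robotparser`; the rules, the user agent name `my-scraper`, and the example.com URLs are made up so that the example runs offline. robots.txt is only one signal, and a site's terms of service may add further restrictions.

```python
from urllib.robotparser import RobotFileParser

# Inline rules stand in for a real robots.txt so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_public)    # True
print(allowed_private)   # False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real robots.txt instead of using inline rules.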

  • Limitations

    • Every website is unique, which means every web scraping script is unique

    • Python can view these HTML and CSS elements programmatically, and then extract info from them

  • Libraries used are BeautifulSoup, Scrapy , Selenium

  • required libraries

    • requests → send HTTP requests to get webpages content (used for static sites)

        import requests
      
        response = requests.get('https://www.geeksforgeeks.org/python/python-programming-language-tutorial/')
      
        print(response.status_code)
      
        print(response.content)
      

      requests.get(url) → sends a GET request to the given URL

      response.status_code → returns the HTTP status code

      response.content → returns the raw HTML of the page as bytes

    • BeautifulSoup4 → parses and extract HTML content

        import requests
        from bs4 import BeautifulSoup
      
        response = requests.get('https://www.geeksforgeeks.org/python/python-programming-language-tutorial/')
      
        soup = BeautifulSoup(response.content, 'html.parser')
      
        print(soup.prettify())
      
      BeautifulSoup helps convert raw HTML into a searchable tree of elements

  • BeautifulSoup(html, parser) → converts HTML into a searchable object; html.parser is a built-in parser

  • soup.prettify() → formats HTML nicely for easier reading

  • soup.find('div', class_='article--viewer_content') → finds the first element with the given tag and class
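
To get a feel for the searchable tree, here is a small sketch that parses an inline HTML string instead of a live page, so it runs without network access. The markup, class names, and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html_doc = """
<div class="article">
  <h1>Python Tutorial</h1>
  <a href="/intro">Introduction</a>
  <a href="/loops">Loops</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match (or None if nothing matches)
heading = soup.find("h1").get_text()

# find_all() returns every match as a list
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

# class_ filters by CSS class
article = soup.find("div", class_="article")

print(heading)                # Python Tutorial
print(links)                  # [('Introduction', '/intro'), ('Loops', '/loops')]
print(article is not None)    # True
```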

    • selenium → automates browsers (needed for dynamic sites with JS)
  • WebDriver → software component Selenium uses to interact with the browser

    • bridge between Python and browser

    • Each browser has its own driver

Selenium uses this WebDriver to:

  • Open and control the browser

  • Load web pages

  • Extract elements

  • Simulate clicks, scrolls and inputs

  • *You can either manually download the WebDriver or use **webdriver-manager** which handles the download and setup automatically.*

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.chrome.service import Service
        from webdriver_manager.chrome import ChromeDriverManager
        import time

        element_list = []

        # Set up Chrome options (optional)
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run in headless mode (optional)
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")

        # Use a proper Service object
        service = Service(ChromeDriverManager().install())

        for page in range(1, 3):
            # Initialize driver properly
            driver = webdriver.Chrome(service=service, options=options)

            # Load the URL
            url = f"https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={page}"
            driver.get(url)
            time.sleep(2)  # Optional wait to ensure page loads

            # Extract product details
            titles = driver.find_elements(By.CLASS_NAME, "title")
            prices = driver.find_elements(By.CLASS_NAME, "price")
            descriptions = driver.find_elements(By.CLASS_NAME, "description")
            ratings = driver.find_elements(By.CLASS_NAME, "ratings")

            # Store results in a list
            for i in range(len(titles)):
                element_list.append([
                    titles[i].text,
                    prices[i].text,
                    descriptions[i].text,
                    ratings[i].text
                ])

            driver.quit()

        # Display extracted data
        for row in element_list:
            print(row)
  • ChromeOptions() + --headless: Runs the browser in the background without opening a visible window — ideal for automation and speed.

  • ChromeDriverManager().install(): Automatically downloads the correct version of ChromeDriver based on your Chrome browser.

  • Service(...): Wraps the ChromeDriver path for proper configuration with Selenium 4+.

  • webdriver.Chrome(service=..., options=...): Launches a Chrome browser instance with the given setup.

  • driver.get(url): Navigates to the specified page URL.

  • find_elements(By.CLASS_NAME, "class"): Extracts all elements matching the given class name like titles, prices, etc.

  • .text: Retrieves the visible text content from an HTML element.

  • element_list.append([...]): Stores each product's extracted data in a structured list.

  • driver.quit(): Closes the browser to free system resources.

    • lxml → fast HTML/XML parser, useful for large or complex pages
    from lxml import html
    import requests
    
    url = 'https://example.com/'
    response = requests.get(url)
    tree = html.fromstring(response.content)
    
    # Extract all link texts
    link_titles = tree.xpath('//a/text()')
    
    for title in link_titles:
        print(title)
    
  • html.fromstring(): Parses HTML into an element tree.

  • tree.xpath(): Uses XPath to extract specific tags or data.
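
XPath can do more than grab link text. The sketch below runs a couple of common patterns against an inline HTML string (made up for illustration) so it works offline:

```python
from lxml import html

# Made-up HTML standing in for a downloaded page
page = """
<ul>
  <li class="item"><a href="/a">First</a></li>
  <li class="item"><a href="/b">Second</a></li>
  <li class="other"><a href="/c">Third</a></li>
</ul>
"""

tree = html.fromstring(page)

# @href selects the attribute value instead of the element text
all_links = tree.xpath("//a/@href")

# A predicate in square brackets filters by attribute
item_texts = tree.xpath('//li[@class="item"]/a/text()')

print(all_links)    # ['/a', '/b', '/c']
print(item_texts)   # ['First', 'Second']
```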

    • urllib
  • built-in library providing functions for working with URLs

  • allows you to interact with web pages by fetching URLs, opening and reading data from them, and performing other URL-related tasks like encoding and parsing

  • urllib.request for opening and reading.

  • urllib.parse for parsing URLs

  • urllib.error for the exceptions raised

  • urllib.robotparser for parsing robots.txt files

        import urllib.request

        # URL of the web page to fetch
        url = 'https://www.example.com/'

        try:
            response = urllib.request.urlopen(url)
            data = response.read()

            # Decode the data (if it's in bytes) to a string
            html_content = data.decode('utf-8')

            # Print the HTML content of the web page
            print(html_content)

        except Exception as e:
            print("Error fetching URL:", e)
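
The other submodules can be tried without any network access. Here is a short sketch of `urllib.parse` and `urllib.robotparser`; the URLs and the robots.txt rules are made-up examples:

```python
from urllib.parse import urlparse, urlencode, urljoin
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into parts and build query strings
parts = urlparse("https://www.example.com/search?q=python&page=2")
print(parts.netloc)   # www.example.com
print(parts.path)     # /search
print(parts.query)    # q=python&page=2

query = urlencode({"q": "web scraping", "page": 1})
print(query)          # q=web+scraping&page=1

print(urljoin("https://www.example.com/docs/", "intro.html"))
# https://www.example.com/docs/intro.html

# urllib.robotparser: check whether a path may be crawled.
# The rules are fed in directly here; a real script would instead point
# the parser at the site's robots.txt with set_url(...) and call read().
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed_public = robots.can_fetch("*", "https://www.example.com/index.html")
allowed_private = robots.can_fetch("*", "https://www.example.com/private/data")
print(allowed_public, allowed_private)   # True False
```
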
  • schedule → lets you run scraping tasks repeatedly at fixed intervals

    • simple library that allows you to schedule Python functions to run at specified intervals
        import schedule 
        import time 

        def func(): 
            print("Geeksforgeeks") 

        schedule.every(1).minutes.do(func) 

        while True: 
            schedule.run_pending() 
            time.sleep(1)
  • schedule.every().minutes.do(): Schedules your function.

  • run_pending(): Checks if any job is due.

  • time.sleep(): Prevents the loop from hogging CPU.

    • pyautogui → automates the mouse and keyboard; useful when dealing with UI-based interaction
  • Simulates mouse and keyboard actions. It’s useful if elements aren’t reachable via Selenium, like special pop-ups or custom scrollbars.

        import pyautogui

        # moves to (519,1060) in 1 sec
        pyautogui.moveTo(519, 1060, duration = 1)

        # simulates a click at the present mouse position 
        pyautogui.click()

        pyautogui.moveTo(1717, 352, duration = 1) 

        pyautogui.click()
