Basics about pandas in Python

Prakhar

1. Data Structures:

  • Series:

    A Series is a one-dimensional labeled array that can hold any data type, much like a single column in a spreadsheet. In this example, we create a Series of numeric values that includes a NaN (Not a Number) value. Because NaN is a float, the whole Series is stored as float64.

    • Code:

        import pandas as pd
        import numpy as np
      
        # Creating a Series
        s = pd.Series([1, 3, 5, np.nan, 6, 8])
      
        # Displaying the Series
        print(s)
      
    • Output:

        0    1.0
        1    3.0
        2    5.0
        3    NaN
        4    6.0
        5    8.0
        dtype: float64
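
To make the "labeled array" idea a bit more concrete, here is a minimal sketch that reuses the same values but attaches string labels to them. The labels "a" to "f" and the variable names n_missing and total are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as above, with hypothetical string labels instead of 0..5
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

# The NaN forces the integers to be stored as floats
print(s.dtype)   # float64

# Label-based access to a single element
print(s["c"])    # 5.0

# isna() flags missing entries; summing the flags counts them
n_missing = int(s.isna().sum())
print(n_missing)  # 1

# Reductions such as sum() skip NaN by default
total = s.sum()
print(total)      # 23.0
```

Note that the missing value does not break the sum: pandas ignores NaN in most reductions unless you pass skipna=False.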
      
  • DataFrame:

    • A DataFrame is a two-dimensional table with labeled columns. In this example, we create a DataFrame with various data types and structures, including a timestamp, categorical data, and a constant value.

    • Code:

        import pandas as pd
        import numpy as np
      
        # Creating a DataFrame
        df = pd.DataFrame({
            'A': 1.0,
            'B': pd.Timestamp('20130102'),
            'C': pd.Series(1, index=list(range(4)), dtype='float32'),
            'D': np.array([3] * 4, dtype='int32'),
            'E': pd.Categorical(["test", "train", "test", "train"]),
            'F': 'foo'
        })
      
        # Displaying the DataFrame
        print(df)
      
    • Output:

             A          B    C  D      E    F
        0  1.0 2013-01-02  1.0  3   test  foo
        1  1.0 2013-01-02  1.0  3  train  foo
        2  1.0 2013-01-02  1.0  3   test  foo
        3  1.0 2013-01-02  1.0  3  train  foo
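
Since the point of this DataFrame is that each column has its own type, it can help to inspect the types directly. The sketch below rebuilds the same frame and checks its shape and dtypes. The exact datetime unit reported for 'B' and the dtype shown for the string column 'F' can differ between pandas versions, so no output is asserted for those.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# The scalars for 'A', 'B' and 'F' are broadcast to all four rows
print(df.shape)   # (4, 6)

# Each column keeps its own dtype
print(df.dtypes)
```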
      

2. Basic Operations:

  • Head and Tail:

    • The head() method displays the first n rows of a DataFrame (5 by default), and tail() does the same for the last n rows. In this case, we print the first 2 rows.

    • Code:

        # Displaying the first 2 rows
        print(df.head(2))
      
    • Output:

             A          B    C  D      E    F
        0  1.0 2013-01-02  1.0  3   test  foo
        1  1.0 2013-01-02  1.0  3  train  foo
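
The counterpart of head() is tail(). Here is a small sketch on a stand-in frame (the name demo and its single column "x" are hypothetical) that shows the default row count and how tail() keeps the original index labels.

```python
import pandas as pd

# A small stand-in frame with ten rows (hypothetical data)
demo = pd.DataFrame({"x": range(10)})

# With no argument, head() returns the first 5 rows
first_default = demo.head()

# tail(2) returns the last 2 rows and keeps their index labels (8 and 9)
last_two = demo.tail(2)
print(last_two)
```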
      
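  • Descriptive Statistics:

    • The describe() method provides summary statistics for numeric columns: count, mean, std (standard deviation), min, the 25%, 50%, and 75% quartiles, and max. Depending on your pandas version, the datetime column 'B' may also be summarized.

    • Code: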

        # Displaying descriptive statistics
        print(df.describe())
      
    • Output:

               A    C    D
        count  4.0  4.0  4.0
        mean   1.0  1.0  3.0
        std    0.0  0.0  0.0
        min    1.0  1.0  3.0
        25%    1.0  1.0  3.0
        50%    1.0  1.0  3.0
        75%    1.0  1.0  3.0
        max    1.0  1.0  3.0
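
Because every numeric column in df is constant, the statistics above are not very informative. The sketch below uses a small made-up frame (demo, score and split are hypothetical names) with varying values, and also shows that describe() can summarize categorical columns when asked to via the include parameter.

```python
import pandas as pd

# Hypothetical data with values that actually vary
demo = pd.DataFrame({
    "score": [1.0, 2.0, 3.0, 4.0],
    "split": pd.Categorical(["test", "train", "test", "train"]),
})

# Default: only numeric columns are summarized
numeric_stats = demo.describe()
print(numeric_stats.loc["mean", "score"])  # 2.5
print(numeric_stats.loc["50%", "score"])   # 2.5 (the median)

# Categorical columns get count / unique / top / freq instead
cat_stats = demo.describe(include=["category"])
print(cat_stats.loc["unique", "split"])    # 2
```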
      

3. Data Manipulation:

  • Selection:

    • Columns can be selected using square bracket indexing (df['A']), and rows can be selected using loc with specific indices (df.loc[[0, 2]]).

    • Code:

        # Selecting column 'A'
        print(df['A'])
      
        # Selecting rows 0 and 2
        print(df.loc[[0, 2]])
      
    • Output:

        0    1.0
        1    1.0
        2    1.0
        3    1.0
        Name: A, dtype: float64
      
             A          B    C  D     E    F
        0  1.0 2013-01-02  1.0  3  test  foo
        2  1.0 2013-01-02  1.0  3  test  foo
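
Besides loc, pandas also offers iloc for position-based selection. The difference is easiest to see with a non-numeric index, so this sketch uses a made-up frame with row labels "r1" to "r3" (all names here are hypothetical).

```python
import pandas as pd

# Hypothetical frame with string row labels
demo = pd.DataFrame(
    {"A": [10, 20, 30], "B": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)

# loc is label-based; a label slice includes both endpoints
by_label = demo.loc["r1":"r2", "A"]

# iloc is position-based; the end of the slice is excluded
by_position = demo.iloc[0:2, 0]

# loc with a row label and a column label returns a single cell
cell = demo.loc["r3", "B"]

print(by_label.tolist())     # [10, 20]
print(by_position.tolist())  # [10, 20]
print(cell)                  # z
```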
      
  • Filtering:

    • Data can be filtered based on conditions. In this example, we select rows where the 'B' column is greater than a specified date.

    • Code:

        # Filtering rows where 'B' is greater than a certain date
        print(df[df['B'] > '2013-01-01'])
      
    • Output:

             A          B    C  D      E    F
        0  1.0 2013-01-02  1.0  3   test  foo
        1  1.0 2013-01-02  1.0  3  train  foo
        2  1.0 2013-01-02  1.0  3   test  foo
        3  1.0 2013-01-02  1.0  3  train  foo
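
In df every row passes the filter, so here is a sketch on made-up data (demo, date, split and value are hypothetical names) where the condition actually removes rows. It also shows how to combine two conditions: each one needs its own parentheses, and they are joined with & (and) or | (or) rather than the Python keywords.

```python
import pandas as pd

# Hypothetical data with varying dates
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"]
    ),
    "split": ["test", "train", "test", "train"],
    "value": [1, 2, 3, 4],
})

# Rows dated after 2013-01-01 AND belonging to the "test" split
mask = (demo["date"] > pd.Timestamp("2013-01-01")) & (demo["split"] == "test")
filtered = demo[mask]
print(filtered)

# isin() keeps rows whose value appears in a given list
subset = demo[demo["value"].isin([1, 4])]
print(subset)
```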
      
  • Grouping:

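    • The groupby() method is used to group data based on a column ('E' in this case). The mean() function is then applied to each group. We pass numeric_only=True because recent pandas versions raise an error when asked to average a text column such as 'F', and observed=True so that only categories that actually appear in 'E' are used.

    • Code:

        # Grouping by column 'E' and calculating the mean of each numeric column
        print(df.groupby('E', observed=True).mean(numeric_only=True))
      
    • Output:

                 A    C    D
        E
        test   1.0  1.0  3.0
        train  1.0  1.0  3.0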
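
Grouping becomes more interesting when the groups differ. The following sketch, on made-up data (demo, split and value are hypothetical names), selects one column after groupby() and computes several aggregates at once with agg().

```python
import pandas as pd

# Hypothetical data where the two groups have different values
demo = pd.DataFrame({
    "split": ["test", "train", "test", "train"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# One row per group, one column per aggregate
summary = demo.groupby("split")["value"].agg(["mean", "sum", "count"])
print(summary)

print(summary.loc["test", "mean"])   # 2.0
print(summary.loc["train", "sum"])   # 6.0
```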
      

4. Data Cleaning:

  • Handling Missing Data:

    • The dropna() method removes rows that contain any NaN values, which is a simple way to handle missing or incomplete data. Our df has no missing values, so the output is identical to the original DataFrame.

    • Code:

        # Dropping rows with any NaN values
        print(df.dropna())
      
    • Output:

             A          B    C  D      E    F
        0  1.0 2013-01-02  1.0  3   test  foo
        1  1.0 2013-01-02  1.0  3  train  foo
        2  1.0 2013-01-02  1.0  3   test  foo
        3  1.0 2013-01-02  1.0  3  train  foo
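
To see dropna() actually remove something, here is a sketch on a small made-up frame that does contain NaN values (demo and its columns a, b, c are hypothetical). It also shows two common variations: restricting the check to certain columns with subset, and dropping columns instead of rows with axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Drop every row that has at least one NaN: only row 2 survives
rows_any = demo.dropna()

# Only look at column 'a' when deciding: rows 0 and 2 survive
rows_subset = demo.dropna(subset=["a"])

# Drop columns that contain NaN instead of rows: only 'c' survives
cols_any = demo.dropna(axis=1)

print(rows_any)
print(rows_subset)
print(cols_any)
```

dropna() returns a new DataFrame, so demo itself still has all three rows afterwards.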
      
  • Filling Missing Data:

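    • The fillna() method replaces NaN values with a specified value. In this example, we fill NaN values with 0. Since df contains no NaN values, the output is again unchanged. Note that filling a categorical column such as 'E' with a value outside its categories can raise a TypeError, so for mixed-type frames a per-column dict such as df.fillna({'A': 0}) is the safer choice.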

    • Code:

        # Filling NaN values with 0
        print(df.fillna(0))
      
    • Output:

             A          B    C  D      E    F
        0  1.0 2013-01-02  1.0  3   test  foo
        1  1.0 2013-01-02  1.0  3  train  foo
        2  1.0 2013-01-02  1.0  3   test  foo
        3  1.0 2013-01-02  1.0  3  train  foo
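
As with dropna(), the effect of fillna() only shows up on data that has gaps. This sketch uses a made-up frame (demo, a and b are hypothetical names) and compares a single fill value with a per-column dict, where column 'a' is filled with its own mean and column 'b' with -1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

# One value for every NaN in the frame
filled_zero = demo.fillna(0)

# A different fill value per column, passed as a dict
filled_per_col = demo.fillna({"a": demo["a"].mean(), "b": -1})

print(filled_zero)
print(filled_per_col)
```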
      

5. File I/O:

  • Reading and Writing Data:

    • to_csv() writes the DataFrame to a CSV file, and read_csv() reads data from a CSV file into a new DataFrame. Similar functions exist for other file formats.

    • Code:

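        # Writing the DataFrame to a CSV file
        df.to_csv('output.csv')
        # Reading the CSV file back into a new DataFrame
        df2 = pd.read_csv('output.csv', index_col=0)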
      
    • Output (output.csv):

        ,A,B,C,D,E,F
        0,1.0,2013-01-02,1.0,3,test,foo
        1,1.0,2013-01-02,1.0,3,train,foo
        2,1.0,2013-01-02,1.0,3,test,foo
        3,1.0,2013-01-02,1.0,3,train,foo
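
One thing to watch for is that CSV is plain text, so type information such as datetimes is not preserved automatically. The sketch below does a round trip through an in-memory buffer (io.StringIO) instead of a file, so it leaves nothing on disk. The frame demo and its column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data with a datetime column
demo = pd.DataFrame({
    "when": pd.to_datetime(["2013-01-02", "2013-01-03"]),
    "label": ["test", "train"],
})

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# parse_dates tells read_csv to convert 'when' back to datetimes
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=["when"])
print(restored.dtypes)
```

Without parse_dates, the 'when' column would come back as plain strings.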
      