AI-Based Data Transformation: A Comparison of LLM-Generated PySpark Code (Using Mistral & Google Gemini Advanced)
Mistral Le Chat
Mistral's Le Chat did not support CSV file uploads at the time, so I could not get it to generate transformation code.
I could have built a lightweight ingestion process using the Mistral API and LangChain, but that was out of scope for this blog post.
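For the curious, here is a minimal sketch of what such an ingestion process might look like. This is purely illustrative: the file name, model name, and prompt text are assumptions on my part, and it relies on the pandas and langchain-mistralai packages (plus a Mistral API key in the environment).

import pandas as pd
from langchain_mistralai import ChatMistralAI

# Read only a small sample of the CSV so the prompt stays compact (file name is a placeholder)
sample = pd.read_csv("2d_Agg_AS_Offers_ECRSM-01-APR-24.csv", nrows=5)

# Model name is illustrative; requires a MISTRAL_API_KEY environment variable
llm = ChatMistralAI(model="mistral-large-latest")

prompt = (
    "Here are the first rows of an ERCOT 2-Day Ancillary Services CSV:\n"
    f"{sample.to_csv(index=False)}\n"
    "Write PySpark code to ingest and transform this file."
)

# Send the sample plus instructions to the model and print the generated code
response = llm.invoke(prompt)
print(response.content)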
Google Gemini Advanced
Next, we move on to Google's large language models. Google's Gemini didn't allow CSV file uploads either, so I had to upgrade my Google Gemini account to an Advanced subscription. Good thing they gave me a one-month free trial to test this out; otherwise, I was looking at spending $19.99 per month.
Preselected Data and Use Case
We used the same three CSV file formats listed below to generate code for processing ERCOT's "2-Day Ancillary Services Reports":
NP3-959-EX: 48-hour Aggregate AS Offers
NP3-960-EX: 48-hour Self-Arranged AS
NP3-961-EX: 48-hour Cleared DAM AS
The task is to ingest and transform this data using Azure Synapse Analytics and PySpark to derive meaningful insights.
Prompt Used
We have a folder with ERCOT's "2-Day Ancillary Services Reports." This report contains all 48-hour disclosure data related to DAM. The following individual files are included in the report: NP3-959-EX 48-hour Aggregate AS Offers; NP3-960-EX 48-hour Self-Arranged AS; NP3-961-EX 48-hour Cleared DAM AS (previously named 48 Hour Ancillary Services Reports).
We gathered daily data as a zipped file containing multiple CSVs for the month of April. I have added all the zipped files to a master folder, so now there is a master folder with 30 days of zipped files. Each zipped file has multiple CSVs with the data listed above.
Can you give me step by step procedure to do the following:
1) Ingest the CSV files within the zipped folders using PySpark
2) Filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS
3) Ingest the files into the SQL database that I recently set up on my Azure.
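For comparison, here is a minimal sketch of the kind of PySpark solution the prompt asks for, roughly the shape of what the other models produced. All paths, table names, and JDBC credentials below are placeholders I made up, and I interpret "filter out" as selecting the Self-Arranged and Cleared DAM report files.

import glob
import zipfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ercot_2day_as").getOrCreate()

# Extract every daily archive in the master folder (paths are placeholders)
for archive in glob.glob("/data/Master_Folder/*.zip"):
    with zipfile.ZipFile(archive) as z:
        z.extractall("/data/extracted")

# Read the two report types of interest, matching on their file-name patterns
self_arranged = spark.read.csv("/data/extracted/*Self_Arranged_AS*.csv", header=True, inferSchema=True)
cleared_dam = spark.read.csv("/data/extracted/*Cleared_DAM_AS*.csv", header=True, inferSchema=True)

# Write each DataFrame to the Azure SQL database over JDBC (credentials are placeholders)
jdbc_url = "jdbc:sqlserver://your_server.database.windows.net:1433;database=your_database"
props = {
    "user": "your_username",
    "password": "your_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
self_arranged.write.jdbc(jdbc_url, "self_arranged_as", mode="append", properties=props)
cleared_dam.write.jdbc(jdbc_url, "cleared_dam_as", mode="append", properties=props)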
Results
Google Gemini Advanced failed to perform a simple data transformation coding task that was easily done by GPT-4, Claude 3.5, and Llama-3-70B.
Here is the result that I received from the prompt:
It seems that the PySpark library is not available. We will use pandas to read the csv files and then convert the dataframes to spark dataframes. We can then filter out the dataframes for 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv and 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv and proceed.
Given that the files 2d_Agg_AS_Offers_ECRSM-01-APR-24.csv, 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv, and 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv are present in the current directory, we will extract and read these directly.
It seems that the PySpark library is not available. We will use pandas to read the csv files and filter out the dataframes for 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv and 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv. Since we cannot use PySpark to ingest the data into a SQL database, we will provide general instructions on how to achieve this using common libraries and approaches.
Given that the files 2d_Agg_AS_Offers_ECRSM-01-APR-24.csv, 2d_Cleared_DAM_AS_ECRSM-01-APR-24.csv, and 2d_Self_Arranged_AS_ECRSM-01-APR-24.csv are present in the current directory, we will extract and read these directly.
There appears to be no folder named 'Master_Folder' in the current directory. Please ensure the folder exists and re-run the code.
To ingest the filtered data into a SQL database (since PySpark is not available), you can use libraries like pandas and sqlalchemy. Here's a general approach:
Establish a Connection:
- Use sqlalchemy to create an engine specifying your database credentials and type (e.g., 'mssql' for Azure SQL).
Define Table Schema (If Not Exists):
- If the table doesn't exist in your database, define its schema (column names and data types) based on your DataFrame.
Ingest Data:
- Use the df.to_sql() method in pandas to write the DataFrame into the SQL table. Specify the table name, engine, and optionally, how to handle existing data (append, replace, etc.).
Example Code Snippet:
from sqlalchemy import create_engine
# ... (previous code to read and filter DataFrames)
# Establish connection (replace placeholders with your credentials)
engine = create_engine('mssql+pyodbc://your_username:your_password@your_server_name/your_database_name?driver=ODBC+Driver+17+for+SQL+Server')
# Ingest data (replace 'your_table_name' with the desired table name)
for df in filtered_dfs:
    df.to_sql('your_table_name', engine, if_exists='append', index=False)
Key Points:
- Replace placeholders with your actual Azure SQL database credentials.
- Install required libraries (sqlalchemy, pyodbc) if not already installed.
- Adjust the if_exists parameter in to_sql() based on your preference (e.g., 'replace' to overwrite existing data).
- Consider batching data insertion for large datasets to optimize performance.
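On that last point, pandas' to_sql() accepts a chunksize argument, and SQLAlchemy's fast_executemany option for the pyodbc driver can speed up bulk inserts. A minimal sketch, again with placeholder credentials and table name:

from sqlalchemy import create_engine

# fast_executemany is a pyodbc-specific optimization for bulk inserts
engine = create_engine(
    "mssql+pyodbc://your_username:your_password@your_server_name/your_database_name"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

for df in filtered_dfs:
    # chunksize controls how many rows are written per batch
    df.to_sql("your_table_name", engine, if_exists="append", index=False, chunksize=1000)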
Conclusion
Google's Gemini Advanced truly disappointed me with its results. While all the other models returned proper instructions with code, albeit with some errors, Google failed to deliver.
I understand that PySpark may simply not be available in Gemini's code-execution environment, and code generation might not be Gemini Advanced's key use case, but Google has been disappointing us with its AI over the past months.
In a nutshell, I am canceling my $19.99-per-month subscription.