price-scraper-crystal
Price Tracking Web Scraper and HTTP Server
This project implements a simple web scraper and HTTP server to track the price history of the "AW SuperFast Roadster" bicycle from the AW Bicycles website. Below is a detailed breakdown of the decisions, architecture, and resources used to complete this assignment.
Overview
The application consists of:
- A scraper that retrieves the price of the product from the AW Bicycles website every 60 seconds.
- A SQLite database to store the product details and scraped prices.
- An HTTP server using the Kemal framework to serve an index page displaying the price history as a line graph and a button to manually trigger a price scrape.
- A set of APIs to access logs, products with their scraped prices, and to manually trigger the scraper.
Architecture Decisions
Web Scraper
Library Choice
- Decision: The scraper uses Crystal's built-in `HTTP::Client` for making HTTP requests and the `XML` module for parsing HTML via XPath.
- Reasoning: While third-party libraries such as Crystagiri and Mechanize.cr are available, I chose the built-in libraries because they provide sufficient functionality and greater flexibility.
- Implementation Details:
  - HTTP requests are handled using Crystal's `HTTP::Client`.
  - HTML parsing and scraping are achieved using Crystal's `XML` module, leveraging XPath queries to extract the price data efficiently.
Price Extraction Explanation
To scrape data effectively, the first step is to identify a selector or pattern in the HTML that remains relatively stable, such as an ID, attribute, or other consistent marker. While exploring the HTML elements for the product, I discovered that the product card includes a `data-asin` attribute. This attribute is a reliable identifier to link the scraped data with our product. Based on this observation, I decided to include `asin` in the table schema as well, where it acts as the unique identifier for the product. A sketch of this extraction approach follows.
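Below is a minimal sketch of the approach using only the standard library. The XPath expression, the ASIN value, and the price element's markup are illustrative assumptions; the real page's structure may differ.

```crystal
require "http/client"
require "xml"

# Fetch the page source (URL taken from this project's SCRAPER_URL example).
response = HTTP::Client.get("https://bush-daisy-tellurium.glitch.me/")
document = XML.parse_html(response.body)

# Locate the product card by its data-asin attribute, then read the price text.
# Both the ASIN value and the price node's class are illustrative assumptions.
if node = document.xpath_node("//*[@data-asin='0123456789']//*[contains(@class, 'price')]")
  price = node.content.gsub(/[^\d.]/, "").to_f
  puts "Extracted price: #{price}"
end
```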
HTTP Server
Framework Choice
- Decision: I chose the Kemal framework over Amber.
- Reasoning: Kemal is a lightweight framework, well-suited to the limited requirements of this project. While Amber provides a more feature-rich environment, it introduces unnecessary complexity for this simple scraper. A minimal setup is sketched below.
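For reference, this is roughly all the boilerplate Kemal needs to serve the index page and start listening; the file path is an assumption.

```crystal
require "kemal"

# Serve the index page on the root route (path is an assumption).
get "/" do |env|
  send_file env, "public/index.html"
end

Kemal.run
```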
ORM Integration
- Decision: I chose the Granite ORM to handle database interactions.
- Reasoning: Granite simplifies database operations. Additionally, Granite is used by the Amber framework, which ensures a well-tested and robust integration with Crystal. Using Granite makes type casting and handling relationships between tables straightforward. I declared the relationship between the `products` and `scraped_prices` tables inside the models, but I do not use it yet, because I have not verified whether it suffers from the N+1 query problem. Instead, I wrote my own function that fetches the products and their prices in two queries total: the first call retrieves the products, and the second retrieves all prices for the products returned by the first. I avoided a LEFT JOIN because it would repeat the product columns on every price row; since there are many `scraped_prices` rows per product, the two-call approach avoids that duplicate-data bandwidth cost (a sketch follows the implementation details below). The scraper is also written so it can scrape the prices of several products, not just one.
- Implementation Details:
  - Automatic type casting ensures that database values integrate seamlessly with Crystal's strong typing; because Crystal is a statically typed language, values must be cast to their proper types.
  - For the sake of this assignment, I did not use a database migration module such as `micrate`. For a real product, we could utilize `micrate`.
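For illustration, the two models and the two-query fetch might look roughly like the sketch below. The adapter registration, file layout, and query-builder calls are assumptions based on Granite's documented API, not this project's exact code.

```crystal
require "granite"
require "granite/adapter/sqlite"

# Register the SQLite connection (URL is an assumption).
Granite::Connections << Granite::Adapter::Sqlite.new(name: "sqlite", url: "sqlite3://./db/database.db")

class Product < Granite::Base
  connection sqlite
  table products

  column id : Int64, primary: true
  column asin : String
  column title : String
  column image_url : String?
  column description : String?
end

class ScrapedPrice < Granite::Base
  connection sqlite
  table scraped_prices

  column id : Int64, primary: true
  column product_id : Int64
  column price : Float64
  column scraped_at : Time?
end

# The two-call fetch described above: one query for products,
# one query for all prices belonging to those products.
products = Product.all
prices   = ScrapedPrice.where(:product_id, :in, products.compact_map(&.id)).select
```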
Database Design
Table Schema
- Tables:
  - Products: Stores product metadata. The `asin` acts as the unique identifier for the product.
  - Scraped Prices: Tracks price data with timestamps linked to the product.
- Schema:
```sql
CREATE TABLE IF NOT EXISTS products (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  asin TEXT UNIQUE NOT NULL,
  title TEXT NOT NULL,
  image_url TEXT,
  description TEXT
);

CREATE TABLE IF NOT EXISTS scraped_prices (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  product_id INTEGER NOT NULL,
  price REAL NOT NULL,
  scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
  FOREIGN KEY (product_id) REFERENCES products (id)
);
```
- Table Design Explanation:
  - Separation of Concerns: Using two tables enables efficient normalization and minimizes redundancy.
  - Future Expandability: The `products` table allows for potential tracking of additional products without altering the schema.
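For completeness, here is a hedged sketch of how `init_db.cr` might apply this schema at startup; the naive semicolon-based statement splitting and the file paths are assumptions.

```crystal
require "db"
require "sqlite3"

# Open (or create) the SQLite database and apply the schema file.
db = DB.open("sqlite3://#{ENV.fetch("DATABASE_PATH", "/app/db/database.db")}")
File.read("src/db/schema.sql").split(";").each do |statement|
  db.exec(statement) unless statement.strip.empty?
end
db.close
```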
APIs
/api/logs/latest
- Description: Retrieves the latest 1000 log records in descending order.
- Response Example:
```json
[
  "[2025-01-14 23:48:21 +00:00] [INFO ] ------ Scraper task completed successfully --------",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Price successfully saved on db for AW SuperFast Roadster",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Price successfully extracted for AW SuperFast Roadster is 1272.73",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Source code loaded successfully",
  "[2025-01-14 23:48:19 +00:00] [INFO ] Getting source code of https://bush-daisy-tellurium.glitch.me/",
  "[2025-01-14 23:48:19 +00:00] [INFO ] ----- Starting scheduled scraper task -------"
]
```
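A hedged sketch of how such an endpoint could be wired up in Kemal, assuming the logs are kept in a file (the path and storage format are assumptions):

```crystal
require "kemal"
require "json"

get "/api/logs/latest" do |env|
  env.response.content_type = "application/json"
  log_path = "logs/app.log" # hypothetical path
  lines = File.exists?(log_path) ? File.read_lines(log_path) : [] of String
  lines.last(1000).reverse.to_json # latest 1000 records, newest first
end

Kemal.run
```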
/api/products/all-with-prices
- Description: Retrieves all products with their scraped price history.
- Response Example:
```json
[
  {
    "product": {
      "id": 1,
      "asin": "0123456789",
      "title": "AW SuperFast Roadster",
      "image_url": "https://images.unsplash.com/photo-1485965120184-e220f721d03e?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
      "description": "The AW SuperFast Roadster features a unique carbon/magnesium frame that not only goes fast, but gives you bragging points at your next group ride."
    },
    "prices": [
      {
        "price": 1278.65,
        "scraped_at": "2025-01-14T19:54:30Z"
      },
      {
        "price": 1202.04,
        "scraped_at": "2025-01-14T19:53:16Z"
      },
      {
        "price": 1234.2,
        "scraped_at": "2025-01-14T19:52:18Z"
      },
      {
        "price": 1235.25,
        "scraped_at": "2025-01-14T19:52:12Z"
      },
      {
        "price": 1266.25,
        "scraped_at": "2025-01-14T19:52:05Z"
      }
    ]
  }
]
```
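This response shape can be assembled from the two-query fetch without a JOIN. Below is a standalone sketch of the grouping step; the `record` types are illustrative stand-ins for the Granite models, not the project's code.

```crystal
require "json"

# Illustrative stand-ins for the Granite models.
record Product, id : Int32, asin : String, title : String
record ScrapedPrice, product_id : Int32, price : Float64, scraped_at : Time

products = [Product.new(1, "0123456789", "AW SuperFast Roadster")]
prices   = [ScrapedPrice.new(1, 1278.65, Time.utc), ScrapedPrice.new(1, 1202.04, Time.utc)]

# Group each product's prices in memory; this avoids the duplicated
# product columns a LEFT JOIN would send for every price row.
payload = products.map do |product|
  {
    product: {id: product.id, asin: product.asin, title: product.title},
    prices:  prices.select { |p| p.product_id == product.id }
                   .map { |p| {price: p.price, scraped_at: p.scraped_at} },
  }
end
puts payload.to_json
```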
/api/scraper/trigger
- Description: Triggers the price scraping process in the background.
- Response: A success message indicating the scraper task was initiated.
How to Run the Application
Using Docker Compose
The application uses Docker Compose for an isolated and consistent runtime environment.
- Ensure Docker and Docker Compose are installed on your machine.
- Build and start the application:

```bash
docker-compose up --build -d
```

- Access the application at `http://localhost:3011`.
Environment Variables
The following environment variables are used to configure the application:
- `DATABASE_PATH`: Specifies the path to the SQLite database file inside the container. Default: `/app/db/database.db`.
- `SCRAPER_URL`: The URL of the website to scrape for product prices. Example: `https://bush-daisy-tellurium.glitch.me/`.
- `LOG_ADAPTER`: Specifies the logging output. Options:
  - `console`: Logs are printed to the console.
  - `file`: Logs are saved to a file.
- `LOG_INFO`, `LOG_DEBUG`, `LOG_ERROR`: Enable or disable specific log levels. Acceptable values are `TRUE` or `FALSE`.
- `SCRAPER_INTERVAL`: Sets the interval (in seconds) for the automatic scraper task. Example: `60` for every 60 seconds.
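These variables are read by `config.cr` (described in the files overview below). A minimal sketch of such a module, assuming `ENV.fetch` with the defaults documented above:

```crystal
# Hedged sketch of a config module backed by environment variables.
module Config
  DATABASE_PATH    = ENV.fetch("DATABASE_PATH", "/app/db/database.db")
  SCRAPER_URL      = ENV.fetch("SCRAPER_URL", "https://bush-daisy-tellurium.glitch.me/")
  LOG_ADAPTER      = ENV.fetch("LOG_ADAPTER", "console")
  LOG_INFO         = ENV.fetch("LOG_INFO", "TRUE") == "TRUE"
  SCRAPER_INTERVAL = ENV.fetch("SCRAPER_INTERVAL", "60").to_i
end
```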
Volumes
- The database, source code, and public files are mapped as volumes to persist changes and facilitate development.
Resources Used to Learn Crystal
- Official Crystal Documentation: I learned the language syntax and core APIs such as variables, `HTTP::Client`, `Array`, `Hash`, and `XML`.
- Kemal Documentation: I learned how to set up the server, serve static files, and declare routes.
- Granite ORM Documentation: For understanding the models, custom views, and interaction with the models.
Deployment Strategy
While this application is designed for local use, deployment to a server can be achieved as follows:
- Server Setup:
- Use a Linux-based server with Docker and Docker Compose installed.
- Clone the repository and configure environment variables.
- Reverse Proxy:
- Use Nginx or Apache to reverse proxy requests to the application.
- Process Management:
- Use Docker Compose to manage the application lifecycle.
- Scaling:
- For higher loads, consider orchestrating with Kubernetes.
High-Level Overview of Files
The project is structured following an MVC-like approach, focusing on a clean separation of concerns. Below is an overview of the key folders and files.
Server Files Overview
src/
The root folder contains all the core components of the application.
- `src/config/`
  - Contains application configuration settings.
  - Files:
    - `config.cr`: Module that retrieves application variables from environment variables with default fallback values.
- `src/controllers/`
  - Contains controllers that handle the logic for different parts of the application.
  - Files:
    - `logs_controller.cr`: Handles API requests related to fetching logs.
    - `products_controller.cr`: Handles API requests related to products and their prices.
    - `scraper_controller.cr`: Handles API requests for triggering the scraper.
- `src/db/`
  - Contains all database-related components.
  - Subfolders and Files:
    - `models/`: Defines the database models for interacting with the database.
      - `product.cr`: Model for the `products` table.
      - `scraped_price.cr`: Model for the `scraped_prices` table.
    - `init_db.cr`: Initializes the SQLite database connection.
    - `product_price_db.cr`: Contains high-level functions to interact with the models, including retrieving and adding data.
    - `schema.sql`: SQL file defining the database schema and initial data.
- `src/debug/`
  - Contains logging-related components.
  - Files:
    - `logger.cr`: Provides a logging utility to process and format application logs.
- `src/request/`
  - Contains helper modules for handling HTTP responses.
  - Files:
    - `response.cr`: Module to construct and format JSON responses with HTTP status codes such as `200`, `400`, or `500`.
- `src/routes/`
  - Contains the routing definitions for the application, mapping API routes to their respective controllers.
  - Files:
    - `logs_router.cr`: Defines routes for log-related APIs.
    - `products_router.cr`: Defines routes for product-related APIs.
    - `scraper_router.cr`: Defines routes for scraper-related APIs.
- `src/scraper/`
  - Contains all components related to the scraping functionality.
  - Subfolders and Files:
    - `http_client/`
      - `http_client.cr`: Module for making HTTP requests to fetch the source code of web pages.
    - `scheduler/`
      - `task_scheduler.cr`: Module (`ScraperTaskScheduler`) that registers and manages the periodic execution of scrapers. Includes a `start` function to initialize tasks (see the sketch at the end of this Server Files Overview).
    - `services/`
      - `bike_price_scraper.cr`: Module for scraping prices from external websites. It includes:
        - `scrape_prices`: Function for fetching and parsing product prices using `http_client.cr` and XML parsing via XPath.
        - `schedule_scraper`: Function to enable periodic price scraping based on a configurable interval.
- `src/server.cr`
  - The entry point of the application.
  - Responsibilities:
    - Initializes the Kemal server and registers all routes (e.g., `logs_router.cr`, `products_router.cr`, `scraper_router.cr`).
    - Calls `ScraperTaskScheduler.start()` to initialize the scraper's scheduled tasks.
    - Serves the index view on the root (`/`) route.
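As referenced in the `task_scheduler.cr` entry above, here is a hedged sketch of how a periodic scraper task can be driven by a fiber. The stub module, error handling, and default interval are illustrative assumptions around the documented `start` and `scrape_prices` names.

```crystal
# Stand-in for the real scraper module (src/scraper/services/bike_price_scraper.cr).
module BikePriceScraper
  def self.scrape_prices
    puts "scraping..."
  end
end

module ScraperTaskScheduler
  # Run the scraper periodically in a background fiber.
  def self.start(interval : Time::Span = 60.seconds)
    spawn do
      loop do
        begin
          BikePriceScraper.scrape_prices
        rescue ex
          puts "Scraper task failed: #{ex.message}"
        end
        sleep interval
      end
    end
  end
end

ScraperTaskScheduler.start(60.seconds)
sleep # keep the main fiber alive so the background task can run
```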
Public Files Overview
- `public/`: Folder containing all the static files for rendering our index page.