price-scraper-crystal
Price Tracking Web Scraper and HTTP Server
This project implements a simple web scraper and HTTP server to track the price history of the "AW SuperFast Roadster" bicycle from the AW Bicycles website. Below is a detailed breakdown of the decisions, architecture, and resources used to complete this assignment.
Overview
The application consists of:
- A scraper that retrieves the price of the product from the AW Bicycles website every 60 seconds.
- A SQLite database to store the product details and scraped prices.
- An HTTP server using the Kemal framework to serve an index page displaying the price history as a line graph and a button to manually trigger a price scrape.
- A set of APIs for accessing logs, fetching products with their scraped prices, and manually triggering the scraper.

Architecture Decisions
Web Scraper
Library Choice
- Decision: The scraper uses Crystal's built-in `HTTP::Client` for making HTTP requests and the `XML` module for parsing HTML via XPath.
- Reasoning: While third-party libraries such as Crystagiri and Mechanize.cr are available, I chose the built-in libraries because they provide sufficient functionality and greater flexibility.
- Implementation Details:
  - HTTP requests are handled using Crystal's `HTTP::Client`.
  - HTML parsing and scraping are achieved with Crystal's `XML` module, leveraging XPath queries to extract the price data efficiently.
Price Extraction Explanation
To scrape data effectively, the first step is to identify a selector or pattern in the HTML that remains relatively stable, such as an ID, attribute, or other consistent marker. While exploring the HTML elements for the product, I discovered that the product card includes a `data-asin` attribute. This attribute is a reliable identifier for linking the scraped data to our product. Based on this observation, I decided to include `asin` in the table schema as well, where it acts as the unique identifier for the product.
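For illustration, here is a minimal sketch of that extraction. The `span` element with class `price` is a hypothetical stand-in for the real markup; the actual XPath expressions live in `src/scraper/services/bike_price_scraper.cr`.

```crystal
require "http/client"
require "xml"

# Fetch the page source (URL taken from the SCRAPER_URL example below).
html = HTTP::Client.get("https://bush-daisy-tellurium.glitch.me/").body
doc = XML.parse_html(html)

# Locate the product card via its stable data-asin attribute.
if card = doc.xpath_node("//div[@data-asin]")
  asin = card["data-asin"]?
  # Assumed markup: a span with class "price" holding the price text.
  if price_node = card.xpath_node(%(.//span[@class="price"]))
    price = price_node.content.gsub(/[^\d.]/, "").to_f
    puts "#{asin}: #{price}"
  end
end
```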
HTTP Server
Framework Choice
- Decision: I chose the Kemal framework over Amber.
- Reasoning: Kemal is a lightweight framework, well suited to the limited requirements of this project. While Amber provides a more feature-rich environment, it would introduce unnecessary complexity for this simple scraper.
ORM Integration
- Decision: I chose the Granite ORM to handle database interactions.
- Reasoning: Granite simplifies database operations. Additionally, Granite is used by the Amber framework, which ensures a well-tested and robust integration with Crystal. Granite makes type casting and handling relationships between tables straightforward. I declared the relationship between the `products` and `scraped_prices` tables inside the models, but I am not using it, because I have not yet checked its performance characteristics (for example, whether it produces an N+1 query pattern). Instead, I wrote my own function that fetches products and prices in two queries total: the first call fetches the products, and the second fetches all prices for the products returned by the first. I avoided a LEFT JOIN because it would repeat the product columns on every joined row; since we have many `scraped_prices` rows, the two-query approach avoids that duplicate-data bandwidth cost (see the sketch after this list). I also coded the scraper so it can scrape prices for several products, not just one.
- Implementation Details:
  - Automatic type casting ensures that database values integrate seamlessly with Crystal's strong typing system; because Crystal is a statically typed language, variables must be cast to the proper type.
  - For the sake of this assignment, I did not use a database migration module such as `micrate`. For a real product, we could utilize `micrate`.
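Here is a rough sketch of that two-query fetch. The helper name is hypothetical; the real implementation lives in `src/db/product_price_db.cr`, with `Product` and `ScrapedPrice` being the Granite models from `src/db/models/`.

```crystal
# Assumes the Product and ScrapedPrice Granite models are already defined.
def products_with_prices
  products = Product.all                       # query 1: every product
  ids = products.compact_map(&.id)
  return [] of NamedTuple(product: Product, prices: Array(ScrapedPrice)) if ids.empty?

  placeholders = ids.map { "?" }.join(", ")
  prices = ScrapedPrice.all(                   # query 2: all prices for those products
    "WHERE product_id IN (#{placeholders}) ORDER BY scraped_at DESC", ids)

  # Group in memory instead of LEFT JOINing, so product metadata is not
  # duplicated across the many scraped_prices rows.
  by_product = prices.group_by(&.product_id)
  products.map { |p| {product: p, prices: by_product[p.id]? || [] of ScrapedPrice} }
end
```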
Database Design
Table Schema
- Tables:
  - `products`: Stores product metadata. The `asin` acts as the unique identifier for the product.
  - `scraped_prices`: Tracks price data with timestamps linked to the product.
- Schema:

```sql
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    asin TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    image_url TEXT,
    description TEXT
);

CREATE TABLE IF NOT EXISTS scraped_prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL,
    price REAL NOT NULL,
    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products (id)
);
```
- Table Design Explanation:
  - Separation of Concerns: Using two tables enables efficient normalization and minimizes redundancy.
  - Future Expandability: The `products` table allows for potential tracking of additional products without altering the schema.
APIs
/api/logs/latest
- Description: Retrieves the latest 1000 log records in descending order.
- Response Example:
```json
[
  "[2025-01-14 23:48:21 +00:00] [INFO ] ------ Scraper task completed successfully --------",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Price successfully saved on db for AW SuperFast Roadster",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Price successfully extracted for AW SuperFast Roadster is 1272.73",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Source code loaded successfully",
  "[2025-01-14 23:48:19 +00:00] [INFO ] Getting source code of https://bush-daisy-tellurium.glitch.me/",
  "[2025-01-14 23:48:19 +00:00] [INFO ] ----- Starting scheduled scraper task -------"
]
```
/api/products/all-with-prices
- Description: Retrieves all products with their scraped price history.
- Response Example:
```json
[
  {
    "product": {
      "id": 1,
      "asin": "0123456789",
      "title": "AW SuperFast Roadster",
      "image_url": "https://images.unsplash.com/photo-1485965120184-e220f721d03e?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
      "description": "The AW SuperFast Roadster features a unique carbon/magnesium frame that not only goes fast, but gives you bragging points at your next group ride."
    },
    "prices": [
      {
        "price": 1278.65,
        "scraped_at": "2025-01-14T19:54:30Z"
      },
      {
        "price": 1202.04,
        "scraped_at": "2025-01-14T19:53:16Z"
      },
      {
        "price": 1234.2,
        "scraped_at": "2025-01-14T19:52:18Z"
      },
      {
        "price": 1235.25,
        "scraped_at": "2025-01-14T19:52:12Z"
      },
      {
        "price": 1266.25,
        "scraped_at": "2025-01-14T19:52:05Z"
      }
    ]
  }
]
```
/api/scraper/trigger
- Description: Triggers the price scraping process in the background.
- Response: A success message indicating the scraper task was initiated.
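A minimal sketch of how such a route can look in Kemal, using `spawn` to run the scrape in a background fiber so the request returns immediately. The actual route lives in `src/routes/scraper_router.cr` and `src/controllers/scraper_controller.cr`; `run_scrape` here is a stand-in for the real scraper call.

```crystal
require "kemal"
require "json"

# Stand-in for the real scraper (src/scraper/services/bike_price_scraper.cr).
def run_scrape
  puts "scraping prices..."
end

get "/api/scraper/trigger" do |env|
  spawn { run_scrape } # background fiber: the request does not wait for the scrape
  env.response.content_type = "application/json"
  {message: "Scraper task initiated"}.to_json
end

Kemal.run
```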
How to Run the Application
Using Docker Compose
The application uses Docker Compose for an isolated and consistent runtime environment.
- Ensure Docker and Docker Compose are installed on your machine.
- Build and start the application:
  ```bash
  docker-compose up --build -d
  ```
- Access the application at `http://localhost:3011`.
Environment Variables
The following environment variables are used to configure the application:
- `DATABASE_PATH`: Specifies the path to the SQLite database file inside the container. Default: `/app/db/database.db`.
- `SCRAPER_URL`: The URL of the website to scrape for product prices. Example: `https://bush-daisy-tellurium.glitch.me/`.
- `LOG_ADAPTER`: Specifies the logging output. Options:
  - `console`: Logs are printed to the console.
  - `file`: Logs are saved to a file.
- `LOG_INFO`, `LOG_DEBUG`, `LOG_ERROR`: Enable or disable specific log levels. Acceptable values are `TRUE` or `FALSE`.
- `SCRAPER_INTERVAL`: Sets the interval (in seconds) for the automatic scraper task. Example: `60` for every 60 seconds.
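As a sketch of how these variables can be read with default fallbacks (the real module is `src/config/config.cr`; the exact module layout shown here is an assumption):

```crystal
# Hypothetical excerpt of src/config/config.cr: read each setting from the
# environment, falling back to the documented default.
module Config
  DATABASE_PATH    = ENV.fetch("DATABASE_PATH", "/app/db/database.db")
  SCRAPER_URL      = ENV.fetch("SCRAPER_URL", "https://bush-daisy-tellurium.glitch.me/")
  LOG_ADAPTER      = ENV.fetch("LOG_ADAPTER", "console")
  LOG_INFO         = ENV.fetch("LOG_INFO", "TRUE") == "TRUE"
  SCRAPER_INTERVAL = ENV.fetch("SCRAPER_INTERVAL", "60").to_i
end
```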
Volumes
- The database, source code, and public files are mapped as volumes to persist changes and facilitate development.
Resources Used to Learn Crystal
- Official Crystal Documentation: I learned the language syntax and core APIs such as variables, `HTTP::Client`, `Array`, `Hash`, and `XML`.
- Kemal Documentation: I learned how to set up the server, serve static files, and declare routes.
- Granite ORM Documentation: For understanding models, custom queries, and interaction with the models.
Deployment Strategy
While this application is designed for local use, deployment to a server can be achieved as follows:
- Server Setup:
- Use a Linux-based server with Docker and Docker Compose installed.
- Clone the repository and configure environment variables.
- Reverse Proxy:
- Use Nginx or Apache to reverse proxy requests to the application.
- Process Management:
- Use Docker Compose to manage the application lifecycle.
- Scaling:
- For higher loads, consider orchestrating with Kubernetes.
High-Level Overview of Files
The project is structured following an MVC-like approach, focusing on a clean separation of concerns. Below is an overview of the key folders and files.
Server Files Overview
src/
The root folder contains all the core components of the application.
- `src/config/`: Contains application configuration settings.
  - Files:
    - `config.cr`: Module that retrieves application variables from environment variables with default fallback values.
- `src/controllers/`: Contains controllers that handle the logic for different parts of the application.
  - Files:
    - `logs_controller.cr`: Handles API requests related to fetching logs.
    - `products_controller.cr`: Handles API requests related to products and their prices.
    - `scraper_controller.cr`: Handles API requests for triggering the scraper.
- `src/db/`: Contains all database-related components.
  - Subfolders and Files:
    - `models/`: Defines the database models for interacting with the database.
      - `product.cr`: Model for the `products` table.
      - `scraped_price.cr`: Model for the `scraped_prices` table.
    - `init_db.cr`: Initializes the SQLite database connection.
    - `product_price_db.cr`: Contains high-level functions to interact with the models, including retrieving and adding data.
    - `schema.sql`: SQL file defining the database schema and initial data.
- `src/debug/`: Contains logging-related components.
  - Files:
    - `logger.cr`: Provides a logging utility to process and format application logs.
- `src/request/`: Contains helper modules for handling HTTP responses.
  - Files:
    - `response.cr`: Module to construct and format JSON responses with HTTP status codes such as `200`, `400`, or `500`.
- `src/routes/`: Contains the routing definitions for the application, mapping API routes to their respective controllers.
  - Files:
    - `logs_router.cr`: Defines routes for log-related APIs.
    - `products_router.cr`: Defines routes for product-related APIs.
    - `scraper_router.cr`: Defines routes for scraper-related APIs.
- `src/scraper/`: Contains all components related to the scraping functionality.
  - Subfolders and Files:
    - `http_client/http_client.cr`: Module for making HTTP requests to fetch the source code of web pages.
    - `scheduler/task_scheduler.cr`: Module (`ScraperTaskScheduler`) that registers and manages the periodic execution of scrapers. Includes a `start` function to initialize tasks.
    - `services/bike_price_scraper.cr`: Module for scraping prices from external websites. It includes:
      - `scrape_prices`: Function for fetching and parsing product prices using `http_client.cr` and XML parsing via XPath.
      - `schedule_scraper`: Function to enable periodic price scraping based on a configurable interval.
- `src/server.cr`: The entry point of the application.
  - Responsibilities:
    - Initializes the Kemal server and registers all routes (e.g., `logs_router.cr`, `products_router.cr`, `scraper_router.cr`).
    - Calls `ScraperTaskScheduler.start()` to initialize the scraper's scheduled tasks.
    - Serves the index view on the root (`/`) route.
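To make the scheduler wiring concrete, here is a minimal hypothetical sketch of the periodic loop described above. The real logic lives in `src/scraper/scheduler/task_scheduler.cr`; the `run_scrape` call is a stand-in for the registered scraper tasks.

```crystal
module ScraperTaskScheduler
  # Start a background fiber that re-runs the scrape every `interval`.
  def self.start(interval : Time::Span = 60.seconds)
    spawn do
      loop do
        run_scrape
        sleep interval
      end
    end
  end

  private def self.run_scrape
    puts "scraping prices..." # stand-in for the real scraper call
  end
end

ScraperTaskScheduler.start
sleep # keep the main fiber alive (Kemal.run does this in the real server)
```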
Public Files Overview
`public/`: Folder containing all the static files used to render the index page.