price-scraper-crystal

Price scraper written in Crystal

Price Tracking Web Scraper and HTTP Server

This project implements a simple web scraper and HTTP server to track the price history of the "AW SuperFast Roadster" bicycle from the AW Bicycles website. Below is a detailed breakdown of the decisions, architecture, and resources used to complete this assignment.

Overview

The application consists of:

  • A scraper that retrieves the price of the product from the AW Bicycles website every 60 seconds.
  • A SQLite database to store the product details and scraped prices.
  • An HTTP server using the Kemal framework to serve an index page displaying the price history as a line graph and a button to manually trigger a price scrape.
  • A set of APIs to access logs, products with their scraped prices, and to manually trigger the scraper.

Screenshot


Architecture Decisions

Web Scraper

Library Choice

  • Decision: The scraper uses Crystal's built-in HTTP::Client for making HTTP requests and the XML module for parsing HTML via XPath.
  • Reasoning: While third-party libraries such as Crystagiri and Mechanize.cr are available, I chose the built-in libraries because they provide sufficient functionality and more flexibility.
  • Implementation Details:
    • HTTP requests are handled with Crystal's HTTP::Client.
    • HTML parsing and scraping are done with Crystal's XML module, using XPath queries to extract the price data efficiently.

Price Extraction Explanation

To scrape data effectively, the first step is to identify a selector or pattern in the HTML that remains relatively stable, such as an ID, attribute, or another consistent marker. While exploring the HTML elements for the product, I discovered that the product card includes a data-asin attribute. This attribute is a reliable identifier for linking the scraped data with our product. Based on this observation, I decided to include asin in the table schema as well, where it acts as the unique identifier for the product.
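
Assuming the product card carries the data-asin attribute described above, a minimal sketch of the extraction could look like the following; the URL default, the XPath expressions, and the price element's class are illustrative rather than the project's exact selectors.

require "http/client"
require "xml"

# Fetch the page and pull the price out of the product card identified by its
# data-asin attribute. The URL default and the selectors are illustrative.
url = ENV.fetch("SCRAPER_URL", "https://bush-daisy-tellurium.glitch.me/")

response = HTTP::Client.get(url)
raise "Unexpected status #{response.status_code}" unless response.status_code == 200

document = XML.parse_html(response.body)

if card = document.xpath_node(%(//div[@data-asin="0123456789"]))
  if price_node = card.xpath_node(%(.//*[contains(@class, "price")]))
    price = price_node.content.gsub(/[^\d.]/, "").to_f
    puts "Scraped price: #{price}"
  end
end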

HTTP Server

Framework Choice

  • Decision: I chose the Kemal framework over Amber.
  • Reasoning: Kemal is a lightweight framework, well suited to the limited requirements of this project. While Amber provides a more feature-rich environment, it introduces unnecessary complexity for this simple scraper.
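
To illustrate how little boilerplate Kemal needs, here is a minimal sketch of a server with a page route and a JSON API route; the paths mirror the APIs documented below, while the handler bodies are stand-ins rather than the project's actual code.

require "kemal"
require "json"

get "/" do |env|
  # Kemal serves ./public as static files; the index page with the price
  # graph is sent from there.
  send_file env, "public/index.html"
end

get "/api/logs/latest" do |env|
  env.response.content_type = "application/json"
  ["log lines go here"].to_json
end

Kemal.run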

ORM Integration

  • Decision: I chose the Granite ORM to handle database interactions.
  • Reasoning: Granite simplifies database operations. It is also the ORM used by the Amber framework, which ensures a well-tested and robust integration with Crystal. Granite makes type casting and handling relationships between tables straightforward. I declared the relationship between the products and scraped_prices tables inside the models, but I am not using it yet because I have not verified its query behavior, for example whether it results in an N+1 query. Instead, I wrote my own function that fetches the products and their prices in two queries in total: the first call fetches the products, and the second fetches all prices for the products returned by the first call. I avoided a LEFT JOIN because it would duplicate the product columns on every price row; since the scraped_prices table accumulates many rows, the two-query approach keeps the response payload small. The scraper is also written so that it can track prices for several products, not just one (a sketch follows this list).
  • Implementation Details:
    • Automatic type casting ensures that database values integrate seamlessly with Crystal's strong typing; because Crystal is a statically typed language, values read from the database must be cast to the proper types.

    • For the sake of this assignment, I did not use any database migration module, such as micrate. For a real product, we can utilize micrate.
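
As a rough sketch of the Granite models and the two-query fetch described above, assuming the schema shown in the next section; the connection registration (which varies by Granite version), the database path, and the helper name are illustrative.

require "granite"
require "granite/adapter/sqlite"

# Register the SQLite connection. The path is illustrative; the app reads it
# from DATABASE_PATH via config.cr.
Granite::Connections << Granite::Adapter::Sqlite.new(name: "sqlite", url: "sqlite3://./db/database.db")

class Product < Granite::Base
  connection sqlite
  table products

  column id : Int64, primary: true
  column asin : String
  column title : String
  column image_url : String?
  column description : String?
end

class ScrapedPrice < Granite::Base
  connection sqlite
  table scraped_prices

  column id : Int64, primary: true
  column product_id : Int64
  column price : Float64
  column scraped_at : Time
end

# Two queries instead of a LEFT JOIN: fetch the products, then fetch every
# price belonging to those products, and pair them up in memory. This avoids
# shipping duplicated product columns on each price row.
def products_with_prices
  products = Product.all
  ids = products.compact_map(&.id)
  prices = ScrapedPrice.where(:product_id, :in, ids).select

  products.map do |product|
    {
      product: product,
      prices:  prices.select { |price| price.product_id == product.id },
    }
  end
end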

Database Design

Table Schema

  • Tables:

    1. Products: Stores product metadata. The asin acts as the unique identifier for the product.
    2. Scraped Prices: Tracks price data with timestamps linked to the product.
  • Schema:

CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    asin TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL,
    image_url TEXT,
    description TEXT
);

CREATE TABLE IF NOT EXISTS scraped_prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER NOT NULL,
    price REAL NOT NULL,
    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products (id)
);
  • Table Design Explanation:
    • Separation of Concerns: Using two tables enables efficient normalization and minimizes redundancy.
    • Future Expandability: The products table allows for potential tracking of additional products without altering the schema.

APIs

/api/logs/latest

  • Description: Retrieves the latest 1000 log records in descending order.
  • Response Example:
[
  "[2025-01-14 23:48:21 +00:00] [INFO ] ------ Scraper task completed successfully --------",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Price successfully saved on db for AW SuperFast Roadster",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Price successfully extracted for AW SuperFast Roadster is 1272.73",
  "[2025-01-14 23:48:21 +00:00] [INFO ] Source code loaded successfully",
  "[2025-01-14 23:48:19 +00:00] [INFO ] Getting source code of https://bush-daisy-tellurium.glitch.me/",
  "[2025-01-14 23:48:19 +00:00] [INFO ] ----- Starting scheduled scraper task -------"
]

/api/products/all-with-prices

  • Description: Retrieves all products with their scraped price history.
  • Response Example:
[
  {
    "product": {
      "id": 1,
      "asin": "0123456789",
      "title": "AW SuperFast Roadster",
      "image_url": "https://images.unsplash.com/photo-1485965120184-e220f721d03e?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
      "description": "The AW SuperFast Roadster features a unique carbon/magnesium frame that not only goes fast, but gives you bragging points at your next group ride."
    },
    "prices": [
      {
        "price": 1278.65,
        "scraped_at": "2025-01-14T19:54:30Z"
      },
      {
        "price": 1202.04,
        "scraped_at": "2025-01-14T19:53:16Z"
      },
      {
        "price": 1234.2,
        "scraped_at": "2025-01-14T19:52:18Z"
      },
      {
        "price": 1235.25,
        "scraped_at": "2025-01-14T19:52:12Z"
      },
      {
        "price": 1266.25,
        "scraped_at": "2025-01-14T19:52:05Z"
      }
    ]
  }
]

/api/scraper/trigger

  • Description: Triggers the price scraping process in the background.
  • Response: A success message indicating the scraper task was initiated.
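
A sketch of what the trigger route could look like: spawn runs the scrape in a background fiber so the HTTP response returns immediately. The scrape_prices stub below stands in for the real entry point in src/scraper/services/bike_price_scraper.cr.

require "kemal"
require "json"

# Stand-in for the real scraping logic.
def scrape_prices
  puts "scraping..."
end

get "/api/scraper/trigger" do |env|
  # Run the scrape concurrently; the response does not wait for it.
  spawn { scrape_prices }

  env.response.content_type = "application/json"
  {message: "Scraper task initiated"}.to_json
end

Kemal.run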

How to Run the Application

Using Docker Compose

The application uses Docker Compose for an isolated and consistent runtime environment.

  1. Ensure Docker and Docker Compose are installed on your machine.
  2. Build and start the application:
    docker-compose up --build -d
    
  3. Access the application at http://localhost:3011.

Environment Variables

The following environment variables are used to configure the application:

  • DATABASE_PATH: Specifies the path to the SQLite database file inside the container. Default: /app/db/database.db.
  • SCRAPER_URL: The URL of the website to scrape for product prices. Example: https://bush-daisy-tellurium.glitch.me/.
  • LOG_ADAPTER: Specifies the logging output. Options:
    • console: Logs are printed to the console.
    • file: Logs are saved to a file.
  • LOG_INFO, LOG_DEBUG, LOG_ERROR: Enable or disable specific log levels. Acceptable values are TRUE or FALSE.
  • SCRAPER_INTERVAL: Sets the interval (in seconds) for the automatic scraper task. Example: 60 for every 60 seconds.
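
A minimal sketch of the kind of environment-variable handling config.cr performs, with fallback defaults; the constant names here are illustrative.

module Config
  DATABASE_PATH    = ENV.fetch("DATABASE_PATH", "/app/db/database.db")
  SCRAPER_URL      = ENV.fetch("SCRAPER_URL", "https://bush-daisy-tellurium.glitch.me/")
  LOG_ADAPTER      = ENV.fetch("LOG_ADAPTER", "console")
  LOG_INFO         = ENV.fetch("LOG_INFO", "TRUE") == "TRUE"
  SCRAPER_INTERVAL = ENV.fetch("SCRAPER_INTERVAL", "60").to_i
end

puts "Scraping #{Config::SCRAPER_URL} every #{Config::SCRAPER_INTERVAL} seconds"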

Volumes

  • The database, source code, and public files are mapped as volumes to persist changes and facilitate development.

Resources Used to Learn Crystal


Deployment Strategy

While this application is designed for local use, deployment to a server can be achieved as follows:

  1. Server Setup:
    • Use a Linux-based server with Docker and Docker Compose installed.
    • Clone the repository and configure environment variables.
  2. Reverse Proxy:
    • Use Nginx or Apache to reverse proxy requests to the application.
  3. Process Management:
    • Use Docker Compose to manage the application lifecycle.
  4. Scaling:
    • For higher loads, consider orchestrating with Kubernetes.

High-Level Overview of Files

The project is structured following an MVC-like approach, focusing on a clean separation of concerns. Below is an overview of the key folders and files.

Server Files Overview

src/

The root folder contains all the core components of the application.

  1. src/config/

    • Contains application configuration settings.
    • Files:
      • config.cr: Module that retrieves application variables from environment variables with default fallback values.
  2. src/controllers/

    • Contains controllers that handle the logic for different parts of the application.
    • Files:
      • logs_controller.cr: Handles API requests related to fetching logs.
      • products_controller.cr: Handles API requests related to products and their prices.
      • scraper_controller.cr: Handles API requests for triggering the scraper.
  3. src/db/

    • Contains all database-related components.
    • Subfolders and Files:
      • models/: Defines the database models for interacting with the database.
        • product.cr: Model for the products table.
        • scraped_price.cr: Model for the scraped_prices table.
      • init_db.cr: Initializes the SQLite database connection.
      • product_price_db.cr: Contains high-level functions to interact with the models, including retrieving and adding data.
      • schema.sql: SQL file defining the database schema and initial data.
  4. src/debug/

    • Contains logging-related components.
    • Files:
      • logger.cr: Provides a logging utility to process and format application logs.
  5. src/request/

    • Contains helper modules for handling HTTP responses.
    • Files:
      • response.cr: Module to construct and format JSON responses, such as 200, 400, or 500 HTTP status codes.
  6. src/routes/

    • Contains the routing definitions for the application, mapping API routes to their respective controllers.
    • Files:
      • logs_router.cr: Defines routes for log-related APIs.
      • products_router.cr: Defines routes for product-related APIs.
      • scraper_router.cr: Defines routes for scraper-related APIs.
  7. src/scraper/

    • Contains all components related to the scraping functionality.
    • Subfolders and Files:
      • http_client/
        • http_client.cr: Module for making HTTP requests to fetch the source code of web pages.
      • scheduler/
        • task_scheduler.cr: Module (ScraperTaskScheduler) that registers and manages the periodic execution of scrapers. Includes a start function to initialize tasks; a minimal sketch follows this file overview.
      • services/
        • bike_price_scraper.cr: Module for scraping prices from external websites. It includes:
          • scrape_prices: Function for fetching and parsing product prices using http_client.cr and XML parsing via XPath.
          • schedule_scraper: Function to enable periodic price scraping based on a configurable interval.
  8. src/server.cr

    • The entry point of the application.
    • Responsibilities:
      • Initializes the Kemal server and registers all routes (e.g., logs_router.cr, products_router.cr, scraper_router.cr).
      • Calls ScraperTaskScheduler.start() to initialize the scraper's scheduled tasks.
      • Serves the index view on the root (/) route.
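
As a minimal sketch of the periodic scheduling that ScraperTaskScheduler.start performs: a fiber runs the scraper in a loop and sleeps for the configured interval between runs. The BikePriceScraper stub stands in for the real module.

# Stand-in for src/scraper/services/bike_price_scraper.cr.
module BikePriceScraper
  def self.scrape_prices
    puts "scraping..."
  end
end

module ScraperTaskScheduler
  # Run the scraper in a background fiber every SCRAPER_INTERVAL seconds.
  def self.start(interval = ENV.fetch("SCRAPER_INTERVAL", "60").to_i)
    spawn do
      loop do
        begin
          BikePriceScraper.scrape_prices
        rescue ex
          puts "Scraper task failed: #{ex.message}"
        end
        sleep interval.seconds
      end
    end
  end
end

ScraperTaskScheduler.start
sleep # in the real app, Kemal.run keeps the process alive instead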

Public Files Overview

public: Folder containing all the static files for rendering our index page.
