data store

Disclaimer: I have found it a little tough to put my mental model of a given topic into words for juniors. I can explain it to them, but I end up making things up on the fly when I do. So I’m using an LLM for this now. Most of what you see below is LLM-written. I pestered the LLMs until I liked what I read.

you are a junior. you gotta store some data. for your application, of course.


To build a mental model for deciding how and where to store data, ask a few key questions and map the answers to storage “archetypes.” Here’s a framework:


🌟 Step 1: Ask These Questions

  1. What shape is the data?

    • Structured (e.g., spreadsheets, tables)?
    • Semi-structured (e.g., JSON, logs)?
    • Unstructured (e.g., images, videos)?
  2. How is it accessed?

    • Frequent small reads/writes (e.g., user profiles)?
    • Rarely updated, but heavily queried (e.g., analytics)?
    • Bulk storage with no direct querying (e.g., backups)?
  3. How much data is there?

    • Gigabytes? Terabytes? Petabytes?
  4. What’s the performance need?

    • Milliseconds for queries (e.g., user-facing apps)?
    • Hours for batch processing (e.g., reports)?
  5. Who/what uses it?

    • Humans (via dashboards)?
    • Machines (APIs, apps)?

🌟 Step 2: Map to Storage Archetypes

Think of storage systems as tools for specific jobs:

| Archetype | Best For | Example Tools | Analogy |
|---|---|---|---|
| Relational Database | Structured data needing ACID* transactions, complex queries. | PostgreSQL, MySQL, RDS | A filing cabinet with labeled folders. Strict rules, perfect for precise lookups. |
| Object Store | Unstructured/semi-structured data, cheap bulk storage. | S3, GCS, Azure Blob | A warehouse for boxes. Dump anything, retrieve by barcode (no search). |
| Data Warehouse | Analytics on large structured/semi-structured datasets. | Redshift, Snowflake, BigQuery | A research library. Optimized for big questions (e.g., “Total sales last year?”). |
| NoSQL Database | Semi-structured data, flexible schemas, scale horizontally. | MongoDB, DynamoDB | A flexible shelf. Store varied items (books, tools, toys) and search by tags. |
| File System | Hierarchical data (e.g., directories of files). | NFS, EFS, HDFS | A physical office desk. Folders inside drawers, familiar but limited scalability. |

*ACID = Atomicity, Consistency, Isolation, Durability (critical for transactions like banking).
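To make the "A" in ACID a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names `alice` and `bob`, and the `transfer` helper are all made up for illustration. The only point is that a transfer which fails halfway leaves both balances exactly as they were.

```python
import sqlite3

# In-memory database; the CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds, rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            # If this debit violates the CHECK, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)        # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
final = balances(conn)
print(ok, rejected, final)
```

The second transfer credits bob first and then fails on alice's debit, yet bob's balance does not keep the extra 500. That rollback is what you would have to hand-build yourself on top of a plain object store or file system.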


🌟 Step 3: Decision Flow

  1. Start with the data shape:

    • Structured → Relational DB or Warehouse.
    • Unstructured → Object Store.
    • Semi-structured → NoSQL or Warehouse.
  2. Ask about access patterns:

    • “Need to fetch one row fast?” → Relational DB.
    • “Need to scan millions of rows?” → Warehouse.
    • “Just storing files?” → Object Store.
  3. Consider scale:

    • Small datasets → Almost anything works.
    • Massive datasets → Object Store, Warehouse, or NoSQL.
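The flow above can be sketched as a tiny function. This is only one way to encode it: each of the three steps "votes" for the archetypes it points to, and the ones with the most votes win. The voting scheme, the label strings, the `suggest_storage` name, and the 10,000 GB cutoff for "massive" are all assumptions made for illustration, not rules from any tool.

```python
from collections import Counter

# Step 1: data shape -> candidate archetypes.
BY_SHAPE = {
    "structured": ["relational db", "warehouse"],
    "unstructured": ["object store"],
    "semi-structured": ["nosql", "warehouse"],
}

# Step 2: access pattern -> the archetype it favours.
BY_ACCESS = {
    "point lookup": "relational db",  # "fetch one row fast"
    "large scan": "warehouse",        # "scan millions of rows"
    "file storage": "object store",   # "just storing files"
}

# Step 3: archetypes the flow recommends for massive datasets.
SCALES_OUT = {"object store", "warehouse", "nosql"}
MASSIVE_GB = 10_000  # arbitrary cutoff, purely an assumption


def suggest_storage(shape, access, size_gb):
    """Return the archetypes with the most votes across the three steps."""
    votes = Counter(BY_SHAPE[shape])
    if access in BY_ACCESS:
        votes[BY_ACCESS[access]] += 1
    if size_gb >= MASSIVE_GB:
        votes.update(SCALES_OUT)
    best = max(votes.values())
    return sorted(name for name, count in votes.items() if count == best)


print(suggest_storage("structured", "point lookup", 5))
print(suggest_storage("structured", "large scan", 50_000))
print(suggest_storage("unstructured", "file storage", 200))
print(suggest_storage("semi-structured", "point lookup", 20))
```

The last call returns a three-way tie, which is the honest answer: the flow narrows the options, it does not always pick one for you.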

🌟 Key Principles

  • Hot vs. Cold Data:

    • Hot (frequently accessed) → Fast storage (DBs).
    • Cold (rarely accessed) → Cheap storage (Object Store).
  • Cost vs. Performance:

    • Faster, more structured storage usually costs more per gigabyte.
    • Slower, unstructured storage is usually cheaper.
  • Flexibility:

    • Object Stores are “dumb” but scale almost without limit.
    • Databases/Warehouses are “smart” but constrained by schema.
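The hot vs. cold idea can be sketched as a simple rule on "when was this last touched?". The 30-day window, the `storage_tier` name, and the sample records below are assumptions for illustration. Real systems pick the cutoff from their own access patterns and costs.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # assumed cutoff between hot and cold


def storage_tier(last_accessed, now):
    """Classify a record as 'hot' (keep in a DB) or 'cold' (move to an object store)."""
    return "hot" if now - last_accessed <= HOT_WINDOW else "cold"


now = datetime(2024, 6, 1)
records = {
    "user_profile_42": datetime(2024, 5, 28),  # touched a few days ago
    "invoice_2019_03": datetime(2019, 3, 15),  # untouched for years
}

tiers = {name: storage_tier(ts, now) for name, ts in records.items()}
print(tiers)
```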

“Start with the problem, not the tool.”

never be a hammer looking for a nail to hit, please