data store | Srijan Shukla

Disclaimer: I have found it a lil tough to translate my mental model of a topic into words for juniors. I’m able to explain it to them, but I’m making stuff up on the fly when I do that. So I’m using an LLM to do this now. Most of what you see below is LLM output. I pestered the LLMs until I liked what I read.

you are a junior. you gotta store some data. for your application of course.

To build a mental model for deciding how and where to store data, ask a few key questions and map the answers to storage “archetypes.” Here’s a framework:

🌟 Step 1: Ask These Questions

What shape is the data?
- Structured (e.g., spreadsheets, tables)?
- Semi-structured (e.g., JSON, logs)?
- Unstructured (e.g., images, videos)?
How is it accessed?
- Frequent small reads/writes (e.g., user profiles)?
- Rarely updated, but heavily queried (e.g., analytics)?
- Bulk storage with no direct querying (e.g., backups)?
How much data is there?
- Gigabytes? Terabytes? Petabytes?
What’s the performance need?
- Milliseconds for queries (e.g., user-facing apps)?
- Hours for batch processing (e.g., reports)?
Who/what uses it?
- Humans (via dashboards)?
- Machines (APIs, apps)?
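
One way to make these questions concrete is to write the answers down as a small record before picking any tool. The sketch below is only an illustration: the names `Shape`, `Access`, and `DataProfile` are made up for this post, and the numbers in the example are placeholders, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Shape(Enum):
    STRUCTURED = "structured"            # spreadsheets, tables
    SEMI_STRUCTURED = "semi-structured"  # JSON, logs
    UNSTRUCTURED = "unstructured"        # images, videos


class Access(Enum):
    SMALL_READS_WRITES = "frequent small reads/writes"
    HEAVY_QUERIES = "rarely updated, heavily queried"
    BULK_NO_QUERY = "bulk storage, no direct querying"


@dataclass(frozen=True)
class DataProfile:
    """Answers to the five questions for one dataset (hypothetical helper)."""
    shape: Shape
    access: Access
    size_gb: float         # rough volume estimate
    max_latency_ms: float  # how long a consumer can wait for an answer
    consumer: str          # "humans" or "machines"


# Example: user profiles behind an API (illustrative numbers only).
user_profiles = DataProfile(
    shape=Shape.STRUCTURED,
    access=Access.SMALL_READS_WRITES,
    size_gb=5,
    max_latency_ms=50,
    consumer="machines",
)
```

Filling this in for each dataset forces you to answer every question before a tool name enters the conversation.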

🌟 Step 2: Map to Storage Archetypes

Think of storage systems as tools for specific jobs:

Archetype	Best For	Example Tools	Analogy
Relational Database	Structured data needing ACID* transactions, complex queries.	PostgreSQL, MySQL, RDS	A filing cabinet with labeled folders. Strict rules, perfect for precise lookups.
Object Store	Unstructured/semi-structured data, cheap bulk storage.	S3, GCS, Azure Blob	A warehouse for boxes. Dump anything, retrieve by barcode (no search).
Data Warehouse	Analytics on large structured/semi-structured datasets.	Redshift, Snowflake, BigQuery	A research library. Optimized for big questions (e.g., “Total sales last year?”).
NoSQL Database	Semi-structured data, flexible schemas, horizontal scaling.	MongoDB, DynamoDB	A flexible shelf. Store varied items (books, tools, toys) and search by tags.
File System	Hierarchical data (e.g., directories of files).	NFS, EFS, HDFS	A physical office desk. Folders inside drawers, familiar but limited scalability.

* ACID = Atomicity, Consistency, Isolation, Durability (critical for transactions such as bank transfers).
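
To see what atomicity buys you, here is a minimal sketch of the banking case using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are made up for illustration; a real system would more likely run on PostgreSQL or MySQL, but the idea is the same: either both updates happen or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If alice tries to send more than she has, the credit to bob has already run inside the transaction, but the failed debit rolls the whole thing back, so nobody ends up with money that came from nowhere.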

🌟 Step 3: Decision Flow

Start with the data shape:
- Structured → Relational DB or Warehouse.
- Unstructured → Object Store.
- Semi-structured → NoSQL or Warehouse.
Ask about access patterns:
- “Need to fetch one row fast?” → Relational DB.
- “Need to scan millions of rows?” → Warehouse.
- “Just storing files?” → Object Store.
Consider scale:
- Small datasets → Almost anything works.
- Massive datasets → Object Store, Warehouse, or NoSQL.
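
The flow above can be sketched as a tiny function. Treat it as a thinking aid, not a rule engine: `suggest_archetypes` and `narrow` are hypothetical names, and the way the three steps combine (each step narrows the list, but never down to nothing) is an assumption of this sketch.

```python
BY_SHAPE = {
    "structured": ["Relational DB", "Warehouse"],
    "unstructured": ["Object Store"],
    "semi-structured": ["NoSQL", "Warehouse"],
}
BY_ACCESS = {
    "fetch one row fast": "Relational DB",
    "scan millions of rows": "Warehouse",
    "just storing files": "Object Store",
}
FITS_MASSIVE = {"Object Store", "Warehouse", "NoSQL"}


def narrow(candidates, allowed):
    """Keep only allowed candidates, unless that would leave nothing."""
    kept = [c for c in candidates if c in allowed]
    return kept or candidates


def suggest_archetypes(shape, access=None, massive=False):
    """Start from the data shape, then narrow by access pattern and scale."""
    candidates = list(BY_SHAPE[shape])
    if access is not None:
        candidates = narrow(candidates, {BY_ACCESS[access]})
    if massive:
        candidates = narrow(candidates, FITS_MASSIVE)
    return candidates
```

For example, `suggest_archetypes("structured", "scan millions of rows")` narrows to the warehouse, while `suggest_archetypes("semi-structured", massive=True)` keeps both NoSQL and the warehouse on the table.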

🌟 Key Principles

Hot vs. Cold Data:
- Hot (frequently accessed) → Fast storage (DBs).
- Cold (rarely accessed) → Cheap storage (Object Store).
Cost vs. Performance:
- Faster, more structured storage generally costs more per GB.
- Slower, less structured storage is generally cheaper.
Flexibility:
- Object Stores are “dumb” but scale almost without limit.
- Databases/Warehouses are “smart” but constrained by schema.
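
The hot vs. cold split can be as simple as checking when something was last touched. A minimal sketch, assuming a 30-day window (that number is an arbitrary placeholder, and `storage_tier` is a made-up helper, so tune both to your own workload):

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # assumption: pick a window that fits your workload


def storage_tier(last_accessed, now, hot_window=HOT_WINDOW):
    """Classify data as hot (keep in a DB) or cold (move to an object store)."""
    if now - last_accessed <= hot_window:
        return "hot: fast storage (DB)"
    return "cold: cheap storage (object store)"


now = datetime(2024, 6, 1)
recent = storage_tier(datetime(2024, 5, 25), now)  # accessed a week ago
stale = storage_tier(datetime(2024, 1, 1), now)    # accessed months ago
```

Here `recent` lands in the hot tier and `stale` in the cold one.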

“Start with the problem, not the tool.”

never be a hammer looking for a nail to hammer, please.