Why SQL Matters More Than Any Other Skill
You can be mediocre at Python and still get by. You can be average at Tableau. You can even be weak at statistics for some analyst roles.
But you cannot be bad at SQL.
SQL is the foundation. It's how you get data. And if you can't get the data, nothing else matters.
Good news: SQL isn't hard. It's just specific. Learn these 10 skills and you'll handle 95% of analyst work.
Skill #1: SELECT with WHERE, ORDER BY, and LIMIT
What it does:
Retrieve specific rows from a table, filter them, sort them, and limit results.
When you use it:
Every. Single. Day.
Example:
SELECT
customer_id,
email,
signup_date
FROM customers
WHERE signup_date >= '2026-01-01'
ORDER BY signup_date DESC
LIMIT 100;
What this means:
"Show me the 100 most recent customers who signed up in 2026."
Why it matters:
This is the first query you'll write in any analysis. If you can't do this, you can't do anything.
Practice:
Find a dataset (Kaggle, Mode Analytics). Write 20 SELECT queries with different filters.
Skill #2: Aggregate Functions (COUNT, SUM, AVG, MIN, MAX)
What it does:
Summarize data (count rows, sum values, calculate averages, find min/max).
When you use it:
Whenever you need summary statistics.
Example:
SELECT
COUNT(*) as total_customers,
AVG(purchase_amount) as avg_purchase,
SUM(purchase_amount) as total_revenue
FROM purchases
WHERE purchase_date >= '2026-01-01';
What this means:
"How many purchases happened in 2026? What's the average purchase amount? What's total revenue?"
Why it matters:
Stakeholders don't want raw data. They want summaries: "How many?" "How much?" "What's the average?"
Pro tip:
- COUNT(*) counts all rows
- COUNT(column_name) counts non-null values in that column
- COUNT(DISTINCT column_name) counts unique values
Skill #3: GROUP BY and HAVING
What it does:
Group rows by a category and apply aggregate functions to each group.
When you use it:
When you want totals/averages broken down by category (e.g., sales by region, signups by source).
Example:
SELECT
traffic_source,
COUNT(user_id) as signups,
AVG(first_session_duration) as avg_session_time
FROM users
WHERE signup_date >= '2026-01-01'
GROUP BY traffic_source
HAVING COUNT(user_id) > 100
ORDER BY signups DESC;
What this means:
"Show me signup counts and average session times for each traffic source (Google, Facebook, email), but only sources with 100+ signups. Sort by most signups first."
Why it matters:
Breaking data down by categories is core analyst work. You'll use GROUP BY constantly.
Common mistake:
Forgetting to include non-aggregated columns in GROUP BY.
❌ This breaks:
SELECT product_id, product_name, SUM(quantity)
FROM sales
GROUP BY product_id; -- Missing product_name!
✅ Fix it:
SELECT product_id, product_name, SUM(quantity)
FROM sales
GROUP BY product_id, product_name;
Skill #4: JOINs (INNER, LEFT, RIGHT, FULL)
What it does:
Combine data from multiple tables based on a common key.
When you use it:
Almost always. Real data is spread across multiple tables.
INNER JOIN (most common)
What it does:
Returns only rows that have matches in both tables.
Example:
SELECT
users.user_id,
users.email,
purchases.purchase_amount,
purchases.purchase_date
FROM users
INNER JOIN purchases ON users.user_id = purchases.user_id
WHERE purchases.purchase_date >= '2026-01-01';
What this means:
"Show me all users who made a purchase in 2026, along with their purchase details."
Result: Only users with purchases appear.
LEFT JOIN (second most common)
What it does:
Returns all rows from the left table, plus matching rows from the right table. If no match, NULL.
Example:
SELECT
users.user_id,
users.email,
purchases.purchase_amount
FROM users
LEFT JOIN purchases ON users.user_id = purchases.user_id;
What this means:
"Show me all users, and their purchase amounts if they've made a purchase. If not, show NULL."
Use case: Finding users who haven't made a purchase.
SELECT
users.user_id,
users.email
FROM users
LEFT JOIN purchases ON users.user_id = purchases.user_id
WHERE purchases.user_id IS NULL;
RIGHT JOIN and FULL OUTER JOIN (rare)
RIGHT JOIN: Opposite of LEFT JOIN (rarely used—just flip the tables and use LEFT JOIN)
FULL OUTER JOIN: Returns all rows from both tables, with NULLs where there's no match (almost never used in analytics)
Pro tip: 90% of your JOINs will be INNER or LEFT. Master those first.
Skill #5: Subqueries and CTEs (WITH clause)
What it does:
Break complex queries into smaller, readable pieces.
Subquery (old way)
SELECT
customer_id,
total_spent
FROM (
SELECT
customer_id,
SUM(purchase_amount) as total_spent
FROM purchases
GROUP BY customer_id
) subquery
WHERE total_spent > 1000;
What this means:
"Find customers who've spent more than $1,000."
Problem: Subqueries get messy fast.
CTE (Common Table Expression) - Better Way
WITH customer_totals AS (
SELECT
customer_id,
SUM(purchase_amount) as total_spent
FROM purchases
GROUP BY customer_id
)
SELECT
customer_id,
total_spent
FROM customer_totals
WHERE total_spent > 1000;
Why CTEs are better:
- More readable
- Can reference the same CTE multiple times
- Easier to debug (run just the CTE part first)
Pro tip: Use CTEs for any query with more than 2 steps.
Skill #6: CASE WHEN (Conditional Logic)
What it does:
Apply if-then logic in SQL.
When you use it:
Creating custom categories, handling nulls, complex calculations.
Example:
SELECT
user_id,
total_purchases,
CASE
WHEN total_purchases >= 10 THEN 'Power User'
WHEN total_purchases >= 5 THEN 'Regular User'
ELSE 'Casual User'
END as user_segment
FROM (
SELECT
user_id,
COUNT(purchase_id) as total_purchases
FROM purchases
GROUP BY user_id
) user_counts;
What this means:
"Categorize users as Power, Regular, or Casual based on purchase count."
Another use case: Handling NULLs
SELECT
user_id,
CASE
WHEN email IS NULL THEN 'No Email'
ELSE email
END as email
FROM users;
Skill #7: Window Functions (ROW_NUMBER, RANK, LAG, LEAD)
What it does:
Perform calculations across rows related to the current row (without grouping).
When you use it:
Ranking, running totals, comparing current row to previous/next row, month-over-month growth.
ROW_NUMBER (Assign a unique number to each row)
SELECT
user_id,
purchase_date,
purchase_amount,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) as purchase_number
FROM purchases;
What this means:
"Number each user's purchases chronologically (1st purchase, 2nd purchase, etc.)."
Use case: Finding first purchase.
WITH ranked_purchases AS (
SELECT
user_id,
purchase_amount,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) as purchase_number
FROM purchases
)
SELECT
user_id,
purchase_amount as first_purchase_amount
FROM ranked_purchases
WHERE purchase_number = 1;
RANK and DENSE_RANK
RANK: Assigns ranks, skipping numbers after ties
DENSE_RANK: Assigns ranks without skipping
Example:
SELECT
product_id,
revenue,
RANK() OVER (ORDER BY revenue DESC) as rank
FROM product_sales;
Result:
Product A: $1000 - Rank 1
Product B: $1000 - Rank 1
Product C: $800 - Rank 3 (skipped 2)
LAG and LEAD (Compare to previous/next row)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_month_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) as month_over_month_change
FROM monthly_sales;
What this means:
"Show each month's revenue, previous month's revenue, and the change between them."
Why window functions matter:
They're the difference between junior and mid-level analysts. If you can't do window functions, you'll hit a ceiling.
Skill #8: Date and Time Functions
What it does:
Manipulate dates (extract month, calculate differences, truncate to day/month/year).
When you use it:
Time-series analysis, cohort analysis, anything involving dates.
Common functions:
DATE_TRUNC (group by day, week, month, year)
SELECT
DATE_TRUNC('month', purchase_date) as month,
COUNT(*) as purchases
FROM purchases
GROUP BY month
ORDER BY month;
EXTRACT (get specific part of date)
SELECT
EXTRACT(YEAR FROM purchase_date) as year,
EXTRACT(MONTH FROM purchase_date) as month,
COUNT(*) as purchases
FROM purchases
GROUP BY year, month;
DATE ARITHMETIC
-- Days since signup
SELECT
user_id,
CURRENT_DATE - signup_date as days_since_signup
FROM users;
-- Filter last 30 days
WHERE purchase_date >= CURRENT_DATE - INTERVAL '30 days'
-- Filter last year
WHERE EXTRACT(YEAR FROM purchase_date) = EXTRACT(YEAR FROM CURRENT_DATE) - 1
Pro tip: Date syntax varies by database (PostgreSQL, MySQL, SQL Server). Google "[your database] date functions" when stuck.
Skill #9: String Functions (CONCAT, UPPER, LOWER, SUBSTRING)
What it does:
Manipulate text data.
When you use it:
Cleaning data, creating labels, extracting parts of strings.
Examples:
CONCAT (combine strings)
SELECT
CONCAT(first_name, ' ', last_name) as full_name
FROM users;
UPPER / LOWER (standardize case)
SELECT
email,
LOWER(email) as email_clean
FROM users;
Use case: Deduplication (email addresses should be case-insensitive).
SUBSTRING (extract part of string)
SELECT
email,
SUBSTRING(email FROM POSITION('@' IN email) + 1) as domain
FROM users;
What this means:
"Extract the domain from email addresses (everything after @)."
TRIM (remove whitespace)
SELECT
TRIM(product_name) as product_name_clean
FROM products;
Why it matters:
Dirty data is everywhere. "New York" vs "new york " vs " New York" should all be the same. String functions fix this.
Skill #10: EXISTS and NOT EXISTS (Efficient Filtering)
What it does:
Check if a row exists in another table (more efficient than JOINs for yes/no questions).
When you use it:
Finding users who did X, or users who didn't do Y.
Example: Find users who made a purchase
SELECT
user_id,
email
FROM users
WHERE EXISTS (
SELECT 1
FROM purchases
WHERE purchases.user_id = users.user_id
);
Example: Find users who never made a purchase
SELECT
user_id,
email
FROM users
WHERE NOT EXISTS (
SELECT 1
FROM purchases
WHERE purchases.user_id = users.user_id
);
Why it's better than LEFT JOIN for this:
❌ This works but is slower:
SELECT users.user_id, users.email
FROM users
LEFT JOIN purchases ON users.user_id = purchases.user_id
WHERE purchases.user_id IS NULL;
✅ This is faster:
SELECT user_id, email
FROM users
WHERE NOT EXISTS (
SELECT 1 FROM purchases WHERE purchases.user_id = users.user_id
);
Why it matters:
On large datasets (millions of rows), EXISTS is significantly faster.
Bonus Skill: Query Optimization (Making SQL Fast)
What it does:
Write queries that run quickly even on large datasets.
When you use it:
When your query takes 5 minutes to run and stakeholders are waiting.
Tips:
-
Use indexes: Ensure commonly filtered columns are indexed (usually done by data engineers, but you should know to ask)
-
Filter early: Apply WHERE before JOIN when possible
❌ Slow:
SELECT * FROM users
JOIN purchases ON users.user_id = purchases.user_id
WHERE purchases.purchase_date >= '2026-01-01';
✅ Faster:
SELECT * FROM users
JOIN (
SELECT * FROM purchases WHERE purchase_date >= '2026-01-01'
) p ON users.user_id = p.user_id;
-
Avoid SELECT *: Only select columns you need
-
Use LIMIT during development: Test with LIMIT 100, remove when final
-
Use EXPLAIN ANALYZE: See query execution plan (advanced but useful)
EXPLAIN ANALYZE
SELECT * FROM users WHERE signup_date >= '2026-01-01';
How to Practice These Skills
Week 1-2: Basics (#1-3)
- SQLBolt exercises
- Mode SQL Tutorial (Basic SQL)
Week 3-4: JOINs (#4)
- Mode SQL Tutorial (Intermediate SQL)
- LeetCode SQL (easy problems)
Week 5-6: Advanced (#5-8)
- Mode SQL Tutorial (Advanced SQL)
- LeetCode SQL (medium problems)
- Real dataset from Kaggle
Week 7-8: Expert (#9-10)
- StrataScratch (real interview questions)
- HackerRank SQL
- Build portfolio project
The SQL Cheat Sheet
Keep this handy:
-- Basic retrieval
SELECT col1, col2 FROM table WHERE condition ORDER BY col1 LIMIT 10;
-- Aggregation
SELECT col1, COUNT(*), AVG(col2) FROM table GROUP BY col1 HAVING COUNT(*) > 10;
-- JOIN
SELECT a.*, b.col FROM table_a a INNER JOIN table_b b ON a.id = b.id;
-- LEFT JOIN
SELECT a.*, b.col FROM table_a a LEFT JOIN table_b b ON a.id = b.id;
-- CTE
WITH cte AS (SELECT col1, COUNT(*) as cnt FROM table GROUP BY col1)
SELECT * FROM cte WHERE cnt > 5;
-- CASE WHEN
SELECT col1, CASE WHEN col2 > 10 THEN 'High' ELSE 'Low' END as category FROM table;
-- Window functions
SELECT col1, col2, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2 DESC) as rank FROM table;
-- Date filtering
SELECT * FROM table WHERE date_col >= CURRENT_DATE - INTERVAL '30 days';
-- EXISTS
SELECT * FROM table_a WHERE EXISTS (SELECT 1 FROM table_b WHERE table_b.id = table_a.id);
The Bottom Line
These 10 skills cover 95% of SQL you'll write as a data analyst.
Master them and you'll be able to:
- Answer any business question that can be answered with data
- Pass technical interviews
- Be productive from day one at a new job
You don't need to memorize every function. You need to understand concepts and know what's possible. Google the syntax when you forget.
Start with #1-4 (the basics). Once you can write JOINs in your sleep, move to #5-8 (the advanced stuff that gets you promoted).
Then practice with real data. Theory is fine, but you learn SQL by writing queries.
Ready to put these skills to use? Browse data analyst jobs and see what companies are hiring for.