Preface

So, I’m taking a course from Stanford’s online Lagunita titled Databases, the self-paced series. the Databases course is comprised of 14 “mini-courses.” It is up to the learner which mini-courses to take, and you are welcome to mix and match. However, they do give suggestions on which set of mini-courses to take, depending on what your goals are:

Do you just want to learn the bare basics and start doing SQL? Take their Practical Relational Databases and SQL pathway, which suggests two core mini-courses.
Do you want to just learn about semistructured data? Take the Semistructured Data pathway, which suggests the XML DATA and JSON DATA mini-courses.
Do you want the full database course? Take all of the mini-courses!

Personally, I’m following their Relational Databases and SQL: Foundations and Practice pathway, which includes:

Introduction and Relational Databases (the topic of this post)
Relational Algebra
SQL
Relational Design Theory
Unified Modeling Language (UML)
Indexes and Transactions
Constraints and Triggers
Views and Authorization
On-line Analytical Processing
Recursion in SQL

I’m going to give a bunch of bullet points and notes about what I’ve learned. Here we go!

Introduction and Relational Databases

DBMS stands for Database Management System. They are:
- Massive: On the scale of terabytes of data, daily
- Persistent: Your data continues to exist after the programs have done executing upon it
- Safe: Will survive through soft/hardware issues, won’t cease to exist in power outages, resists malicious users
- Multi-User: Many users can access and change the data concurrently
- Convenient: Physical data independence, accessed by high level languages in a declarative way
- Efficient: Can handle thousands of updates a second. “Performance, performance, performance.”
- Reliable: You better have 99.9999% uptime!

Course Caveats

Database application frameworks and middleware will not be discussed in the course.
Not all data-intensive apps use DBMS (they can use text files, spreadsheets, etc.)

Key Concepts

Data Model: Thought of as a set of records, XML, graph, etc.
Schema vs. Data
- Schema sets up structure of the database: set up in the beginning, doesn’t change much
- Data is the actual data values in the schema: changes rapidly
Data Definition Language (DDL): Used to set up the schema
Data Manipulation Language (DML): After schema is set up, DML is used to query and modify the database

Key People

DBMS Implementer: Builds the system, not the focus of the course
Database Designer: Establishes the schema for the database. Can get very complicated
Database Application Developer: Makes programs that operate on the database
Database Administrator:
- Loads the data, gets the database running, and keeps it running
- Manages tuning parameters
- For large deployments, is the most important person in the entire process

The Relational Model

Popular
Simple
Queried with high-level languages
Efficient

Definitions

A database is defined as a set of relations (or tables)
Each relation has a set of named attributes (or columns)
Each tuple (or row) has a value for each attribute
Each attribute has a domain (or type)
- Integers, strings, etc.
Schema: Structural description of relations in database
Instance: Actual contents at given point in time
NULL: Special value for something unknown or undefined
- Be careful when running queries over NULL values
Key: Attribute of a relation where every value for that attribute is unique
- Ex: Student IDs at a college, Social Security numbers, etc.
- A set of attributes whose combined values are unique is also a key

Querying Relational Databases

Steps
1. Design schema; create using DDL
2. “Bulk load” initial data
3. Execute queries and modifications to the database

Ad-hoc queries in high-level languages
- Ad-hoc: “Can pose queries that you didn’t think of in advance, so you don’t need to write programs for each query beforehand.”
- High-level: “You can write in a fairly compact fashion rather complicated queries and you don’t have to write the algorithms that get the data out of the database.”

Some queries are easy to pose, some are harder.
Some queries are easy to execute efficiently, some are harder.
Not correlated. (Some easy to pose queries are hard to execute efficiently, and vice versa)

Queries return relations
- Closed: Getting back the same type of object that you query
- Compositionality: When you can run a query over the result of a previous query

Query in English: IDs of students with GPA > 3.7 applying to Stanford. How do we translate this into the different query languages?

Query Languages:

Relational Algebra: Formal

\project_{Student.ID} (
  \select_{GPA>3.7 and cName='Stanford'}
  (Student 
  \join Apply))

SQL: Actual/Implemented

SELECT Student.ID
FROM Student, Apply
WHERE Student.ID=Apply.ID
  AND GPA>3.7 
  AND college='Stanford'

My thoughts

Not much to say, except that I’m excited to dig into databases!

Introduction to Databases (Part 1)