Course Outline for Mathematics 88
Introduction to Data Science

Effective: Fall 2025
SLO Rev:

Catalog Description:

MTH 88 - Introduction to Data Science

4.00 Units

Inspired by the plethora of data and abundance of data analysis required in the modern workplace, this course focuses on data-driven inquiry through inferential thinking, applied computational reasoning and real-world problems using sources such as economic data, document collections or geographical data. We will combine statistical inference with computer programming to investigate real-world data sets using such topics as probability distributions, summary statistics, appropriate data displays, confidence intervals, hypothesis tests, debugging, good programming style and social/ethical issues.

Prerequisite: MTH 55 or an appropriate skill level demonstrated through the Mathematics Assessment process

CB03: TOP Code 1701.00 - Mathematics, General

Course Grading: Letter Grade Only

Type	Units	Inside of Class Hours	Outside of Class Hours	Total Student Learning Hours
Lecture	4.00	72.00	144.00	216.00
Laboratory	0.00	18.00	0.00	18.00
Total	4.00	90.00	144.00	234.00

Measurable Objectives:

Upon completion of this course, the student should be able to:

Choose and apply appropriate statistical techniques to analyze and interpret applications based on real world data such as economic data, business, social sciences, sciences, document collections or geographical data;
Distinguish among different scales of measurement and discuss the implications of their use;
Write and reframe research questions that can be answered with the available data;
Connect, distinguish and apply concepts of sampling, census, intended populations and populations;
Identify standard sampling and data-gathering methods and compare their respective advantages and disadvantages;
Distinguish between observational studies and experiments, and draw appropriate conclusions with consideration of causality;
Choose appropriate data displays with informative titles and axis labels and write computer scripts to create them;
Distinguish between sampling distributions and population distributions, and apply the Central Limit Theorem;
Compute measures of central tendency and variation using a programming language; interpret and apply these measures;
Write scripts using randomization to calculate probabilities of various independent or dependent events using marginal, joint and conditional probabilities for both theoretical and empirical probability distributions;
Construct confidence intervals using computer code; interpret confidence intervals in context;
Apply the hypothesis testing framework to appropriate research questions and interpret Type I and II errors in context;
Use appropriate computer code to create theoretical and empirical sampling distributions to find p-values for hypothesis tests; describe and interpret the sampling distribution;
Write computer code to conduct linear regression for estimation and inference; interpret the associated statistics in context;
Generate appropriate computer code and models to compare two random samples and answer questions about the similarities and differences between them;
Make predictions using machine learning techniques such as clustering and linear regression using appropriate computer code;
Write and call expressions and functions using appropriate computer code within efficient and clear algorithms;
Create and manipulate tables using appropriate computer code;
Choose, define and utilize appropriate data types and variables using appropriate computer code;
Create and apply computer code to perform alternation and iteration in context;
Develop and apply techniques to debug computer code;
Identify and discuss social issues related to data collection, maintenance and privacy.

Course Content:

Introduction to data science
1. Why data science?
2. Data privacy
3. Why ethics are important in data science
Causality and Experiments
1. Establishing causality
2. Observational studies and experiments
3. Confounding variables
4. Random assignment (randomization)
Variables
1. Measuring quantities on individuals
2. Types of variables/measurement
3. Creating categorical variables from quantitative variables
Probability
1. Definition of probability
2. Characteristics of probability distribution
3. Definition of random events and random variables
4. Empirical and theoretical probability distributions
Data Visualization
1. Categorical variable distributions
2. Quantitative variable distributions
3. Overlaid graphs and side-by-side graphs
4. Appropriate data displays
5. Informative titles and labels
Introductory programming skills
1. Expressions, defining and calling
2. Names and assignment
3. Creating and manipulating tables
4. Errors and debugging
5. Data types and comparisons
6. Arrays and sequences
7. Alternation and iteration
Programming with tables
1. Sorting by row
2. Selecting rows
3. Data cleaning
4. Manipulating and applying functions to columns
5. Classifying by one variable
6. Cross-classifying and two-way tables
7. Joining tables by column
Simulation
1. Definition
2. Process
3. Examples and applications
4. Simulating simple probabilities
5. Simulating statistics according to models
Sampling and empirical distributions
1. Empirical distributions
2. Sampling from a population
3. Empirical distribution of a statistic (sampling distribution)
4. Random samples using computer code
Testing Hypotheses
1. Create and assess a model for a single category
2. Simulate statistics under the model
3. Models with multiple categories
4. Decisions and uncertainty
5. Error probabilities
6. Writing appropriate conclusions
Comparing two samples
1. A/B testing
2. Predicting the statistic under the null hypothesis
3. Permutation tests
4. Causality
Estimation
1. Percentiles
2. Bootstrap
3. Create confidence intervals
4. Interpret and use confidence intervals
Why the mean matters
1. What exactly does the mean measure?
2. How close to the mean are most of the data?
3. How sample size is related to variability of the sample mean
4. Why are empirical distributions of the sample mean bell-shaped?
5. Effective use of the sample mean for inference
More on the sample mean
1. Variability
2. Standard deviation and the normal distribution
3. Central limit theorem
4. Variability of the sample mean
Prediction
1. Correlation
2. Regression line
3. Diagnostics for regression
4. Residuals and residual plots
5. Inference for regression
6. Prediction intervals
Classification
1. Nearest neighbors
2. Training and testing
3. Rows of tables
4. Implementing the classifier
5. Accuracy of the classifier
6. Multiple regression

Methods of Instruction:

Problem Solving
Case Study
Group Activities
Presentation
Practice/Demonstration
Laboratory exercises
Lectures
Textbook reading assignments
Class and group discussions
Research project
Hands-on Activities
Oral and Written Analysis
Computer-based interactive curriculum
Simulations
Written assignments
Lecture/Discussion

Assignments and Methods of Evaluating Student Progress:

1. Typical Assignments

For each column in the concerts dataset, identify whether the data contained in that column are numerical or categorical.
The World: The change observed in Bangladesh since 1970 can also be observed in many other developing countries: health services improve, life expectancy increases, and child mortality decreases. At the same time, the fertility rate often plummets, and so the population growth rate decreases despite increasing longevity. Run the cell below to generate two overlaid histograms, one for 1962 and one for 2010, that show the distributions of total fertility rates for these two years among all 201 countries in the fertility table. Question 9. Assign fertility_statements to an array of the numbers of each statement below that can be correctly inferred from these histograms. 1. About the same number of countries had a fertility rate between 3.5 and 4.5 in both 1962 and 2010. 2. In 1962, less than 20% of countries had a fertility rate below 3.
Mira, Sofia, and Sara are trying to use Data Science to find the best burritos in San Diego! Their friends Jessica and Sonya provided them with two comprehensive datasets on many burrito establishments in the San Diego area taken from (and cleaned from): https://www.kaggle.com/srcole/burritos-in-san-diego/data The following cell reads in a table called ratings which contains names of burrito restaurants, their Yelp rating, Google rating, as well as their overall rating. The Overall rating is not an average of the Yelp and Google ratings, but rather it is the overall rating of the customers that were surveyed in the study above. It also reads in a table called burritos_types which contains names of burrito restaurants, their menu items, and the cost of the respective menu item at the restaurant. Question 1. It would be easier if we could combine the information in both tables. Assign burritos to the result of joining the two tables together, so that we have a table with the ratings for every corresponding menu item from every restaurant. Each menu item has the same rating as the restaurant it comes from.
Compute an approximate 95% confidence interval for the proportion of Yes voters in California. The code cell below draws your interval as a red bar below the histogram of resample_yes_proportions; use that to verify that your answer looks right.
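
The confidence interval assignment above can be sketched as a percentile bootstrap. This is only an illustration under stated assumptions: the votes sample is made-up data, the function name bootstrap_yes_proportion_ci is hypothetical, and the sketch uses Python's standard library rather than the datascience tables used in the actual assignment. Only the name resample_yes_proportions is taken from the assignment text.

```python
import random


def bootstrap_yes_proportion_ci(sample, num_resamples=2000, level=95, seed=0):
    """Percentile bootstrap interval for the proportion of "Yes" in sample.

    Hypothetical helper for illustration; not part of the course materials.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)

    # Resample with replacement, keeping the original sample size each time,
    # and record the proportion of "Yes" in every resample.
    resample_yes_proportions = []
    for _ in range(num_resamples):
        resample = rng.choices(sample, k=n)
        resample_yes_proportions.append(resample.count("Yes") / n)

    # The middle `level` percent of the sorted resampled proportions
    # gives the approximate confidence interval.
    resample_yes_proportions.sort()
    tail = (100 - level) / 2 / 100
    lower_index = round(tail * num_resamples)
    upper_index = num_resamples - 1 - lower_index
    return resample_yes_proportions[lower_index], resample_yes_proportions[upper_index]


# Made-up sample of 400 voters, 210 of whom answered "Yes" (an assumption).
votes = ["Yes"] * 210 + ["No"] * 190
lower, upper = bootstrap_yes_proportion_ci(votes)
print(f"Approximate 95% confidence interval: ({lower:.3f}, {upper:.3f})")
```

A quick check in the spirit of the assignment: the interval should contain the observed sample proportion (0.525 for the made-up data) and should sit under the bulk of the histogram of resampled proportions.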

2. Methods of Evaluating Student Progress

Quizzes
Written assignments
Exams/Tests
Oral Presentation
Lab Activities
Group Projects
Laboratory exercises
Final Examination or Project

3. Student Learning Outcomes

Upon the completion of this course, the student should be able to:

Analyze mathematical problems critically using logical methodology;
Communicate mathematical ideas, understand definitions, and interpret concepts;
Increase confidence in understanding mathematical concepts, communicating ideas and thinking analytically.

Textbooks (Typical):

OER:

Adhikari, Ani; DeNero, John; Wagner, David (2022). Computational and Inferential Thinking: The Foundations of Data Science (Second edition). Creative Commons. https://inferentialthinking.com/chapters/intro.html

Textbook:

Adhikari, Ani; DeNero, John; Wagner, David (2022). Computational and Inferential Thinking: The Foundations of Data Science (Second edition). Creative Commons.

Additional Materials:

Website: datascience Documentation: https://datascience.readthedocs.io/en/master/?badge=master

Website: Project Jupyter: https://jupyter.org/

Website: Google Colaboratory: https://colab.research.google.com/

Abbreviated Class Schedule Description:

Data-driven inquiry focused on building workplace competency through statistical inference with computer programming. We will investigate real-world data sets using such topics as probability distributions, summary statistics, appropriate data displays, confidence intervals, hypothesis tests, debugging, good programming style and social/ethical issues.

Prerequisite: MTH 55 or an appropriate skill level demonstrated through the Mathematics Assessment process

Discipline:

Mathematics*
