Course Outline for Mathematics 88
Introduction to Data Science

Effective: Fall 2025
SLO Rev:
Catalog Description:

MTH 88 - Introduction to Data Science

4.00 Units

Inspired by the plethora of data and abundance of data analysis required in the modern workplace, this course focuses on data-driven inquiry through inferential thinking, applied computational reasoning and real-world problems using sources such as economic data, document collections or geographical data. We will combine statistical inference with computer programming to investigate real-world data sets using such topics as probability distributions, summary statistics, appropriate data displays, confidence intervals, hypothesis tests, debugging, good programming style and social/ethical issues.
Prerequisite: MTH 55 or an appropriate skill level demonstrated through the Mathematics Assessment process
1701.00 - Mathematics, General
Letter Grade Only
Type        Units   Inside of Class Hours   Outside of Class Hours   Total Student Learning Hours
Lecture     4.00    72.00                   144.00                   216.00
Laboratory  0.00    18.00                   0.00                     18.00
Total       4.00    90.00                   144.00                   234.00
Measurable Objectives:
Upon completion of this course, the student should be able to:
  1. Choose and apply appropriate statistical techniques to analyze and interpret applications based on real world data such as economic data, business, social sciences, sciences, document collections or geographical data;
  2. Distinguish among different scales of measurement and discuss the implications of their use;
  3. Write and reframe research questions that can be answered with the available data;
  4. Connect, distinguish and apply concepts of sampling, census, intended populations and populations;
  5. Identify standard sampling and data-gathering methods and compare their respective advantages and disadvantages;
  6. Distinguish between observational studies and experiments, and draw appropriate conclusions with consideration of causality;
  7. Choose appropriate data displays with informative titles and axis labels and write computer scripts to create them;
  8. Distinguish between sampling distributions and population distributions, and apply the Central Limit Theorem;
  9. Compute measures of central tendency and variation using a programming language; interpret and apply these measures;
  10. Write scripts using randomization to calculate probabilities of various independent or dependent events using marginal, joint and conditional probabilities for both theoretical and empirical probability distributions;
  11. Construct confidence intervals using computer code; interpret confidence intervals in context;
  12. Apply the hypothesis testing framework to appropriate research questions and interpret Type I and II errors in context;
  13. Use appropriate computer code to create theoretical and empirical sampling distributions to find p-values for hypothesis tests; describe and interpret the sampling distribution;
  14. Write computer code to conduct linear regression for estimation and inference; interpret the associated statistics in context;
  15. Generate appropriate computer code and models to compare two random samples and answer questions about the similarities and differences between them;
  16. Make predictions using machine learning techniques such as clustering and linear regression using appropriate computer code;
  17. Write and call expressions and functions using appropriate computer code within efficient and clear algorithms;
  18. Create and manipulate tables using appropriate computer code;
  19. Choose, define and utilize appropriate data types and variables using appropriate computer code;
  20. Create and apply computer code to perform alternation and iteration in context;
  21. Develop and apply techniques to debug computer code;
  22. Identify and discuss social issues related to data collection, maintenance and privacy.
Course Content:
  1. Introduction to data science
    1. Why data science?
    2. Data privacy
    3. Why ethics are important in data science
  2. Causality and Experiments
    1. Establishing causality
    2. Observational studies and experiments
    3. Confounding variables
    4. Random assignment (randomization)
  3. Variables
    1. Measuring quantities on individuals
    2. Types of variables/measurement
    3. Creating categorical variables from quantitative variables
  4. Probability
    1. Definition of probability
    2. Characteristics of probability distributions
    3. Definition of random events and random variables
    4. Empirical and theoretical probability distributions
  5. Data Visualization
    1. Categorical variable distributions
    2. Quantitative variable distributions
    3. Overlaid graphs and side-by-side graphs
    4. Appropriate data displays
    5. Informative titles and labels
  6. Introductory programming skills
    1. Expressions, defining and calling
    2. Names and assignment
    3. Creating and manipulating tables
    4. Errors and debugging
    5. Data types and comparisons
    6. Arrays and sequences
    7. Alternation and iteration
  7. Programming with tables
    1. Sorting by row
    2. Selecting rows
    3. Data cleaning
    4. Manipulating and applying functions to columns
    5. Classifying by one variable
    6. Cross-classifying and two-way tables
    7. Joining tables by column
  8. Simulation
    1. Definition
    2. Process
    3. Examples and applications
    4. Simulating simple probabilities
    5. Simulating statistics according to models
  9. Sampling and empirical distributions
    1. Empirical Distributions
    2. Sampling From a Population
    3. Empirical distribution of a statistic (sampling distribution)
    4. Random samples using computer code
  10. Testing Hypotheses
    1. Create and assess a model for a single category
    2. Simulate statistics under the model
    3. Models with multiple categories
    4. Decisions and uncertainty
    5. Error probabilities
    6. Writing appropriate conclusions
  11. Comparing two samples
    1. A/B testing
    2. Predicting the statistic under the null hypothesis
    3. Permutation tests
    4. Causality
  12. Estimation
    1. Percentiles
    2. Bootstrap
    3. Create confidence intervals
    4. Interpret and use confidence intervals
  13. Why the mean matters
    1. What exactly does the mean measure?
    2. How close to the mean are most of the data?
    3. How sample size is related to variability of the sample mean
    4. Why are empirical distributions of the sample mean bell-shaped?
    5. Effective use of the sample mean for inference
  14. More on the sample mean
    1. Variability
    2. Standard deviation and the normal distribution
    3. Central limit theorem
    4. Variability of the sample mean
  15. Prediction
    1. Correlation
    2. Regression line
    3. Diagnostics for regression
    4. Residuals and residual plots
    5. Inference for regression
    6. Prediction intervals
  16. Classification
    1. Nearest neighbors
    2. Training and testing
    3. Rows of Tables
    4. Implementing the classifier
    5. Accuracy of the classifier
    6. Multiple regression
Methods of Instruction:
  1. Problem Solving
  2. Case Study
  3. Group Activities
  4. Presentation
  5. Practice/Demonstration
  6. Laboratory exercises
  7. Lectures
  8. Textbook reading assignments
  9. Class and group discussions
  10. Research project
  11. Hands-on Activities
  12. Oral and Written Analysis
  13. Computer-based interactive curriculum
  14. Simulations
  15. Written assignments
  16. Lecture/Discussion
Assignments and Methods of Evaluating Student Progress:
  1. For each column in the concerts dataset, identify whether the data contained in that column is numerical or categorical.
  2. The World: The change observed in Bangladesh since 1970 can also be observed in many other developing countries: health services improve, life expectancy increases, and child mortality decreases. At the same time, the fertility rate often plummets, and so the population growth rate decreases despite increasing longevity. Run the cell below to generate two overlaid histograms, one for 1962 and one for 2010, that show the distributions of total fertility rates for these two years among all 201 countries in the fertility table. Question 9. Assign fertility_statements to an array of the numbers of each statement below that can be correctly inferred from these histograms. 1. About the same number of countries had a fertility rate between 3.5 and 4.5 in both 1962 and 2010. 2. In 1962, less than 20% of countries had a fertility rate below 3.
  3. Mira, Sofia, and Sara are trying to use Data Science to find the best burritos in San Diego! Their friends Jessica and Sonya provided them with two comprehensive datasets on many burrito establishments in the San Diego area, taken (and cleaned) from https://www.kaggle.com/srcole/burritos-in-san-diego/data. The following cell reads in a table called ratings, which contains the names of burrito restaurants and their Yelp, Google, and overall ratings. The overall rating is not an average of the Yelp and Google ratings; rather, it is the overall rating given by the customers surveyed in the study above. The cell also reads in a table called burritos_types, which contains the names of burrito restaurants, their menu items, and the cost of each menu item at the restaurant. Question 1. It would be easier if we could combine the information in both tables. Assign burritos to the result of joining the two tables together, so that we have a table with the ratings for every corresponding menu item from every restaurant. Each menu item has the same rating as the restaurant it comes from.
  4. Compute an approximate 95% confidence interval for the proportion of Yes voters in California. The code cell below draws your interval as a red bar below the histogram of resample_yes_proportions; use that to verify that your answer looks right.
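As an illustration of the kind of computation sample assignment 4 asks for, the following is a minimal sketch of a percentile bootstrap confidence interval for a proportion. It uses only the Python standard library rather than the datascience library listed under Textbooks, and the poll counts, the function name bootstrap_proportion_ci, and its parameters are all hypothetical, chosen for illustration only.

```python
import random


def bootstrap_proportion_ci(sample, num_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the proportion of True values.

    `sample` is a list of booleans. All names and defaults here are
    illustrative assumptions, not part of the course materials.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    proportions = []
    for _ in range(num_resamples):
        # Resample n values from the original sample, with replacement.
        resample = rng.choices(sample, k=n)
        proportions.append(sum(resample) / n)
    proportions.sort()
    # Keep the middle `level` share of the resampled proportions.
    tail = (1 - level) / 2
    lower = proportions[round(tail * num_resamples)]
    upper = proportions[round((1 - tail) * num_resamples) - 1]
    return lower, upper


# Hypothetical poll: 210 Yes votes among 400 respondents (made-up numbers).
votes = [True] * 210 + [False] * 190
low, high = bootstrap_proportion_ci(votes)
print(f"Approximate 95% CI for the Yes proportion: ({low:.3f}, {high:.3f})")
```

The interval is read off the sorted resampled proportions, so it should bracket the observed proportion (0.525 in this made-up poll) and narrow as the sample size grows.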
  1. Quizzes
  2. Written assignments
  3. Exams/Tests
  4. Oral Presentation
  5. Lab Activities
  6. Group Projects
  7. Laboratory exercises
  8. Final Examination or Project
Upon the completion of this course, the student should be able to:
  1. Analyze mathematical problems critically using logical methodology;
  2. Communicate mathematical ideas, understand definitions, and interpret concepts;
  3. Increase confidence in understanding mathematical concepts, communicating ideas and thinking analytically.
Textbooks (Typical):
  1. Adhikari, Ani; DeNero, John; Wagner, David (2022). Computational and Inferential Thinking: The Foundations of Data Science (Second edition). Creative Commons. https://inferentialthinking.com/chapters/intro.html

Website: datascience Documentation: https://datascience.readthedocs.io/en/master/?badge=master

Website: Project Jupyter: https://jupyter.org/

Website: Google Colaboratory: https://colab.research.google.com/

Abbreviated Class Schedule Description:
Data-driven inquiry focused on building workplace competency through statistical inference with computer programming. We will investigate real-world data sets using such topics as probability distributions, summary statistics, appropriate data displays, confidence intervals, hypothesis tests, debugging, good programming style and social/ethical issues.
Prerequisite: MTH 55 or an appropriate skill level demonstrated through the Mathematics Assessment process
Discipline:
Mathematics*