The Psychology Research Experience Program (PREP) provides mentoring and experience to undergraduates who have an interest in a scientific psychology career. LUCID partnered with PREP to create a hands-on data science workshop series, facilitated by LUCID & AI + Society graduate students.
The 2023 data science workshops will be held on Wednesdays from 3:30-5:00 pm in WID room 3330.
Overview:
In the workshops, students will be introduced to data-science environments, concepts, and applications. LUCID & AI + Society facilitators will introduce a series of data science concepts via online materials and hands-on sessions, and PREP students will work through examples and demos with guidance from the graduate-student facilitators.
Goals:
For PREP students to gain a sense of 1) how to work with an R or Python integrated development environment, 2) the kinds of things one can do with a range of data-science tools, and 3) how to continue learning about and working with these tools in the future. Note that the goal will not be specifically to teach programming in R, Python, or any other language, but how to work interactively with and adapt notebooks that carry out common data-science tasks, and to get a general sense of what the methods are used for and how they might be applied to one’s own data.
Materials and Session Outlines:
This will be updated with materials and facilitator outlines as they become available.
Schedule:
Date | Facilitator | Topic |
7/5 | Tim Rogers | Intro to Data Science |
7/12 | Kushin Mukherjee | Running Web Experiments |
7/19 | Sid Suresh | How can cognitive scientists use Deep Learning? Using pre-trained models to perform psychology experiments |
7/26 | Sarah Sant’Ana | Cross Validation |
Running Web Experiments: A soup-to-nuts tutorial
PREP 2023
Kushin Mukherjee
This tutorial will walk you through how to design and run behavioral experiments in the web browser. Once you build an experiment you can have participants do your experiment online through MTurk, Prolific, or any other crowdsourcing platform. You can also have participants come into the lab and complete these experiments on computers running the experiment locally.
This tutorial will (time permitting) have 3 parts: (1) Setting up the necessary accounts on OSF, Github, and DataPipe. (2) Designing a simple experiment using jsPsych and hosting it using Github pages. (3) Running yourself through your spiffy new experiment and looking at your data.
Preparation
Before we begin, we will need to create some accounts on some websites. If you plan to stick around and do Psychology research moving forward, you’ll almost definitely need these accounts down the line too. I recommend setting them up using an email you are confident that you’ll have access to forever (so maybe not a university email if it expires when you graduate).
Ideally use the same email for all 3 accounts.
Github
- Make a new account here – https://github.com/
OSF (Open Science Framework)
- Make a new account here – https://osf.io/
DataPipe
- Make a new account here – https://pipe.jspsych.org/
View Tutorial Here (materials will be available after 7/12)
Cross Validation
Cross Validation with Sarah Sant’Ana
Cross validation is a common resampling technique used in machine learning studies. Broadly, cross validation involves splitting data into multiple training and testing subsets to increase generalizability of the model building and evaluation processes. There are multiple types of cross validation (e.g. k-fold, bootstrapped), but all serve two primary purposes:
- To select the best model configurations (e.g. what type of statistical model will perform best, which sets of features will perform best, covariate selection, hyperparameter tuning, outlier identification approaches, predictor transformations, and more).
- To evaluate the expected performance of our models in new data (i.e. on individuals who were never used in model building/selection)
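To make those two purposes concrete, here is a minimal sketch in Python with scikit-learn (a hypothetical toy example; the session’s own demo materials use R). It uses 5-fold CV first to pick a hyperparameter and then to estimate performance on held-out folds:
# Hypothetical sketch of CV for model selection and evaluation (Python/scikit-learn);
# the workshop demo itself uses R (caret/tidymodels).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1) Model selection: tune the regularization strength C with 5-fold CV.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_)

# 2) Model evaluation: estimate performance of the chosen configuration on held-out folds.
# (For a fully unbiased estimate, step 1 would be nested inside these folds - nested CV,
# as discussed in the Papers section below.)
scores = cross_val_score(grid.best_estimator_, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("accuracy per fold:", scores.round(3))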
Why should I use cross validation?
You should use cross validation if…
- You are fitting a statistical model with hyperparameters that need tuning (e.g. elastic-net logistic regression, random forests, SVMs)
- You are considering multiple combinations of model configurations (e.g. features, statistical algorithms, data transformations)
- You want to consider a large number of predictive features or you do not want to rely on theory to guide identification of predictive features
- You want to build predictive models that will generalize well to new data (i.e. you want your model to be applied in some way)
List of ideas, concepts, or tools that are associated with this topic
- R/RStudio (especially the caret, tidymodels, and parsnip packages)
- Python
- Common types of cross validation (CV): bootstrapped CV, k-fold CV, nested CV
- Basic knowledge of linear and logistic regression
- Bias/variance trade offs in model fitting and evaluation
- Generalizability of predictive models (why it’s important, how to prioritize it, and how to assess it)
In preparation for our meeting, please review the following materials:
- For framing, please read the beginning of Yarkoni & Westfall (2017) http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!
- Watch this solid 6 minute explanation of cross validation by Statquest https://www.youtube.com/watch?v=fSytzGwwBVw
- Skim this “big picture” blog post that provides more clarity surrounding the distinction between model evaluation and model selection: A “short” introduction to model selection
During the meeting:
- Plan on a discussion about prediction vs explanation in psychological research. I want to help you think of how you might apply cross validation in your work if you are interested 😊
- I will be walking us through the attached Cross Validation Markdown document (open the link and download the file; Google will default to opening it as a Google Doc, which is not functional) to provide you some code for implementing cross validation. No need to read this beforehand, but you can have it open during the session if you’d like to follow along.
- Feel free to send me any questions beforehand or ask during the session! Happy to talk research, data science, or grad school as would feel beneficial to you all. My email is skittleson@wisc.edu
Additional Materials (not required, just for your reference)
Books
- Here are selected readings on cross validation from two *free* online textbooks: James et al. (2013) Chapter 5: Resampling Methods (pp 175 – 186) and Kuhn and Johnson (2018) Chapter 4: Resampling Techniques (pp 67 – 78). These books are amazing for learning about any sort of applied statistical learning – highly recommend!
Online tutorials (blogs and code examples):
- This is an R Markdown file written by the creator of the caret package in R (one of the most widely used machine learning packages in R to date). It explains how to tune the various types of hyperparameters using CV within caret’s train function. Even if you don’t plan to use R, it is helpful to see what types of parameters are tuned for different models, and it provides examples of creating and evaluating search grids, alternate performance metrics, and more. Model training and tuning
- This is a nice (but lengthy) R Markdown example of approaching a classic machine learning problem (product price estimation) that showcases hyperparameter tuning of a couple of different algorithms (and their comparison): Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial. This is geared towards a more advanced beginner – it still walks you through everything, but incorporates more robust data cleaning and exploration before model fitting.
Videos:
- This video is a good walkthrough of using K-fold cross-validation in Python to select optimal tuning parameters, choose between models, and select features: Selecting the best model in scikit-learn using cross-validation
- A short 4 minute tutorial about how to tune various types of statistical learning models within cross validation using the caret package in R. It doesn’t discuss much of the theory and is more appropriate for application focused users who are just trying to figure out how to implement parameter tuning within CV: R Tutorial – Hyperparameter tuning in caret
Papers:
- This paper describes the impact of using different CV types for parameter selection and model evaluation: Bias in error estimation when using cross-validation for model selection. This requires an intermediate-level understanding of using CV for parameter selection. Many people using machine learning in applied contexts are using improper CV methods that bias their model performance estimates. We should be using nested CV (or bootstrap CV with a separate validation set) if we are planning to select model parameters and generate trustworthy performance metrics.
- Really cool preprint that describes sources of bias in ML resampling methods due to incorrect application in psychological research https://psyarxiv.com/2yber/. A more intermediate level read because it requires some understanding of multiple types of CV methods.
Using pre-trained models to perform psychology experiments
Using CNNs to run psychophysics experiments
With Sid Suresh
Humans can quickly pool information from across many individual objects to perceive ensemble properties, like the average size or color diversity of objects. Such ensemble perception in humans is thought to occur extremely efficiently and automatically. We’ll learn how to run experiments on a CNN to understand whether ensemble representations of average size emerge in these networks.
The goal:
(1) Understand how we can design and run a psychophysics experiment using a pre-trained Convolutional Neural Network.
List of ideas/concepts/tools that are associated with this topic
Ensemble representations
Convolutional neural networks
Python
Google Colab
Linear Regression
Logistic Regression
Prepare for the LUCID/PREP Data Science Workshop on Computational Vision Models
- Watch the previous weeks’ videos about CNNs
- Read the abstract – https://jov.arvojournals.org/article.aspx?articleid=2777780
- Watch the video – https://www.youtube.com/watch?v=fKpKujKH1W0&ab_channel=NeuromatchConference
Colab Notebook
Here is the notebook we’ll be working together on this Wednesday – https://github.com/siddsuresh97/prep_tutorial/blob/main/tutorial.ipynb
Resources and Sessions from 2022:
This is an accordion element with a series of buttons that open and close related content panels.
Mixed Linear Models
Introduction to Mixed Linear Models with Melissa Schoenlein
Linear models are a type of analysis used to evaluate data in which all observations are independent.
Mixed effects models are a type of analysis used to evaluate data with non-independence that cannot otherwise be analyzed with regular linear regression.
What is non-independence/non-independent data?
Non-independence occurs when two or more data are connected (correlated) in some way. For example, you run an experiment collecting ratings on interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and then again at the end of the semester. Each of these participants has three data points. These data points are non-independent since they are from the same person and thus are related in ways beyond the experimental procedure. In other words, data points from one participant are more likely to be more similar to each other than data points from two different participants.
Non-independence can exist beyond repeated measures at the participant level to any items occurring within “units”, including classrooms, family members, cities, etc.
Why/when should I use mixed linear models?
Using regular linear regression when data is non-independent can lead to inflated Type 1 error rates (saying that you have a significant result, when you actually don’t!!), less statistical power, and potentially inaccurate effects. A mixed linear model should be used anytime there is a non-independent relationship in the data.
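As a rough illustration of the math-interest example above, here is a minimal sketch in Python using statsmodels (hypothetical simulated data; the workshop itself uses the lme4 package in R, where the matching call would be lmer(interest ~ time + (1 | participant), data)):
# Hypothetical sketch: a random-intercept mixed model for repeated ratings
# (Python/statsmodels; the workshop demo uses lme4 in R).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_participants, n_timepoints = 30, 3

# Each participant rates interest in math at 3 timepoints (non-independent data).
data = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), n_timepoints),
    "time": np.tile(np.arange(n_timepoints), n_participants),
})
person_effect = rng.normal(0, 1.0, n_participants)        # each person's overall interest level
data["interest"] = (5 + 0.5 * data["time"]
                    + person_effect[data["participant"]]
                    + rng.normal(0, 0.5, len(data)))

# Fixed effect of time, random intercept for each participant.
model = smf.mixedlm("interest ~ time", data, groups=data["participant"])
print(model.fit().summary())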
List of ideas/concepts/tools that are associated with this topic
Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.
Nonindependence
Fixed versus random effects
lme4 package in R
Prepare for the LUCID/PREP Data Science Workshop on mixed models:
In preparation for our meeting on Wednesday 6/29, please watch, read, and download the following materials.
1. Watch videos 2, 3, 11, and 16 from this multi-part video series providing a general overview of mixed models, when to use them, and how to interpret them. Total time for these 4 videos: ~12 minutes.
2. Read through this online tutorial that provides a walkthrough of code and output of basic linear mixed effects models in R and why we use them. We will work together through some examples in R during the workshop, so this tutorial will provide a good foundation for being ready to apply the code to different contexts.
3. Install the following packages in R: lme4, ggplot2
Optional additional resources if you’re interested in learning more:
Tutorials: A very short cheat sheet of using the lme4 package in R to analyze mixed models.
A Github repo with a 3-part workshop aimed at providing tutorials and exercises to learn how to do mixed models in R. The first part is a general intro to R. The second part is about statistical modeling (generally) in R. Then part 3 is mixed models in R.
A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.
Papers: This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.
Jake Westfall, a former quantitative psychologist that now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.
Computational Vision Models
Using pre-trained models to perform psychology experiments
Using CNNs to run psychophysics experiments
Humans can quickly pool information from across many individual objects to perceive ensemble properties, like the average size or color diversity of objects. Such ensemble perception in humans is thought to occur extremely efficiently and automatically. We’ll learn how to run experiments on a CNN to understand whether ensemble representations of average size emerge in these networks.
The goal:
(1) Understand how we can design and run a psychophysics experiment using a pre-trained Convolutional Neural Network.
List of ideas/concepts/tools that are associated with this topic
Ensemble representations
Convolutional neural networks
Python
Google Colab
Linear Regression
Logistic Regression
Prepare for the LUCID/PREP Data Science Workshop on Computational Vision Models
- Watch the previous weeks’ videos about CNNs
- Read the abstract – https://jov.arvojournals.org/article.aspx?articleid=2777780
- Watch the video – https://www.youtube.com/watch?v=fKpKujKH1W0&ab_channel=NeuromatchConference
Colab Notebook
Here is the notebook we’ll be working together on this Wednesday – https://github.com/siddsuresh97/prep_tutorial/blob/main/tutorial.ipynb
Designing experiments with JsPsych & data cleaning in R
Natural Language Processing
Resources and Sessions from 2021:
Natural Language Processing (NLP)
Session 1: Introduction to Natural Language Processing
Prepared by Laura Stegner, stegner@wisc.edu
What is Natural Language Processing?
Natural Language Processing (NLP) can be broadly thought of as the computational tools used to help computers understand and manipulate spoken or written natural language to do useful things. This goal can be achieved with the help of various NLP tasks, such as:
- Part-of-speech tagging
- Speech recognition
- Word sense disambiguation
- Sentiment analysis
- Natural language generation
- Named entity recognition
- Co-reference resolution
Each of the above tasks is briefly described in this article by IBM.
Practically, NLP is present in our everyday lives. Some common examples include autocorrect, autocomplete, related search terms in a search engine, email filtering, smart agents (e.g. Siri or Alexa), and machine translation (e.g. Google Translate). It is also useful in business applications, such as analyzing reviews or creating automated calling systems and chat bot assistants.
When would I want to use NLP?
While NLP is being readily implemented in everyday products, it is also greatly useful in data science. NLP can be used to convert messy, unstructured natural language responses (such as interview data or open responses to survey questions) into more structured, processable data forms. Using NLP techniques to analyze data can serve to speed up processing time and also eliminate inconsistencies from manual analysis.
Preparation
Prior to our meeting, please review the following materials:
- (optional but interesting) Article that walks through the history of NLP: https://machinelearningmastery.com/natural-language-processing/
- High level introduction to NLP (12-minute video): https://www.youtube.com/watch?v=fOvTtapxa9c
- Slightly different take on NLP (4-minute video): https://www.youtube.com/watch?v=d4gGtcobq8M
- Lecture that introduces sentiment analysis (7-minute video): https://www.youtube.com/watch?v=S4z0UG07-b0
- Article about bias in NLP: https://towardsdatascience.com/bias-in-natural-language-processing-nlp-a-dangerous-but-fixable-problem-7d01a12cf0f7
- Short article about general ethical considerations when using NLP in a clinical setting: https://arxiv.org/pdf/1703.10090.pdf
Also think about the following. We will have a discussion related to some of these topics 🙂
- Times you have encountered NLP in either your research or your daily life.
- Situations where you don’t currently use NLP, but where and how it could come in handy.
- Why we should care about the ethical considerations of NLP in data science.
Additionally, install the following packages in Python 3:
- nltk:
pip3 install nltk==3.3
or python3 -m pip install nltk==3.3
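Once nltk is installed, a minimal sketch of two of the tasks listed above (part-of-speech tagging and sentiment analysis) might look like the following; the example sentence is made up, and the nltk.download calls fetch the required models once:
# Hypothetical sketch of two NLP tasks with nltk: POS tagging and sentiment analysis.
import nltk

# One-time downloads of the tokenizer, tagger, and sentiment lexicon.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("vader_lexicon")

from nltk import pos_tag, word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

text = "The new tutorial was surprisingly clear and genuinely fun."

tokens = word_tokenize(text)          # split the sentence into word tokens
print(pos_tag(tokens))                # tag each token with its part of speech

sia = SentimentIntensityAnalyzer()    # VADER: rule-based sentiment scorer
print(sia.polarity_scores(text))      # negative/neutral/positive/compound scores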
Additional Reading / Reference
Chowdhury, G.G. (2003), Natural language processing. Ann. Rev. Info. Sci. Tech., 37: 51-89. https://doi-org.ezproxy.library.wisc.edu/10.1002/aris.1440370103
Hovy, D., & Spruit, S. L. (2016, August). The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 591-598). https://www.aclweb.org/anthology/P16-2096.pdf
Leidner, J. L., & Plachouras, V. (2017, April). Ethical by design: Ethics best practices for natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (pp. 30-40). https://www.aclweb.org/anthology/W17-1604.pdf
Tutorial and demo materials are on Laura’s github site:
https://github.com/lstegner/nlp-tutorial-PREP2021/tree/main/tutorial-materials
Regularization
Introduction to Regularization with Kendra Wyant
What is Regularization?
Regularization is a technique used in regression that imposes a penalty on the coefficients of complex models. This penalty reduces overfitting by introducing some bias into the model. As we see with the bias-variance tradeoff, introducing some bias can reduce variance in model predictions on new data, making the model more generalizable.
Types of regularization
* Ridge regression: variables with minor contribution have their coefficients close to zero. However, all the variables are incorporated in the model. This is useful when all variables need to be incorporated in the model according to domain knowledge.
* Lasso regression: the coefficients of some less contributive variables are forced to be exactly zero. Only the most significant variables are kept in the final model.
* Elastic net regression: a combination of ridge and lasso regression. It shrinks some coefficients toward zero (like ridge regression) and sets some coefficients to exactly zero (like lasso regression).
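A minimal sketch of the three penalties in Python with scikit-learn (hypothetical simulated data; the workshop demo itself uses R). Only the first three predictors truly matter, so notice how lasso zeroes out the noise coefficients while ridge merely shrinks them:
# Hypothetical sketch comparing ridge, lasso, and elastic net coefficient estimates
# (Python/scikit-learn; the workshop demo uses R).
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))   # only 3 real effects
y = X @ true_coef + rng.normal(scale=1.0, size=n)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1)),
                    ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    print(f"{name:12s}", np.round(model.coef_, 2))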
List of Related Topics/Ideas
We won’t be able to cover all of these topics due to time, but I will provide resources and code for anyone who is interested in exploring these further or using them in their own research. I am also happy to chat more outside the workshop!
- Prediction vs Explanation in Psychology
- Overfitting
- Bias/variance tradeoff
- Test and training sets
- Cross-validation and resampling
Preparation
Watch:
StatQuest Youtube Series
- Machine learning fundamentals – bias and variance (6:35) https://www.youtube.com/watch?v=EuBBz3bI-aA
- Ridge regression clearly explained (20:26) – https://www.youtube.com/watch?v=Q81RR3yKn30
- Lasso regression clearly explained (8:18) – https://www.youtube.com/watch?v=NGf0voTMlcs&t
- Elastic net regression clearly explained (5:18) – https://www.youtube.com/watch?v=1dKRdX9bfIo
Optional: Machine Learning Fundamentals: Cross Validation (6:04) https://www.youtube.com/watch?v=fSytzGwwBVw
Read:
- Skim the first 10 pages of Yarkoni and Westfall (2017) http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf
- Read this blog post on overfitting https://www.ibm.com/cloud/learn/overfitting
Install Software:
- We will be using R and RStudio
- Install the following packages in RStudio:
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("kableExtra")
install.packages("skimr")
install.packages("naniar")
install.packages("doParallel")
install.packages("mlbench")
install.packages("vip")
install.packages("Matrix")
install.packages("glmnet")
Additional Resources
Coding
- R for Data Science – https://r4ds.had.co.nz/
- Tidyverse style guide – https://style.tidyverse.org/
- Julia Silge blog – https://juliasilge.com/blog/
- Tidy modeling with R – https://www.tmwr.org/
Machine learning resources
- Introduction to statistical learning – https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/6009dd9fa7bc363aa822d2c7/1611259312432/ISLR+Seventh+Printing.pdf
- Applied predictive modeling – https://vuquangnguyen2016.files.wordpress.com/2018/03/applied-predictive-modeling-max-kuhn-kjell-johnson_1518.pdf
I am looking forward to meeting all of you on Wednesday. Please don’t hesitate to reach out about anything (kpaquette2@wisc.edu). I am happy to talk about data science, PREP, Madison, grad school, and more!
Demo script and other resources can be found on Kendra’s github site:
Mixed Linear Models
Introduction to Mixed Linear Models with Melissa Schoenlein
Mixed linear models are a type of analysis used to evaluate data with non-independence that cannot otherwise be analyzed with regular linear regression.
What is non-independence/non-independent data?
Non-independence occurs when two or more data are connected (correlated) in some way. For example, you run an experiment collecting ratings on interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and then again at the end of the semester. Each of these participants has three data points. These data points are non-independent since they come from the same person and thus are related in ways beyond the experimental procedure (i.e. points from one participant are more likely to be more similar to each other than data points from two different participants).
Non-independence can exist beyond repeated measures at the participant level to any items occurring within “units”, including students in classrooms, family members, etc.
Why/when should I use mixed linear models?
Using regular linear regression when data is non-independent can lead to inflated Type 1 error rates, less statistical power, and potentially inaccurate effects. A mixed linear model should be used anytime there is a non-independent relationship in the data.
List of ideas/concepts/tools that are associated with this topic
Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.
Nonindependence
Fixed versus random effects
lme4 package in R
Preparation:
In preparation for our virtual workshop Wednesday 7/7 at 4pm CST, please watch and read the following materials.
- Watch videos 1-3, 11, and 16 from this multi-part video series providing a general overview of mixed models, when to use them, and how to interpret them (totals ~ 12 minutes). Video 11 focuses on repeated measures models, which will be the focus of our workshop.
- Skim through this online tutorial that provides a walkthrough of code and output of basic linear mixed effects models in R and why we use them.
- Skim through this very short cheat sheet of using the lme4 package in R to analyze mixed models.
- Install the following packages in R: lme4, ggplot2
Optional additional resources if you’re interested in learning more:
Videos:
A high level video overview of mixed models (mostly framed in terms of hierarchical models). The first half of the video describes when/why someone would use these models. The second half starts to get into the equations/math for these models.
Tutorials:
A Github repo with a 3-part workshop aimed at providing tutorials and exercises to learn how to do mixed models in R. The first part is a general intro to R. The second part is about statistical modeling (generally) in R. Then part 3 is mixed models in R.
A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.
Papers:
This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.
Jake Westfall, a former quantitative psychologist that now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.
Workshop will be led by Melissa Schoenlein. I can be reached at schoenlein@wisc.edu if there are any issues accessing these materials or if there are any questions (about the workshop, the PREP program, the department, or anything!). Looking forward to meeting this year’s PREPsters!
Principal Component Analysis
Introduction to Principal Component Analysis with Vince Frigo
- Read through the PCA introduction
- Try a simple PCA
- Use PCA results to plot
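A minimal, hypothetical sketch of “try a simple PCA and use the results to plot” in Python with scikit-learn, using the built-in iris measurements as a stand-in for your own data:
# Hypothetical PCA sketch: project 4-dimensional iris measurements onto 2 components and plot.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)   # standardize features before PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                   # each row: coordinates on PC1 and PC2
print("variance explained:", pca.explained_variance_ratio_.round(2))

plt.scatter(scores[:, 0], scores[:, 1], c=iris.target)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()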
Optimization
Optimization with Scott Sievert
Watch Scott’s Intro Video: box.stsievert.com/prep
The following can also be found on Scott’s github: github.com/stsievert/PREP21
Hello PREP students! My learning objectives are to answer these questions:
- Why should I care about optimization?
- What are the basics of optimization? How do I get a better solution?
- Where does optimization fail?
“Optimization” is the process of producing a model that accurately represents data, aka “fitting” a model to data. Importantly, the choice of “model” and “data” is perhaps more important than the specific method used to fit the model to the data. In short, optimization is what happens with this code:
from sklearn.linear_model import LinearRegression
est = LinearRegression()
# X and y are stand-ins for other data; they could easily come from a CSV
X = [[1, 2], [3, 4]]
y = [3, 5]
est.fit(X, y)
In this lesson, I’ll try to open up the black box that happens when you call fit. I’ve selected about an hour’s worth of video for you to watch, and will try to highlight some relevant issues in person.
Note: optimization is heavy in mathematics. I will try to illustrate optimization without relying on mathematics.
Background
What’s optimization?
Optimization is a process to “fit” a “model” to “data.”
- Data, typically some features and a label for each example.
- A model that will try to predict the label from a feature vector.
- A loss function that characterizes how poorly the model is performing for a specific example.
“Fitting” means “can the model accurately predict an unseen example?” Here are some good background videos on the components above:
- What’s optimization? youtube.com/watch?v=x6f5JOPhci0 (10:08) provides a general overview of optimization methods (and tradeoffs of those methods) and some common issues in optimization in a real-world example.
- How are machine learning (ML) and optimization related? youtube.com/watch?v=NzwMV2b7jbQ (10:31) introduces ML models and how to find them, including what the primary goal is given noisy/non-standard examples.
How are models found?
The videos above provide a general overview of machine learning/optimization and a general idea of what happens inside fit. Now, let’s get into some specifics on how to find the best model for the models mentioned in “Mixed Linear Models”:
- How is linear regression performed? youtube.com/watch?v=PaFPbb66DxQ (9:21).
- How is the minimum “loss” or “error” found in machine learning? youtube.com/watch?v=IHZwWFHWa-w (only the first 11:18 is relevant from this 21:00 video).
- Which loss function should I use? youtube.com/watch?v=fr7dfyfB7mI (6:14) steps through different use cases where different loss functions would apply. This is the most important part of ML.
This is enough background to understand my examples. In the examples, I’ll highlight some issues with optimization, including data size, noise, and loss functions.
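As a concrete preview of what happens inside fit, here is a minimal, hypothetical sketch of gradient descent on the squared loss for linear regression: compute the prediction error, compute the gradient of the loss, and step downhill.
# Hypothetical sketch of gradient descent minimizing squared loss for linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)   # noisy labels

w = np.zeros(2)   # model parameters, initialized at zero
lr = 0.1          # step size
for step in range(200):
    residual = X @ w - y                   # prediction error on each example
    loss = np.mean(residual ** 2)          # squared loss
    grad = 2 * X.T @ residual / len(y)     # gradient of the loss with respect to w
    w -= lr * grad                         # step downhill
print("estimated weights:", np.round(w, 3), " final loss:", round(float(loss), 4))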
Demo
The videos above are all the material you need for the demo. To follow along with my demos, visit github.com/stsievert/PREP21/blob/master/README.md
Want to learn more?
This material is not required for the example.
Here are some other useful videos:
- How do I score a model? youtube.com/watch?v=rY5pdNW7jKM (4:40) steps through the data you should use, a critical choice. (I can talk all day about this).
- How does classification work? And how can it be modified to support more complex data? youtube.com/watch?v=-Z4aojJ-pdg&t=5m40s (12:23)
- Which model class should I use? How to choose a model class: youtube.com/watch?v=7jjzMZOdPZw (18:37)
- What’s a neural network? youtube.com/watch?v=aircAruvnKk (19:13)
Also, I would skim Chapter 7 of “Shape” by Jordan Ellenberg (23 pages). It’s light reading, and it tells a good story about optimization. The author is a mathematics professor at UW–Madison and experienced with optimization.
This chapter can be found in the #prep channel of the LUCID slack workspace: wisc-lucid.slack.com
In addition, I’ve written a blog series on optimization that tries to introduce the math behind optimization:
- “Least squares and regularization,” which steps through the basics of linear regression stsievert.com/blog/2015/11/19/inverse-part-1/
- “Finding sparse solutions to linear systems,” which examines a particular type of regularization (and has some fancy interactive widgets to understand what the minimization is doing) stsievert.com/blog/2015/12/09/inverse-part-2/
- “Gradient descent and physical intuition for heavy-ball acceleration with visualization”, which looks at a method to modify optimization methods. stsievert.com/blog/2016/01/30/inverse-3/
Convolutional Neural Networks
CNNs with Lowell Thompson
In this week’s session we will be learning about neural networks, focusing primarily on convolutional neural networks (CNNs). CNNs have become a useful tool for the development of self-driving cars, object and face recognition software, medical imaging analysis (e.g., MRI), and many other areas. These models can be simple to build using tools like TensorFlow and Pytorch, the latter of which we’ll use for our demo. Their inner workings, however, combine nearly all of the tools introduced throughout this workshop including linear regression, regularization, optimization, and dimensionality reduction. I hope to provide a brief introduction to CNNs, give you some hands-on experience with a pre-built model and then provide some time for discussion.
Session outline:
- Introduction to Neural Networks
- Introduction to CNNs
- Demo session with a pre-built CNN
- Discussion
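As a preview of the kind of pre-built model we’ll work with, here is a minimal, hypothetical PyTorch sketch: two convolution-plus-pooling stages followed by a linear classifier, sized for 28x28 grayscale images.
# Hypothetical sketch of a tiny CNN in PyTorch (the session's demo uses a pre-built model).
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),    # 1x28x28 -> 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 8x14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),   # -> 16x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, n_classes)

    def forward(self, x):
        x = self.features(x)                  # convolution + pooling extract local features
        return self.classifier(x.flatten(1))  # linear layer maps features to class scores

model = TinyCNN()
dummy_batch = torch.randn(4, 1, 28, 28)       # 4 fake grayscale images
print(model(dummy_batch).shape)               # torch.Size([4, 10])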
Preparation for the workshop:
- Watch the following videos:
- Installation Requirements
- Python https://www.python.org
- Pytorch https://pytorch.org/
- Numpy
- Matplotlib
- Read these articles
- Pull code from GitHub
If you have trouble viewing any of the materials, please let me know (lwthompson@wisc.edu).
Resources and Sessions from 2020:
Support Vector Machines
Session 1: Support Vector Machines (SVM) with Kushin Mukherjee
Support Vector Machines (SVMs) deal with a fundamentally simple problem – how do we divide up datapoints using some form of meaningful decision boundary in a supervised learning setting? This approach gets its name from support vectors, a subset of the labeled data points whose dot products help in determining the decision boundary.
In contrast to approaches like simple neural networks or least-squares classifiers, SVMs have two overall advantages that are important to consider together:
- They do not get stuck in local minima. If the data are linearly separable, the algorithm will always find the same ‘best’ decision boundary.
- If the data aren’t linearly separable, the SVM approach supports a transformation of the dot products into a space where the data are linearly separable. This is what’s known as the ‘kernel trick’ in SVMs.
(Note: While I do distinguish the SVM approach from simple neural networks, it has been shown that there are specific classes of neural networks that are equivalent to kernel methods such as those in SVMs. Here’s a brief summary – What are the Mathematical Relationship between Kernel Methods and Neural Networks)
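A minimal, hypothetical sketch of both points with scikit-learn: a linear SVM handles linearly separable blobs, while concentric circles need the kernel trick (here an RBF kernel):
# Hypothetical SVM sketch (Python/scikit-learn): linear kernel vs. the kernel trick.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data: a linear decision boundary is enough.
X1, y1 = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
linear_svm = SVC(kernel="linear").fit(X1, y1)
print("linear kernel accuracy:", linear_svm.score(X1, y1))
print("number of support vectors:", linear_svm.support_vectors_.shape[0])

# Concentric circles are not linearly separable in the original space;
# an RBF kernel implicitly maps the dot products into a space where they are.
X2, y2 = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
print("linear kernel on circles:", SVC(kernel="linear").fit(X2, y2).score(X2, y2))
print("RBF kernel on circles:", SVC(kernel="rbf").fit(X2, y2).score(X2, y2))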
List of ideas/concepts/tools that are associated with this topic
- Classification
- Supervised learning
- Linear separability
- Kernel methods
Preparation for meeting:
First Watch: Patrick Winston’s lecture on SVMs is one of the easiest to follow and assumes a very minimal background in linear algebra and multivariable calculus: Youtube
Try this out second! You will need Jupyter and the necessary libraries installed. A Python-based implementation of SVM using scikit-learn: Stackabuse
Additional Optional Resources:
Videos:
One might like Andrew Ng’s lecture on the same topic from 2018, which is a bit more recent, but SVMs haven’t changed much over the past decade: Youtube (start from 46:20)
Online tutorials:
To get a stronger grasp on the mathematics behind SVMs and do some ‘hands-on’ work with them I recommend this site: SVM Tutorial
Here’s another jupyter notebook based python implementation of SVMs using scikit-learn: Learnopencv
Applied Papers:
The following is useful for seeing how these tools are used in cognitive science more broadly.
Here are 2 papers that employ SVMs in NLP and cognitive neuroscience settings
Shallow semantic parsing of sentences using SVMs: aclweb
Effective functional mapping of fMRI data using SVMs: ncbi
Theory Papers:
The original SVM paper by Vladimir Vapnik: image.diku
Jupyter Notebooks Tutorial
Jupyter Notebooks Online Tutorial with Pablo Caceres
The following is a great resource to watch/read at your own pace, and feel free to contact Pablo with any questions.
Unix Shell Tutorial
R Markdown
Introduction to R Markdown with Gaylen Fronk
R Markdown provides an authoring framework for data science in R. With a single R Markdown file, you can not only write, save, and execute your code but also communicate your process and results with an audience using high-quality, reproducible output formats.
More detail about R Markdown
R Markdown builds off tools already available in R and RStudio to provide an integrated environment for processing, coding, and communicating. An R Markdown file can include text, chunks of code, images, links, figures, and tables. While you’re working in your RStudio environment, your file operates similarly to a normal R script (a .R file) – you can write, edit, and evaluate code to work with your data. At any point, you can “knit” your file. Knitting runs, evaluates, and compiles your R Markdown file into your desired output (e.g., HTML, PDF) to create a single document that includes all the components of your written file plus the results. This knit file is ready for high-quality scientific communication with any audience. If you’ve ever seen nice examples of R code and output online, it was probably made using R Markdown.
Why should I use R Markdown?
R Markdown is particularly helpful if…
- You already work in R or RStudio and would like some additional tools at your disposal
- You value reproducible output
- You would like to be able to share your work with people who are less familiar with R (or coding more generally)
R Markdown combines the data wrangling and analytic tools of R with high-quality scientific communication. It can become your one-stop shop for sharing your data science.
Prepare for the LUCID/PREP Data Science Workshop on R Markdown:
In preparation for our video meeting next week (Wednesday 7/1 at 4pm CST), please watch, read, or review the following materials.
- Begin with this 1-minute video of what’s possible with R Markdown.
- Read Chapter 1 (Installation) from R Markdown: the Definitive Guide (Note: you should have R & RStudio installed prior to our workshop. Confirm in advance that you can open these applications.)
- Read Chapter 2 (Basics) from R Markdown: the Definitive Guide
- Read this section of Chapter 3 (Outputs: HTML) from R Markdown: The Definitive Guide
- Review this cheat sheet and have it handy for our meeting
Optional additional resources if you’re interested in learning more:
- This paper from the Statistics area of arXiv.org discusses how R Markdown can improve data science communication workflow. It’s perfect for people interested in understanding why R Markdown may be beneficial and receiving examples of its use-cases.
- This online book contains lessons on R Markdown basics, specific output formats, in-line and chunk code, tables, interactive websites, presentations, using multiple coding languages, and more. It’s perfect for someone looking for a comprehensive (yet still quite succinct) tutorial on using R markdown
- The Communication section from the R for Data Science online book includes several chapters on R markdown (the tidyverse’s preferred method for statistical and scientific communication)
- This online code from GitHub Gist provides an example/walkthrough of using R Markdown.
A note from Gaylen:
If you have questions about these materials or other questions you’d like answered during our workshop, you can submit them via this form. Please try to do this by Tuesday 6/30 at 5pm CST so that I can aggregate questions in advance.
Workshop will be led by Gaylen Fronk. You can email me at gfronk@wisc.edu if you have problems accessing these materials or installing R/RStudio. Looking forward to meeting you all!
Regression using Jupyter Notebooks
Optimization and model regularization with Owen Levin
Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems. We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.
Overview:
- linear regression as an optimization problem
- introduce loss functions
- curve fitting as optimization
- Is a perfect fit actually perfect? (wacky zero loss examples)
- model regularization
- small weights
- sparsity
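As a taste of the “is a perfect fit actually perfect?” bullet above, here is a minimal, hypothetical sketch: a degree-9 polynomial can drive the training loss to essentially zero on ten points yet do worse than a simple line on new inputs.
# Hypothetical sketch: zero training loss is not the same as a good model.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.2, size=10)     # truly linear relationship plus noise

line = np.polyfit(x, y, deg=1)                 # simple model
wiggly = np.polyfit(x, y, deg=9)               # interpolates the points (numpy may warn it is poorly conditioned)

x_new = np.linspace(0, 1, 100)                 # unseen inputs
y_new = 2 * x_new                              # the true (noise-free) values
print("degree-1 error on new data:", round(float(np.mean((np.polyval(line, x_new) - y_new) ** 2)), 3))
print("degree-9 error on new data:", round(float(np.mean((np.polyval(wiggly, x_new) - y_new) ** 2)), 3))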
Preparation:
1. If you haven’t already downloaded Anaconda or another Python distribution, please do so.
2. View this video: Owen’s Regression Intro
3. Jupyter Notebook: Optimization & Regularization
Or check out the github repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop
Mixed Linear Models
Introduction to Mixed Linear Models with Melissa Schoenlein
Mixed linear models are a type of analysis used to evaluate data with non-independence that cannot otherwise be analyzed with regular linear regression.
What is non-independence/non-independent data?
Non-independence occurs when two or more data are connected (correlated) in some way. For example, you run an experiment collecting ratings on interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and then again at the end of the semester. Each of these participants has three data points. These data points are non-independent since they come from the same person and thus are related in ways beyond the experimental procedure (i.e. points from one participant are more likely to be more similar to each other than data points from two different participants).
Non-independence can exist beyond repeated measures at the participant level to any items occurring within “units”, including students in classrooms, family members, etc.
Why/when should I use mixed linear models?
Using regular linear regression when data is non-independent can lead to inflated Type 1 error rates, less statistical power, and potentially inaccurate effects. A mixed linear model should be used anytime there is a non-independent relationship in the data.
List of ideas/concepts/tools that are associated with this topic
Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.
Nonindependence
Fixed versus random effects
lme4 package in R
Preparation:
In preparation for our video Wednesday 7/15 at 4pm CST, please watch and read the following materials.
- Watch videos 1-3, 11, and 16 from this multi-part video series providing a general overview of mixed models, when to use them, and how to interpret them (totals ~ 12 minutes). Video 11 focuses on repeated measures models, which will be the focus of our workshop.
- Skim through this online tutorial that provides a walkthrough of code and output of basic linear mixed effects models in R and why we use them.
- Skim through this very short cheat sheet of using the lme4 package in R to analyze mixed models.
- Install the following packages in R: lme4, ggplot2
Optional additional resources if you’re interested in learning more:
Videos:
A high level video overview of mixed models (mostly framed in terms of hierarchical models). The first half of the video describes when/why someone would use these models. The second half starts to get into the equations/math for these models.
Tutorials:
A Github repo with a 3-part workshop aimed at providing tutorials and exercises to learn how to do mixed models in R. The first part is a general intro to R. The second part is about statistical modeling (generally) in R. Then part 3 is mixed models in R.
A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.
Papers:
This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.
Jake Westfall, a former quantitative psychologist that now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.
Workshop will be led by Melissa Schoenlein. I can be reached at schoenlein@wisc.edu if there are any issues accessing these materials or if there are any questions (about the workshop, the PREP program, the department, or anything!). Looking forward to meeting this year’s PREPsters!
Data Visualization with Python in Jupyter Notebooks
Data Visualization with Python in Jupyter Notebooks with Pablo Caceres
Cross Validation
Cross Validation with Sarah Sant’Ana
Cross validation is a common resampling technique used in machine learning studies. Broadly, cross validation involves splitting data into multiple training and testing subsets to increase generalizability of the model building and evaluation processes. There are multiple types of cross validation (e.g. k-fold, bootstrapped), but all serve two primary purposes:
- To select the best model configurations (e.g. what type of statistical model will perform best, which sets of features will perform best, covariate selection, hyperparameter tuning, outlier identification approaches, predictor transformations, and more).
- To evaluate the expected performance of our models in new data (i.e. on individuals who were never used in model building/selection)
Why should I use cross validation?
You should use cross validation if…
- You are fitting a statistical model with hyperparameters that need tuning (e.g. elastic-net logistic regression, random forests, SVMs)
- You are considering multiple combinations of model configurations (e.g. features, statistical algorithms, data transformations)
- You want to consider a large number of predictive features or you do not want to rely on theory to guide identification of predictive features
- You want to build predictive models that will generalize well to new data (i.e. you want your model to be applied in some way)
List of ideas, concepts, or tools that are associated with this topic
- R/RStudio (especially the caret, tidymodels, and parsnip packages)
- Python
- Common types of cross validation (CV): bootstrapped CV, k-fold CV, nested CV
- Basic knowledge of linear and logistic regression
- Bias/variance trade offs in model fitting and evaluation
- Generalizability of predictive models (why it’s important, how to prioritize it, and how to assess it)
In preparation for our meeting next Tuesday, please review the following materials:
- For framing, please read the beginning of Yarkoni & Westfall (2017) http://jakewestfall.org/publications/Yarkoni_Westfall_choosing_prediction.pdf through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!
- Watch this solid 6 minute explanation of cross validation by Statquest https://www.youtube.com/watch?v=fSytzGwwBVw
- Skim this “big picture” blog post that provides more clarity surrounding the distinction between model evaluation and model selection: A “short” introduction to model selection
During the meeting on Tuesday:
- Plan on a discussion about prediction vs explanation in psychological research. I want to help you think of how you might apply cross validation in your work if you are interested 😊
- I will be walking us through the attached Cross Validation Markdown document (open the link and download the file; Google will default to opening it as a Google Doc, which is not functional) to provide you some code for implementing cross validation. No need to read this beforehand, but you can have it open during the session if you’d like to follow along.
- Feel free to send me any questions beforehand or ask during the session! Happy to talk research, data science, or grad school as would feel beneficial to you all. My email is skittleson@wisc.edu
Additional Materials (not required, just for your reference)
Books
- Here are selected readings on cross validation from two *free* online textbooks: James et al. (2013) Chapter 5: Resampling Methods (pp 175 – 186) and Kuhn and Johnson (2018) Chapter 4: Resampling Techniques (pp 67 – 78). These books are amazing for learning about any sort of applied statistical learning – highly recommend!
Online tutorials (blogs and code examples):
- This is an R Markdown file written by the creator of the caret package in R (one of the most widely used machine learning packages in R to date). It explains how to tune the various types of hyperparameters using CV within caret’s train function. Even if you don’t plan to use R, it is helpful to see what types of parameters are tuned for different models, and it provides examples of creating and evaluating search grids, alternate performance metrics, and more. Model training and tuning
- This is a nice (but lengthy) R Markdown example of approaching a classic machine learning problem (product price estimation) that showcases hyperparameter tuning of a couple of different algorithms (and their comparison): Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial. This is geared towards a more advanced beginner – it still walks you through everything, but incorporates more robust data cleaning and exploration before model fitting.
Videos:
- This video is a good walkthrough of using K-fold cross-validation in Python to select optimal tuning parameters, choose between models, and select features: Selecting the best model in scikit-learn using cross-validation
- A short 4 minute tutorial about how to tune various types of statistical learning models within cross validation using the caret package in R. It doesn’t discuss much of the theory and is more appropriate for application focused users who are just trying to figure out how to implement parameter tuning within CV: R Tutorial – Hyperparameter tuning in caret
Papers:
- This paper describes the impact of using different CV types for parameter selection and model evaluation: Bias in error estimation when using cross-validation for model selection. This requires an intermediate-level understanding of using CV for parameter selection. Many people using machine learning in applied contexts are using improper CV methods that bias their model performance estimates. We should be using nested CV (or bootstrap CV with a separate validation set) if we are planning to select model parameters and generate trustworthy performance metrics.
- Really cool preprint that describes sources of bias in ML resampling methods due to incorrect application in psychological research https://psyarxiv.com/2yber/. A more intermediate level read because it requires some understanding of multiple types of CV methods.
Neural Networks
Neural Networks with Ray Doudlah
Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, a computational model inspired by the brain.
- Introduce neural networks and their general architecture
- Introduce convolutional neural networks
- Implement a convolutional neural network to solve a handwriting recognition task
- Watch the following videos:
- Read these articles
- Pull code from GitHub
Overleaf
Overleaf by Glenn Palmer
LaTeX is a typesetting system that can be used to write academic papers and create professional-looking documents. Users type in plain text format, but mark up the text with tagging conventions, and the nicely-formatted result is shown in an output file. Overleaf is an online platform that can be used to create and edit LaTeX documents. You can share and simultaneously edit documents with collaborators, similar to the way you collaborate on a Google Doc.
For a high-level overview of LaTeX, Overleaf, and the resources below, watch this video:
Videos
- This playlist of videos is a good starting place. They were made by a company called ShareLaTeX, which recently merged with Overleaf. These videos give a good idea of how to get started using LaTeX with an online editing system.
Online tutorials
- For more detail, and/or for a range of written tutorials, the Overleaf documentation page has a wide range of information to help get started, or to answer specific questions you might have as you get used to using LaTeX.
Cheat sheet
- For a quick reference as you’re writing, this cheat sheet includes a bunch of commands for various formatting options, with a focus on writing scientific papers.
Resources and Sessions from 2019:
Introduction to Data Science with R
Session 1: Introduction to data science with R with Tim Rogers
This session will introduce you to working with data in an “integrated development environment” or IDE using the freely available and widely-used software package R. We will briefly discuss what is meant by the term “data science,” why data science is increasingly important in Psychology and Neuroscience, and how it differs from traditional statistical analysis. We will then get a sense for how IDEs work by building, from data generated in the workshop, an interactive graph showing the structure of your mental semantic network.
Preparation for the workshop: (TO DO before arriving on Tuesday!)
– Install R, R Studio, and Swirl on your laptop following the instructions here: swirlstats.com/students
– Start Swirl as instructed at the website and install the first course module by following the prompts
– Run yourself through the first course module
Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!
Overview:
We learned how to create semantic clusters from lists of animals. Tim created this Semantic Network Demo where you can view the interactive graph and get the code that was used to generate the semantic clusters. The demo walks through the process of building and visualizing graphs.
Using Github & Jupyter Notebooks
Session 2: Using GitHub and Jupyter notebooks in several data science environments with Pablo Caceres
Preparation for the workshop:
Fitting & Evaluating Linear Models
Session 3: Fitting and evaluating linear models with John Binzak
This session will introduce you to working with linear regression models using R. We will briefly discuss why linear regression is useful for Psychology and Educational research, using the topic of numerical cognition as an example. We will play an educational game to generate our own data in the workshop, form predictions, and test those predictions by modeling gameplay performance. Through this exercise we will cover how to fit linear regression models, assess the fit of those models, plot linear relationships, and draw statistical inferences.
Preparation for the workshop:
– Be ready to use R, RStudio, and Swirl on your laptop following the instructions here: swirlstats.com/students
– Install the “Regression Models” swirl module using the following commands in R:
> library(swirl)
> swirl::install_course("Regression Models")
> swirl()
– Run yourself through lessons 1-6 (Introduction-MultiVar Examples) and continue based on your interest.
Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!
Optimization & Model Regularization
Session 4: Optimization and model regularization with Owen Levin
Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems. We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.
- linear regression as an optimization problem
- introduce loss functions
- curve fitting as optimization
- Is a perfect fit actually perfect? (wacky zero loss examples)
- model regularization
- small weights
- sparsity
Preparation: If you haven’t already downloaded Anaconda or another Python distribution, please do so.
Overview: Please check out the github repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop
Pattern Recognition & Varieties of Machine Learning
Session 5: Pattern recognition and varieties of machine learning with Ashley Hou
Owen and Ashley will be co-facilitating this session.
This session will introduce basic concepts in machine learning. We will first discuss an overview of the steps involved in the machine learning process and the two main categories of machine learning problems. Then, we will walk through examples in both supervised and unsupervised learning, specifically classification using SVMs (discussing the regularization perspective) and clustering using the k-means clustering algorithm. We will conclude with brief discussion on other popular machine learning algorithms, when to use them, and good resources to learn more.
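A minimal, hypothetical sketch of the unsupervised half of the session, k-means clustering with scikit-learn (the SVM half looks much like the sketch in the Support Vector Machines section above):
# Hypothetical k-means sketch: group unlabeled points into 3 clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels deliberately ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:")
print(kmeans.cluster_centers_.round(2))
print("first 10 cluster assignments:", kmeans.labels_[:10])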
Preparation for the workshop:
1. Review session 4’s overview.
2. Have a working Python 3 distribution with scikit-learn, matplotlib, numpy, pandas, and Jupyter Notebook installed.
Cross-Validation
Session 6: Cross-validation with Sarah Sant’Ana
Today’s session will introduce the concept of cross validation. Using instructional videos from the Datacamp Machine Learning toolbox, we will walk through basic examples of cross validation in R using the caret package. We will be using two publicly available data sets in R for example code.
Our goals for this session are:
Preparation for the workshop:
– Be ready to use R and RStudio
– Read Yarkoni & Westfall (2017) through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session – it is not necessary to have a crystal clear understanding!
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100-1122.
Neural Networks
Session 7: Neural Networks with Ray Doudlah
Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, computational models that let the computer analyze data in a way loosely inspired by how the human brain analyzes data.
With recent advancements in computing power and the explosion of big data, we can now implement large models that perform end-to-end learning (deep learning). This means that we can create a model, feed it tons and tons of data, and the model will learn features from the data that are important for accomplishing the task.
Session outline:
• Introduce the simplest neural network, the perceptron
• Discuss the general architecture for neural networks
• Implement a neural network to solve a handwriting recognition task
• Introduce deep learning (convolutional neural networks)
• Implement a deep neural network to solve a handwriting recognition task
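A minimal, hypothetical numpy sketch of the first bullet above, the perceptron: a weighted sum, a threshold, and the classic error-driven update rule, applied here to the toy AND problem.
# Hypothetical perceptron sketch: learn a linear decision rule for the AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])      # AND of the two inputs

w = np.zeros(2)                 # weights
b = 0.0                         # bias
lr = 0.1                        # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        prediction = int(xi @ w + b > 0)    # weighted sum, then threshold
        error = target - prediction
        w += lr * error * xi                # perceptron update rule
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [int(xi @ w + b > 0) for xi in X])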
Preparation for the workshop:
- Watch the following videos:
- Pull session 7 materials from GitHub
Bayesian Inference
Session 8: Bayesian Inference: estimating unobservable variables with Lowell Thompson
This session will focus on introducing the utility of a common statistical method known as Bayesian inference. We’ll focus first on Bayes’ theorem and learn how it relates to our understanding of perception as an inverse problem. Since the majority of research in perception relies on various psychophysical methodologies to assess behavior, we’ll also walk through how you might generate your own experiments in Python using a package called Psychopy. After obtaining some data, we’ll look at a specific example that illustrates the utility of Bayesian inference in modeling our own behavioral data. Lastly, we’ll go over Bayesian inference in the broader context of data science.
Session Outline:
- Introduce Bayes Theorem
- Understand the utility of Bayesian inference in a variety of contexts
- Learn the basics of Psychopy to create basic experiments
- Use your own data from an orientation discrimination task to illustrate how Bayesian inference can be used.
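A minimal, hypothetical sketch of the orientation example in Python: a Gaussian prior over orientation is combined with a Gaussian likelihood from one noisy measurement, and Bayes’ theorem yields a posterior whose mean is pulled from the measurement toward the prior in proportion to their reliabilities.
# Hypothetical Bayesian inference sketch: posterior over a stimulus orientation (degrees).
import numpy as np

theta = np.linspace(-30, 30, 601)        # candidate orientations

prior_mean, prior_sd = 0.0, 10.0         # belief about orientation before the measurement
measured, noise_sd = 8.0, 5.0            # one noisy sensory measurement

def gaussian(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2)

# Bayes' theorem: the posterior is proportional to likelihood * prior (then normalize).
posterior = gaussian(theta, measured, noise_sd) * gaussian(theta, prior_mean, prior_sd)
posterior /= posterior.sum()

posterior_mean = (theta * posterior).sum()
print("posterior mean:", round(float(posterior_mean), 2))   # about 6.4: between 8 and the prior at 0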
Preparation: Please try to install Psychopy on your computer prior to the session, and try running one of their tutorials to make sure it works: Psychopy