
Data Science Workshop

Psychology Research Experience Program (PREP) provides mentoring and experience to undergraduates who have an interest in a scientific psychology career. LUCID partnered with PREP to create a hands-on data science workshop series. LUCID graduate students will facilitate the data science workshops.

This virtual data science workshop will be held on Wednesdays from 4-5 p.m. starting June 16th, 2021. Look for the meeting link in your email.



In the workshops, students will be introduced to data-science environments, concepts, and applications through the use of JuPyteR notebooks. LUCID facilitators will introduce a series of data science concepts via online materials and hands-on virtual sessions. Students will work through examples and demos in the notebook environment, with guidance from LUCID graduate students.


The goal is for PREP students to gain a sense of 1) how to work with an R or Python integrated development environment, 2) the kinds of things one can do with a range of data-science tools, and 3) how to continue learning about and working with these tools in the future. Note that the goal is not specifically to teach programming in R, Python, or any other language, but to show how to work interactively with and adapt notebooks that carry out common data-science tasks, and to give a general sense of what the methods are used for and how they might be applied to one’s own data.



Materials and Session Outlines:

This will be updated with materials and facilitator outlines as they become available.


Resources and Sessions from 2020:


Support Vector Machines

Session 1: Support Vector Machines (SVM) with Kushin Mukherjee

Support Vector Machines (SVMs) deal with a fundamentally simple problem – how do we divide up datapoints using some form of meaningful decision boundary in a supervised learning setting? This approach gets its name from support vectors, a subset of the labeled data points whose dot products help in determining the decision boundary.


In contrast to approaches like simple neural networks or least-squares classifiers, SVMs have 2 overall advantages that are important to consider together:

  1. They do not get stuck in local minima. If the data are linearly separable, the algorithm will always find the same ‘best’ (maximum-margin) decision boundary.
  2. If the data aren’t linearly separable, the SVM approach supports a transformation of the dot products into a space where the data are linearly separable. This is what’s known as the ‘kernel trick’ in SVMs.

(Note: While I do distinguish the SVM approach from simple neural networks, it has been shown that there are specific classes of neural networks that are equivalent to kernel methods such as those in SVMs. Here’s a brief summary: What are the Mathematical Relationship between Kernel Methods and Neural Networks)
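
As a minimal sketch of the kernel trick described above (not taken from the session materials), the snippet below checks that a degree-2 polynomial kernel gives the same number as an ordinary dot product taken after an explicit feature transformation. The helper names `phi` and `poly_kernel` and the two sample points are illustrative assumptions.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for 2-D inputs (hypothetical helper).

    Maps (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2). In this space a
    circular boundary from the original space becomes a linear one.
    """
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: equals dot(phi(x), phi(y)) without building phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # dot product in the transformed space
implicit = poly_kernel(x, y)     # same value, computed from the original coordinates
print(explicit, implicit)
```

Because the SVM only ever needs dot products between data points, swapping in a kernel function like this lets it work in the transformed space without ever computing the transformation itself.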

List of ideas/concepts/tools that are associated with this topic

  • Classification
  • Supervised learning
  • Linear separability
  • Kernel methods

Preparation for meeting: 

First Watch: Patrick Winston’s lecture on SVMs is one of the easiest to follow and assumes a very minimal background in linear algebra and multivariable calculus: Youtube

Try this out second! You will need Jupyter and the necessary libraries installed. A Python-based implementation of SVM using scikit-learn: Stackabuse

Additional Optional Resources: 


One might like Andrew Ng’s lecture on the same from 2018, which is a bit more recent, but SVMs haven’t changed much over the past decade: Youtube (start from 46:20)

Online tutorials:

To get a stronger grasp on the mathematics behind SVMs and do some ‘hands-on’ work with them I recommend this site: SVM Tutorial

Here’s another Jupyter notebook-based Python implementation of SVMs using scikit-learn: Learnopencv

Applied Papers:

The following is useful for seeing how these tools are used in cognitive science more broadly.

Here are 2 papers that employ SVMs in NLP and cognitive neuroscience settings

Shallow semantic parsing of sentences using SVMs: aclweb

Effective functional mapping of fMRI data using SVMs: ncbi

Theory Papers:

The original SVM paper by Vladimir Vapnik: image.diku 

Jupyter Notebooks Tutorial

Jupyter Notebooks Online Tutorial with Pablo Caceres

The following is a great resource to watch/read at your own pace, and feel free to contact Pablo with any questions.

Blogpost format (with video-lessons embedded)
Video-lesson format playlist with Pablo’s explanations
Jupyter Notebook format on GitHub

Unix Shell Tutorial

UNIX Shell Tutorial with Pablo Caceres
Blogpost Format (dark background): https://pabloinsente.github.io/intro-unix-shell
There are instructions to follow along for Windows and Mac/Linux users, and an online option too. Following along is optional; you can just read if you prefer.
Here is a presentation with resources for Shell, Git, and IDEs by Pablo Caceres: Things that are good to know for Data Science Beginners

R Markdown

Introduction to R Markdown with Gaylen Fronk

R Markdown provides an authoring framework for data science in R. With a single R Markdown file, you can not only write, save, and execute your code but also communicate your process and results with an audience using high-quality, reproducible output formats. 


More detail about R Markdown

R Markdown builds off tools already available in R and RStudio to provide an integrated environment for processing, coding, and communicating. An R Markdown file can include text, chunks of code, images, links, figures, and tables. While you’re working in your RStudio environment, your file operates similarly to a normal R script (a .R file) – you can write, edit, and evaluate code to work with your data. At any point, you can “knit” your file. Knitting runs, evaluates, and compiles your R Markdown file into your desired output (e.g., HTML, PDF) to create a single document that includes all the components of your written file plus the results. This knit file is ready for high-quality scientific communication with any audience. If you’ve ever seen nice examples of R code and output online, it was probably made using R Markdown.


Why should I use R Markdown? 

R Markdown is particularly helpful if…

  • You already work in R or RStudio and would like some additional tools at your disposal
  • You value reproducible output
  • You would like to be able to share your work with people who are less familiar with R (or coding more generally)

R Markdown combines the data wrangling and analytic tools of R with high-class scientific communication. It can become your one-stop-shop for sharing your data science.


Prepare for the LUCID/PREP Data Science Workshop on R Markdown:

In preparation for our video meeting next week (Wednesday 7/1 at 4pm CST), please watch, read, or review the following materials.

  1. Begin with this 1-minute video of what’s possible with R Markdown.
  2. Read Chapter 1 (Installation) from R Markdown: the Definitive Guide (Note: you should have R & RStudio installed prior to our workshop. Confirm in advance that you can open these applications.)
  3. Read Chapter 2 (Basics) from R Markdown: the Definitive Guide
  4. Read this section of Chapter 3 (Outputs: HTML) from R Markdown: The Definitive Guide
  5. Review this cheat sheet and have it handy for our meeting


Optional additional resources if you’re interested in learning more:

  • This paper from the Statistics area of arXiv.org discusses how R Markdown can improve data science communication workflow. It’s perfect for people interested in understanding why R Markdown may be beneficial and receiving examples of its use-cases. 
  • This online book contains lessons on R Markdown basics, specific output formats, in-line and chunk code, tables, interactive websites, presentations, using multiple coding languages, and more. It’s perfect for someone looking for a comprehensive (yet still quite succinct) tutorial on using R Markdown.
  • The Communication section from the R for Data Science online book includes several chapters on R Markdown (the tidyverse’s preferred method for statistical and scientific communication).
  • This online code from GitHub Gist provides an example/walkthrough of using R Markdown.

A note from Gaylen:

If you have questions about these materials or other questions you’d like answered during our workshop, you can submit them via this form. Please try to do this by Tuesday 6/30 at 5pm CST so that I can aggregate questions in advance.

Workshop will be led by Gaylen Fronk. You can email me at gfronk@wisc.edu if you have problems accessing these materials or installing R/RStudio. Looking forward to meeting you all!

Regression using Jupyter Notebooks

Optimization and model regularization with Owen Levin

Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems.  We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.


  • linear regression as an optimization problem
    • introduce loss functions
  • curve fitting as optimization
  • Is a perfect fit actually perfect? (wacky zero loss examples)
  • model regularization
    • small weights
    • sparsity


1. If you haven’t already downloaded Anaconda or another Python distribution, please do so.

2. View this video: Owen’s Regression Intro

3. Jupyter Notebook: Optimization & Regularization

Or check out the GitHub repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop


Mixed Linear Models

Introduction to Mixed Linear Models with Melissa Schoenlein

Mixed linear models are a type of analysis used to evaluate data with non-independence that cannot otherwise be analyzed with regular linear regression.

What is non-independence/non-independent data?

Non-independence occurs when two or more data points are connected (correlated) in some way. For example, you run an experiment collecting ratings on interest in math. Your participants make these ratings at the start of the semester, in the middle of the semester, and then again at the end of the semester. Each of these participants has three data points. These data points are non-independent since they come from the same person and thus are related in ways beyond the experimental procedure (i.e. data points from one participant are more likely to be similar to each other than data points from two different participants).

Non-independence can exist beyond repeated measures at the participant level to any items occurring within “units”, including students in classrooms, family members, etc.
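
The math-interest example can be simulated in a few lines. This is only a sketch under assumed numbers (50 participants, a participant-specific baseline with standard deviation 2, and rating noise with standard deviation 1); none of these values come from the workshop.

```python
import random
import statistics

random.seed(0)

n_participants, n_timepoints = 50, 3
ratings = []
for _ in range(n_participants):
    baseline = random.gauss(5.0, 2.0)   # this participant's overall interest in math
    ratings.append([baseline + random.gauss(0.0, 1.0) for _ in range(n_timepoints)])

# Average variance of the three ratings that come from the same participant.
within = statistics.mean(statistics.variance(r) for r in ratings)
# Variance of all ratings pooled together, ignoring who produced them.
pooled = statistics.variance([x for r in ratings for x in r])

print(round(within, 2), round(pooled, 2))
```

Ratings from the same person vary much less than ratings overall, so the three data points per participant carry overlapping information. That shared participant-level baseline is what a random effect in a mixed model is meant to capture.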

Why/when should I use mixed linear models?

Using regular linear regression when data are non-independent can lead to inflated Type I error rates, less statistical power, and potentially inaccurate effect estimates. A mixed linear model should be used anytime there is a non-independent relationship in the data.

List of ideas/concepts/tools that are associated with this topic

Hierarchical modeling, mixed modeling, linear mixed effects models, multilevel models, etc.


Fixed versus random effects

lme4 package in R


In preparation for our video Wednesday 7/15 at 4pm CST, please watch and read the following materials.

  1. Watch videos 1-3, 11, and 16 from this multi-part video series providing a general overview of mixed models, when to use them, and how to interpret them (totals ~ 12 minutes). Video 11 focuses on repeated measures models, which will be the focus of our workshop.
  2. Skim through this online tutorial that provides a walkthrough of code and output of basic linear mixed effects models in R and why we use them.
  3. Skim through this very short cheat sheet of using the lme4 package in R to analyze mixed models.
  4. Install the following packages in R: lme4, ggplot2

Optional additional resources if you’re interested in learning more:


A high level video overview of mixed models (mostly framed in terms of hierarchical models). The first half of the video describes when/why someone would use these models. The second half starts to touch into the equations/math for these models.


A Github repo with a 3-part workshop aimed at providing tutorials and exercises to learn how to do mixed models in R. The first part is a general intro to R. The second part is about statistical modeling (generally) in R. Then part 3 is mixed models in R.

A similar, but less comprehensive, tutorial demonstrating mixed models in both R and Python.


This paper provides guidelines for how to create linear mixed effects models, including steps on how to decide what random effects to include and how to address convergence issues with a large number of parameters.

Jake Westfall, a former quantitative psychologist who now works in data science/analytics in industry, has curated a list of 13 helpful readings on mixed linear models.



Workshop will be led by Melissa Schoenlein. I can be reached at schoenlein@wisc.edu if there are any issues accessing these materials or if there are any questions (about the workshop, the PREP program, the department, or anything!). Looking forward to meeting this year’s PREPsters!

Data Visualization with Python in Jupyter Notebooks

Data Visualization with Python in Jupyter Notebooks with Pablo Caceres

In this tutorial I will introduce Altair, which is a declarative statistical visualization library for Python based on Vega and Vega-Lite.
Altair provides an elegant and consistent API for statistical graphics. This library is built on top of the Vega-Lite high-level grammar for interactive graphics, which is based on the “grammar of graphics” idea proposed by Leland Wilkinson. Altair’s key strength is the provision of a clear mental model based on a set of graphical primitives and carefully designed combinatorial rules that yield an ample space of graphical displays, avoiding the constraints of chart taxonomies.
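
To give a feel for what “declarative” means here, the sketch below writes a small Vega-Lite specification by hand using only the Python standard library. The data and field names are invented for illustration. In Altair itself the same chart would be expressed through its `alt.Chart(...).mark_point().encode(...)` interface, which generates a specification of this kind for you.

```python
import json

# Made-up data: one row per observation.
rows = [
    {"hours_studied": 1, "score": 52, "group": "A"},
    {"hours_studied": 3, "score": 61, "group": "B"},
    {"hours_studied": 5, "score": 74, "group": "A"},
]

# A chart is declared as data + a mark type + encodings that map
# data fields to visual channels, rather than picked from a menu of chart types.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
        "x": {"field": "hours_studied", "type": "quantitative"},
        "y": {"field": "score", "type": "quantitative"},
        "color": {"field": "group", "type": "nominal"},
    },
}

print(json.dumps(spec, indent=2))
```

Changing `"mark"` to `"bar"` or moving `group` to a different channel yields a different display from the same building blocks, which is the combinatorial idea described above.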
Optional resources
We likely will not have enough time to follow along during the workshop, but you can find instructions to either run the examples online or to install the required packages and run the examples locally here:   https://github.com/pabloinsente/pydata_altair_tutorial


Cross Validation

Cross Validation with Sarah Sant’Ana

Cross validation is a common resampling technique used in machine learning studies. Broadly, cross validation involves splitting data into multiple training and testing subsets to increase generalizability of the model building and evaluation processes. There are multiple types of cross validation (e.g. k-fold, bootstrapped), but all serve two primary purposes:

  • To select the best model configurations (e.g. what type of statistical model will perform best, which sets of features will perform best, covariate selection, hyperparameter tuning, outlier identification approaches, predictor transformations, and more).
  • To evaluate the expected performance of our models in new data (i.e. on individuals who were never used in model building/selection)
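
As a minimal sketch of the first purpose (choosing between model configurations), the code below runs 5-fold cross validation by hand to compare two candidate models. The two models, the toy data, and all function names are assumptions made for illustration; in practice you would use a package such as caret or scikit-learn.

```python
def k_fold_indices(n, k):
    """Split range(n) into k folds of near-equal size (no shuffling, for clarity)."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def fit_mean(xs, ys):
    """Model A: ignore x and always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Model B: least-squares line through the training data."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def cv_error(fit, xs, ys, k=5):
    """Mean squared error on held-out folds, averaged over the k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)   # fit only on the training folds
        fold_errors.append(sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx) / len(test_idx))
    return sum(fold_errors) / k

xs = list(range(20))
ys = [2 * x + (1 if x % 2 else -1) for x in xs]   # linear trend plus a small wiggle

print(cv_error(fit_mean, xs, ys), cv_error(fit_line, xs, ys))
```

The model with the lower held-out error is the one expected to generalize better. Note that once the held-out error has been used to pick a model, it is no longer an unbiased estimate of performance in new data, which is why nested CV or a separate validation set is recommended for the second purpose.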


Why should I use cross validation? 

You should use cross validation if…

  • You are fitting a statistical model with hyperparameters that need tuning (e.g. elastic-net logistic regression, random forests, SVM)
  • You are considering multiple combinations of model configurations (e.g.  features, statistical algorithms, data transformations)
  • You want to consider a large number of predictive features or you do not want to rely on theory to guide identification of predictive features
  • You want to build predictive models that will generalize well to new data (i.e. you want your model to be applied in some way)


List of ideas, concepts, or tools that are associated with this topic

  • R/RStudio (especially the caret package, tidymodels, and parsnip packages)
  • Python
  • Common types of cross validation (CV): bootstrapped CV, k-fold CV, nested CV
  • Basic knowledge of linear and logistic regression
  • Bias/variance trade offs in model fitting and evaluation
  • Generalizability of predictive models (why it’s important, how to prioritize it, and how to assess it)


In preparation for our meeting next Tuesday, please review the following materials:

During the meeting on Tuesday

  • Plan on a discussion about prediction vs explanation in psychological research. I want to help you think of how you might apply cross validation in your work if you are interested 😊
  • I will be walking us through the attached Cross Validation Markdown document (open this link then download, google will default to open as a g-doc that is not functional) to provide you some code for implementing cross validation. No need to read this beforehand, but you can have it open during the session if you’d like to follow along.
  • Feel free to send me any questions beforehand or ask during the session! Happy to talk research, data science, or grad school as would feel beneficial to you all. My email is skittleson@wisc.edu


Additional Materials (not required, just for your reference)


Online tutorials (blogs and code examples):

  • This is an R Markdown file written by the creator of the caret package in R (one of the most used machine learning packages in R to date). It explains how to tune the various types of hyperparameters using CV within caret’s train function. Even if you don’t plan to use R, it is helpful to see what types of parameters are tuned for different models, and it provides examples of creating and evaluating search grids, alternate performance metrics, and more. Model training and tuning
  • This is a nice (but lengthy) R Markdown example of approaching a classic machine learning problem (product price estimation) and showcases hyperparameter tuning of a couple of different algorithms (and their comparison): Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial. This is geared towards a  more advanced beginner – It still walks you through everything, but incorporates more robust data cleaning and exploration before model fitting.


  • This video is a good walkthrough of using k-fold cross-validation in Python to select optimal tuning parameters, choose between models, and select features: Selecting the best model in scikit-learn using cross-validation
  • A short 4 minute tutorial about how to tune various types of statistical learning models within cross validation using the caret package in R. It doesn’t discuss much of the theory and is more appropriate for application focused users who are just trying to figure out how to implement parameter tuning within CV: R Tutorial – Hyperparameter tuning in caret


  • This paper describes the impact of using different CV types for parameter selection and model evaluation: Bias in error estimation when using cross-validation for model selection. This requires intermediate level understanding of using CV for parameter selection. Many people using machine learning in applied contexts are using improper CV methods that bias their model performance estimates. We should be using nested CV (or bootstrap CV with a separate validation set) if we are planning to select model parameters and generate trustworthy performance metrics.
  • Really cool preprint that describes sources of bias in ML resampling methods due to incorrect application in psychological research https://psyarxiv.com/2yber/. A more intermediate level read because it requires some understanding of multiple types of CV methods.

Neural Networks

Neural Networks with Ray Doudlah

Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, a computational model inspired by the brain.

With recent advancements in computing power and the explosion of big data, we can now implement large models that are capable of learning how to accomplish a task by themselves, just by looking at the data you feed them. These deep learning models learn to extract the features they find important for accomplishing the task.
In this week’s session we will be learning about neural networks, and get to play with a convolutional neural network, a model that is used in machine vision, object recognition, and self-driving cars. The topic of neural networks is very broad, so my goal is to give you a brief overview and provide you with enough resources so that you can learn more about specific models that may be applicable to your scientific work. I also want to give you some hands-on practice with running a pre-built model so you can get an intuition for what these models are doing under the hood.
Session outline:
  • Introduce neural networks and their general architecture
  • Introduce convolutional neural networks
  • Implement a convolutional neural network to solve a handwriting recognition task
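
For intuition about what the “convolutional” part does under the hood, here is a small sketch of a single filter sliding over a tiny image. It is not part of the session’s pre-built model: the image, the filter values, and the function name `convolve2d` are made up, and in a real network the filter values are learned from data rather than chosen by hand.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over an image (valid positions only, no padding).

    As in most deep learning libraries, the kernel is not flipped,
    so this is technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A tiny 4x4 "image": dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The output is large only where the filter sits on top of the edge and zero elsewhere. A convolutional network stacks many such filters so that later layers can respond to combinations of simple features like strokes in a handwritten digit.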
Preparation for the workshop:


Overleaf by Glenn Palmer

LaTeX is a typesetting system that can be used to write academic papers and create professional-looking documents. Users type in plain text format, but mark up the text with tagging conventions, and the nicely-formatted result is shown in an output file. Overleaf is an online platform that can be used to create and edit LaTeX documents. You can share and simultaneously edit documents with collaborators, similar to the way you collaborate on a Google Doc.

For a high-level overview of LaTeX, Overleaf, and the resources below watch this video:



  • This playlist of videos is a good starting place. They were made by a company called ShareLaTeX, which recently merged with Overleaf. These videos give a good idea of how to get started using LaTeX with an online editing system.

Online tutorials

  • For more detail, and/or for a range of written tutorials, the Overleaf documentation page has a wide range of information to help get started, or to answer specific questions you might have as you get used to using LaTeX.

Cheat sheet

  • For a quick reference as you’re writing, this cheat sheet includes a bunch of commands for various formatting options, with a focus on writing scientific papers.

Resources and Sessions from 2019:


Introduction to Data Science with R

Session 1: Introduction to data science with R with Tim Rogers

This session will introduce you to working with data in an “integrated development environment” or IDE using the freely available and widely-used software package R. We will briefly discuss what is meant by the term “data science,” why data science is increasingly important in Psychology and Neuroscience, and how it differs from traditional statistical analysis. We will then get a sense for how IDEs work by building, from data generated in the workshop, an interactive graph showing the structure of your mental semantic network.

Preparation for the workshop: (TO DO before arriving on Tuesday!)
– Install R, R Studio, and Swirl on your laptop following the instructions here: swirlstats.com/students
– Start Swirl as instructed at the website and install the first course module by following the prompts
– Run yourself through the first course module

Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!

We learned how to create semantic clusters from lists of animals. Tim created this Semantic Network Demo to view the interactive graph and get the code that was used to generate the semantic clusters. The demo walks through the process of building and visualizing graphs.

Using Github & Jupyter Notebooks

Session 2: Using Github, JuPyteR notebooks in several data science environments with Pablo Caceres

In this session we will set up several data science tools and environments: ATOM text editor, Python with Anaconda, Jupyter Notebooks/lab, IRKernel (to run R on Jupyter), Git (Mac/Linux) or GitBash (Windows), GitHub account, GitHub Repository, and a folder system. Then we will go over the basics of how to open, run and test each tool.

Preparation for the workshop:

– Download and install Atom text editor from atom.io
– Download and install Git* from git-scm.com/downloads
Windows users: when installing Git, make sure the ‘Git Bash Here’ option is selected.
Pablo created a Github Repository for the workshop. If you click on session 2 you will find all the topics that Pablo covered in this session. Stay tuned as we plan to update this repository with more content.

Fitting & Evaluating Linear Models

Session 3: Fitting and evaluating linear models with John Binzak

This session will introduce you to working with linear regression models using R. We will briefly discuss why linear regression is useful for Psychology and Educational research, using the topic of numerical cognition as an example.  We will play an educational game to generate our own data in the workshop, form predictions, and test those predictions by modeling gameplay performance. Through this exercise we will cover how to fit linear regression models, assess the fit of those models, plot linear relationships, and draw statistical inferences.
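
The session itself used R. As a language-neutral sketch of what “fitting” and “assessing the fit” involve, the Python snippet below computes a least-squares line and its R² from scratch. The gameplay numbers and the function name `linear_fit` are invented for illustration and are not the workshop data.

```python
def linear_fit(xs, ys):
    """Return (intercept, slope, r_squared) for a least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of the variance in y accounted for by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Made-up gameplay data: rounds played vs. estimation error.
rounds = [1, 2, 3, 4, 5, 6]
error = [9.0, 8.1, 6.8, 6.2, 4.9, 4.0]

intercept, slope, r2 = linear_fit(rounds, error)
print(round(intercept, 2), round(slope, 2), round(r2, 3))
```

A negative slope here would mean error shrinks with practice, and an R² near 1 would mean a straight line describes that trend well. In R, `lm()` reports the same quantities along with the standard errors needed for statistical inference.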

Preparation for the workshop: 
– Be ready to use R, R Studio, and Swirl on your laptop following the instructions here: swirlstats.com/students
– Install the “Regression Models” swirl module using the following commands in R:
> library(swirl)
> swirl::install_course("Regression Models")
> swirl()

– Run yourself through lessons 1-6 (Introduction-MultiVar Examples) and continue based on your interest.

Time to complete: 45-60 minutes. Feel free to work with a partner or in groups!

Optimization & Model Regularization

Session 4: Optimization and model regularization with Owen Levin

Linear regression and many other model fitting problems can be viewed mathematically as solutions to optimization problems.  We’ll explore how this can help generalize our models as well as how we can introduce regularization to emphasize fitting models with special properties.

  • linear regression as an optimization problem
    • introduce loss functions
  • curve fitting as optimization
  • Is a perfect fit actually perfect? (wacky zero loss examples)
  • model regularization
    • small weights
    • sparsity
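
The difference between the two regularization goals in the outline can be seen with a single weight at a time. The sketch below assumes the simplest possible setting, where each weight can be penalized independently of the others (as with uncorrelated, standardized features); the function names, the starting weights, and the penalty strength are all illustrative assumptions rather than material from the session.

```python
def shrink_l2(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: scales a toward zero."""
    return a / (1.0 + lam)

def shrink_l1(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): the soft-threshold rule."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

unregularized = [3.0, -0.2, 0.05, -1.5]   # made-up least-squares weights
lam = 0.5

small = [shrink_l2(a, lam) for a in unregularized]
sparse = [shrink_l1(a, lam) for a in unregularized]
print(small)    # every weight is smaller, none is exactly zero
print(sparse)   # weights with |a| <= lam are set exactly to zero
```

A squared penalty makes all weights smaller but keeps every feature in the model, while an absolute-value penalty removes weak features entirely, which is what “sparsity” refers to.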

Preparation: If you haven’t already downloaded Anaconda or another Python distribution, please do so.

Overview: Please check out the GitHub repository for session 4 on Optimization and Regularization. github.com/pabloinsente/LUCID_data_workshop

Pattern Recognition & Varieties of Machine Learning

Session 5: Pattern recognition and varieties of machine learning with Ashley Hou

Owen and Ashley will be co-facilitating this session.

This session will introduce basic concepts in machine learning. We will first discuss an overview of the steps involved in the machine learning process and the two main categories of machine learning problems. Then, we will walk through examples in both supervised and unsupervised learning, specifically classification using SVMs (discussing the regularization perspective) and clustering using the k-means clustering algorithm. We will conclude with brief discussion on other popular machine learning algorithms, when to use them, and good resources to learn more.
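
As a rough sketch of the unsupervised half of the session, here is the k-means loop written out for one-dimensional data. The points and starting centers are made up, and the function name `k_means_1d` is an assumption; the session materials used scikit-learn rather than hand-written code.

```python
def k_means_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        centers = [
            sum(c) / len(c) if c else centers[i]   # keep an empty cluster's center
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(points, centers=[0.0, 2.0])
print(centers)
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from alternating the assignment and update steps, which is what separates clustering from a supervised method like an SVM.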

Preparation for the workshop: 1. Review session 4’s overview. 2. Have a working Python 3 distribution with scikit-learn, matplotlib, numpy, pandas, and Jupyter Notebook installed.


Session 6: Cross-validation with Sarah Sant’Ana

Today’s session will introduce the concept of cross validation. Using instructional videos from the Datacamp Machine Learning toolbox, we will walk through basic examples of cross validation in R using the caret package. We will be using two publicly available data sets in R for example code.

Our goals for this session are:

1) Learn why cross validation is important
2) Learn the basic steps of k-fold cross validation and repeated k-fold cross validation
3) Provide you with basic code to use on your own
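
The session’s examples were in R with the caret package. As a small, hedged illustration of how repeated k-fold differs from plain k-fold, the Python sketch below only generates the train/test index splits; the function name `repeated_k_fold` and the sizes used are assumptions for illustration.

```python
import random

def repeated_k_fold(n, k, repeats, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n))
        rng.shuffle(order)                      # new random partition each repeat
        folds = [order[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, sorted(train_idx), sorted(test_idx)

splits = list(repeated_k_fold(n=10, k=5, repeats=2))
print(len(splits))            # 5 folds x 2 repeats = 10 train/test splits
r, f, train_idx, test_idx = splits[0]
print(train_idx, test_idx)
```

Within one repeat every observation is held out exactly once. Repeating with a fresh shuffle gives additional performance estimates, so the average depends less on one particular random partition.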

Preparation for the workshop:

– Be ready to use R and R Studio

– Read Yarkoni & Westfall (2017) through page 5. You can stop reading at “Balancing Flexibility and Robustness: Basic Principles of Machine Learning.” The purpose of this article is just to get you thinking about the discussion we will have during the session; it is not necessary to have a crystal clear understanding!

Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100-1122.

Neural Networks

Session 7: Neural Networks with Ray Doudlah

Machine learning and artificial intelligence technology is growing at an impressive rate. From robotics and self-driving cars to augmented reality devices and facial recognition software, models that make predictions from data are all around us. Many of these applications implement neural networks, which basically allow the computer to analyze data in a way similar to how the human brain analyzes data.

With recent advancements in computing power and the explosion of big data, we can now implement large models that perform end-to-end learning (deep learning). This means that we can create a model, feed it tons and tons of data, and the model will learn features from the data that are important for accomplishing the task.

Session outline:
• Introduce the simplest neural network, the perceptron
• Discuss the general architecture for neural networks
• Implement a neural network to solve a handwriting recognition task
• Introduce deep learning (convolutional neural networks)
• Implement a deep neural network to solve a handwriting recognition task
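
As a minimal sketch of the first outline item, here is a perceptron trained with the classic perceptron learning rule on the logical AND problem. The task, the learning rate, and the function name `train_perceptron` are assumptions chosen to keep the example tiny; they are not the session’s handwriting materials.

```python
def train_perceptron(samples, epochs=10, lr=1.0):
    """Perceptron learning rule for two inputs and a 0/1 target."""
    w1, w2, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            error = target - output          # -1, 0, or +1
            # Nudge the weights only when the prediction was wrong.
            w1 += lr * error * x1
            w2 += lr * error * x2
            bias += lr * error
    return w1, w2, bias

# Logical AND: linearly separable, so the rule is guaranteed to converge.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, bias = train_perceptron(and_gate)

predictions = [1 if w1 * x1 + w2 * x2 + bias > 0 else 0 for (x1, x2), _ in and_gate]
print(predictions)
```

A single perceptron can only draw one straight decision boundary. Larger networks stack many such units in layers so they can handle problems, like recognizing handwritten digits, that are not linearly separable.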

Preparation for the workshop:

  1. Watch the following videos:
  2. Pull session 7 materials from GitHub


Bayesian Inference

Session 8: Bayesian Inference: estimating unobservable variables with Lowell Thompson

This session will focus on introducing the utility of a common statistical method known as Bayesian inference. We’ll focus first on Bayes’ theorem and learn how it relates to our understanding of perception as an inverse problem. Since the majority of research in perception relies on various psychophysical methodologies to assess behavior, we’ll also walk through how you might generate your own experiments in Python using a package called PsychoPy. After obtaining some data, we’ll look at a specific example that illustrates the utility of Bayesian inference in modeling our own behavioral data. Lastly, we’ll go over Bayesian inference in the broader context of data science.

Session Outline:

  1. Introduce Bayes’ theorem
  2. Understand the utility of Bayesian inference in a variety of contexts
  3. Learn the basics of PsychoPy to create basic experiments
  4. Use your own data from an orientation discrimination task to illustrate how Bayesian inference can be used.
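
As a small illustration of the first outline item, the sketch below applies Bayes’ theorem to a two-alternative orientation judgment. The prior and likelihood values are made up, and the function name `posterior` is an assumption; the session’s own analysis of PsychoPy data may look quite different.

```python
def posterior(prior, likelihood):
    """Bayes' theorem for a discrete set of hypotheses.

    prior[h]      = P(h)
    likelihood[h] = P(observation | h)
    returns       P(h | observation) for every h
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())      # P(observation)
    return {h: p / evidence for h, p in unnormalized.items()}

# Made-up numbers: the observer expects leftward tilts more often,
# but the noisy sensory signal is more consistent with a rightward tilt.
prior = {"left": 0.7, "right": 0.3}
likelihood = {"left": 0.2, "right": 0.6}

belief = posterior(prior, likelihood)
print(belief)
```

The tilt itself is never observed directly; only the noisy signal is. Combining the signal with prior expectations gives a graded belief about the unobservable variable, which is the sense in which perception is treated as an inverse problem.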

Preparation: Please try to install PsychoPy on your computer prior to the session, and try running one of their tutorials to make sure it works: Psychopy