BIOS-823-2020 Statistical Programming for Biomedical Big Data

Contents:

Python review of concepts
- Python as a language
- Coding in Python
Using numpy
- NDArray
- Indexing and slices
- Matrix multiplication
- Conditional replacement with where
- Array creating functions
- Reductions (margins)
- Broadcasting
- Universal functions (ufunc)
- Einstein summation notation
- Random module (cf. SciPy)
- Linear algebra submodule (cf. SciPy)
- Masked array
- Memory mapping
Data Science: Data processing
- Introduction to pandas
- Series and Data Frames
- Creating Data Frames
- Indexing Data Frames
- Structure of a Data Frame
- Selecting, Renaming and Removing Columns
- Selecting, Renaming and Removing Rows
- Transforming and Creating Columns
- Sorting Data Frames
- Summarizing
- Split-Apply-Combine
- Combining Data Frames
- Fixing common DataFrame issues
- Reshaping Data Frames
- Pivoting
- Functional style - apply, applymap and map
- Chaining commands
- Moving between R and Python in Jupyter
Exploratory visualization in pandas
- Using pandas-bokeh
- More controlled visualizations
- Grammar of graphics in Python
- Similar plot in seaborn
Graphics and Visualization in Python
- Pandas
- Matplotlib
- Seaborn
- plotnine
Saving and sharing data
- Serialization
- Python native data formats
- Portable data formats
- YAML
Normalization
- 1NF
- 2NF
- 3NF
Final tables
Denormalization
Relational Databases Overview
- 0. Packages for working with relational databases in Python
- Motivation
- RDBMS
- What is a database?
- Concepts
- Design
- Database administration
- CRUD
- Database normalization
- OLTP and OLAP
- Generating ER diagrams
- Robustness and scaling
SQL Queries 01
- Data
- Basic Structure
- User defined functions (UDF)
SQL Queries 02
- Create toy data set
- Subqueries
- Common table expressions (CTE)
- Window Functions
MongoDB
- Concepts
- Set up
- Get Data
- Insertion
- Queries
- Aggregate Queries
- Geospatial queries
- Indexes
Redis
- Concepts
- Connect to database
- Clear database
- Simple data types
- Complex data types
Graph concepts
- elements
- graph properties
- vertex properties
- edge properties
- Graph representations
- Some examples
- Visual representations of the same graph may look very different
Graph Algorithms
- Search
- Pathfinding
- Minimal spanning tree
- Centrality
- Community Detection
Machine Learning for Data Scientists
- A. Understanding ML
- B. ML stages
ML model examples
- Dimension reduction
Dimension Reduction
- PCA from scratch
- Geometry of PCA
- Algebra of PCA
Unsupervised Learning
- Data
- Dimension reduction
- Other dimension reduction methods
- Limitations of PCA
Clustering
- Toy example to illustrate concepts
- How to cluster
- Model selection
- Comparing across samples
Processing data
- Basic inspection
- Detailed inspection
- Create new features
- Drop features
- Inspect for missing data
- Fill in missing values for categorical values
- Tangent: catboost is nice
- Category encoding
- Split data into train and test data sets
- Category encoding
- Impute missing numeric values
- Standardize data
- Save processed data for future use
Data
- Gather data
- Baseline model
- Evaluate model families
Classification
- Stacking
- Create a model
- Optimize model
- Confusion matrix
- ROC curve
- Precision-recall curve
- Learning curve
- Model persistence (and deployment)
Imbalanced data
- Simulate an imbalanced data set
- Collect more data
- Use evaluation metrics that are less sensitive to imbalance
- Over-sample the minority class
- Under-sample the majority class
- Combine over- and under-sampling
- Use class weights to adjust the loss function
- Use a classifier that is less sensitive to imbalance
Hyperparameter tuning
- Load data
- Using sklearn
- Using scikit-optimize
- Using optuna
- Using pycaret
Interpretable ML
- Interpretable models
- Intrinsically interpretable models
- Model Agnostic Methods
Functional programming in Python (operator, functional, itertools, toolz)
- Pure functions
- Recursive functions
- Anonymous functions
- Lazy evaluation
- Higher order functions
- Example: Flattening a nested list
- Closure
- Decorators
- Partial application
- Using operator
- Using functional
- Using itertools
- Using toolz
Tensorflow
- Working with tensors
- Tensorflow probability
- Regression
- Tensorflow Data
Tensorflow
- Keras
- Building blocks
Deep Learning Models
- Building models
- Built-in models and transfer learning
- Custom methods
- Hyperparameter optimization
- Interpretable deep learning
Deep Learning Models
- Building models
- Hyperparameter optimization
Spark Low Level API
- Outline
- Architecture of a Spark Application
- SparkContext
- Resilient Distributed Datasets (RDD)
- Actions and transforms
- Working with key-value pairs
- Persisting data
Spark High-Level API
- Overview of Spark SQL
- Create a Spark DataFrame
- Data manipulation
- Filter
- Sort
- Transform
- Summarize
- Group by
- Window functions
- SQL
- String operations
- UDF
- Joins
- DataFrame conversions
Using Spark Efficiently
- Resources
- Shared variables
- The Spark Shuffle and Partitioning
Spark SQL
- Standard SQL Queries
- Functions
- Complex Types
Spark GraphFrames
- Utility plotting function
- Nodes and edges are just DataFrames
- Motifs
- Load graph
- Shortest paths to landmarks
- Saving and loading GraphFrames
Spark MLLib
- Set up Spark and Spark SQL contexts
- Vectors
- Manual construction of an ML DataFrame
- Using VectorAssembler
- Generating simple statistics
- Split data
- Encoding categorical features
- Scaling
- Dimension reduction
- Clustering
- Model evaluation
- Pipelines
- Hyper-parameter optimization
- Using spark with a non-MLLib classifier
Spark Structured Streaming
- Data set
- Start Spark
- Review of standard Spark processing with DataFrames and SQL
- Structured Streaming Concepts
- Structured Streaming code
- Transformations on streams