Course Outline
Day 1
Python refresher [2 hrs]
The Python interpreter
Python Data Types
Data and type introspection basics
Control structures
Functions
Classes
Errors and exceptions
Regular expressions
Data Analytics
Data Ecosystem in Python [2 hrs]
SciPy
NumPy
Pandas
Matplotlib
IPython
Jupyter
NumPy [2 hrs]
C-Python integration
C data in Python
NumPy arrays
Dtypes
Shape
Reshape
NumPy array operations/operators
NumPy ‘mapped’ functions
Run-time comparison with Python lists, etc.
Day 2
Pandas [2.5 hrs]
Tabular data
DataFrames
Series
Indexes
Importing data: read_csv(), read_excel(), read_json(), Avro files (via a third-party library)
Exporting data: to_csv(), etc.
DataFrame row filtering operations
DataFrame str functions
Setting Indexes, multi-index dataframes
Sorting: sort_values(), sort_index()
groupby()
stack(), unstack()
pivot() and pivot_table()
Time series
Re-sampling
Interpolation
Side-by-side comparisons
Integration with matplotlib
Key plotting attributes and args
Handling NaNs: fillna(), etc.
Data Visualization: Matplotlib, pyplot [2.5 hrs]
Types of charts
Line Plot
Scatter Plot
Bar Charts
Histograms
Pie charts
Box plots
Candlestick charts for financial data
Chart attributes: axes, grid, legends, title
Colors, gradients
Multiple plots and figures
Axes of plots
NumPy integration with pyplot
Pandas integration with pyplot
SciPy [1.5 hrs]
Non-standard data types and SciPy
SciPy and NumPy ndarray
scipy.stats
scipy.interpolate
Statistics concepts
Central Tendency
Spread
Mean, median, mode
Quartiles
Rolling averages
Interpolation
Distributions
Curve Fitting
Root Mean Squares
Day 3
scipy.weave [1.5 hrs]
C/C++ integration: weave
weave.inline()
weave.blitz()
SWIG
weave.ext_tools()
C code as python strings
Blitz_type_factories
Scalar_spec
Weave parser and translate_symbols()
Benchmarking
Machine Learning
Intro & Setup [0.5 hrs]
Unsupervised and supervised learning
Scikit
scikit-learn (sklearn)
High-level patterns in the classes and APIs
fit()
transform()
predict()
score()
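The fit() / predict() / score() convention listed above can be illustrated without any library. The sketch below is a toy, pure-Python classifier written only to show the shape of the pattern; the class name ToyCentroidClassifier and the sample data are hypothetical and are not part of scikit-learn. transform() is omitted because this toy model has no preprocessing step.

```python
class ToyCentroidClassifier:
    """Hypothetical 1-D classifier that mimics the fit/predict/score pattern."""

    def fit(self, X, y):
        # Learn one centroid (mean value) per class label from the training data.
        sums, counts = {}, {}
        for x, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self  # returning self allows chaining: Model().fit(X, y)

    def predict(self, X):
        # Assign each sample to the class whose centroid is closest.
        return [
            min(self.centroids_, key=lambda label: abs(x - self.centroids_[label]))
            for x in X
        ]

    def score(self, X, y):
        # Accuracy: fraction of samples whose prediction matches the true label.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Made-up training data: small values are "low", large values are "high".
X_train = [1.0, 2.0, 9.0, 10.0]
y_train = ["low", "low", "high", "high"]

model = ToyCentroidClassifier().fit(X_train, y_train)
print(model.predict([3.0, 8.0]))
print(model.score([0.0, 11.0, 5.0], ["low", "high", "high"]))
```

Learned state is stored in an attribute with a trailing underscore (centroids_), following the naming convention scikit-learn estimators use for fitted attributes.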
Classification [2 hrs]
Introduction to idea of observation based learning
Distances and similarities
k Nearest Neighbours (kNN) for classification
Regression with kNN & SVM
Focus on Support Vector Machine (SVM) kernels and their use
Regression [1 hr]
Linear Regression
Regularization of Generalized Linear Models
Logistic Regression
Methods of threshold determination and performance measures for classification score models
Unsupervised learning [2 hrs]
Need for dimensionality reduction
Principal Component Analysis (PCA)
Difference between principal components and latent factors
Factor Analysis
Hierarchical, K-means & DBSCAN Clustering, Gaussian Mixture Models
SVD
Clustering Use Cases
Day 4
Tree Models [2 hrs]
Introduction to decision trees
Tuning tree size with cross validation
Introduction to bagging algorithm
Random Forests
Grid search and randomized grid search
ExtraTrees (Extremely Randomised Trees)
Partial dependence plots
Intro to Boosting Algorithms [1.5 hrs]
Ensemble Learning
Concept of weak learners
Introduction to boosting algorithms
Adaptive Boosting
Natural Language Processing
Tokenization [1 hr]
Regular Expressions with re module
re.search() and re.findall()
re.split()
nltk.tokenize
word_tokenize()
sent_tokenize()
Non-ASCII tokenization
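As a taste of the regular-expression part of this session, the sketch below applies re.search(), re.findall() and re.split() to a single sentence using only the standard library. The sample text and the patterns are illustrative assumptions, not course material.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Dr. Smith paid $42.50 on 3 May. Was it worth it?"

# re.search() returns the first match object (or None if nothing matches).
first_number = re.search(r"\d+(?:\.\d+)?", text)

# re.findall() returns every non-overlapping match as a list of strings.
words = re.findall(r"[A-Za-z]+", text)

# re.split() cuts the string wherever the pattern matches; here, on
# whitespace that follows '.' or '?' (a naive sentence splitter).
sentences = re.split(r"(?<=[.?])\s+", text)

print(first_number.group())
print(words)
print(sentences)
```

The naive split also breaks after the abbreviation "Dr.", which is the kind of case that motivates a dedicated sentence tokenizer over a hand-written pattern.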
Topic Identification [1 hr]
Word counting
Introducing corpora
Gensim
Bag-of-words
Introducing TF-IDF
TF-IDF with Gensim
Day 5
Named Entity Recognition [2 hrs]
NER with NLTK
Stanford Library with NLTK NER
spaCy
spaCy vs NLTK
spaCy NER categories
polyglot: multilingual NER
Exercise: French and Spanish NER
NLTK for classification [1.5 hrs]
Feature extraction
Train and test sets
CountVectorizer
TfidfVectorizer
Exercise: fake news detector
Web Scraping [2 hrs]
BeautifulSoup module
bs4 module
prettify()
HTML tags overview: <head>, <body>, <h1>, <a href>, <title>, <p>, etc.
Tag properties
DOM
Object attributes
.title
.p
.parent
.children
.name
.contents
.strings
Dict based lookup
tag['id']
Multi-valued attributes
.find()
.find_all()
.get()
Day 6 (Optional)
Distributed Applications: Hadoop & Spark
Working knowledge of the following is assumed:
Hadoop architecture, HDFS, MapReduce
PySpark sub-modules: sql, streaming, ml, mllib
RDDs, DataFrames, Datasets
Pydoop [1.5 hrs]
pydoop.hdfs API
Mappers, reducers and combiners
Pipes
Record readers and writers
Partitioners and Combiners
Pydoop command line
Simulator API
Spark Data Processing Use Cases [1.5 hrs]
Graph Processing and Analysis
pyspark.ml and pyspark.mllib
Example: k-means
Spark Applications over Hadoop [2 hrs]
Spark Applications vs. Spark Shell
Importing modules on executor nodes
Complex dependencies: native code in eggs
Heterogeneous cluster complexities and solutions
--py-files and addPyFile()
Virtualenvs
ClusterSSH & ParallelSSH
Anaconda cluster
Preview: Spark SQL [1 hr]
Spark SQL and the SQLContext
Creating DataFrames
Transforming and Querying DataFrames
Saving and restoring DataFrames
De-brief [1 hr]
Suggested approaches for digging deeper
Avoiding confusions
Conquering complexity with isolation
Future references
Summary, wrap-up, Q&A [1 hr]