Python for Data Science

Course Overview

A Data Scientist combines statistical and machine learning techniques with Python programming to analyze and interpret complex data.
This course will establish your expertise in data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and gain deep knowledge in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.

Learn to visualize real data with Matplotlib's functions and get to know new data structures such as the dictionary and the pandas DataFrame. After covering key concepts such as Boolean logic, control flow and loops in Python, you will be ready to combine everything you have learned to solve a case study using hacker statistics.

Target Audience

Analytics professionals who want to work with Python
Software professionals looking to get into the field of analytics
IT professionals interested in pursuing a career in analytics
Graduates looking to build a career in analytics and data science
Experienced professionals who would like to harness data science in their fields
Anyone with a genuine interest in the field of data science

Course Objectives

The Data Science with Python course will furnish you with in-depth knowledge of the various libraries and packages required to perform data analysis, data visualization, web scraping, machine learning and natural language processing using Python.

Python has surpassed Java as the top language used to introduce US students to programming and computer science, and 46 percent of data science jobs list Python as a required skill.

Course Prerequisites

Python basics

Course Outline

Day 1

Python refresher [2 hrs]

The Python interpreter

Python Data Types

Data and type introspection basics

Control structures

Functions

Classes

Errors and exceptions

Regular expressions

Data Analytics

Data Ecosystem in Python [2 hrs]

SciPy

NumPy

Pandas

Matplotlib

IPython

Jupyter

NumPy [2 hrs]

C-Python integration

C data in Python

NumPy arrays

Dtypes

Shape

Reshape

NumPy array operations/operators

NumPy ‘mapped’ functions

Run-time comparison with Python lists, etc.

Day 2

Pandas [2.5 hrs]

Tabular data

DataFrames

Series

Indexes

Importing data: read_csv(), read_excel(), read_json(), and Avro files

Exporting data: to_csv(), etc.

DataFrame row filtering operations

DataFrame str functions

Setting Indexes, multi-index dataframes

Sorting: sort_values(), sort_index()

groupby()

stack(), unstack()

pivot() and pivot_table()

Time series

Re-sampling

Interpolation

Side-by-side comparisons

Integration with matplotlib

Key plotting attributes and args

Handling NaNs: fillna(), etc.

Data Visualization: Matplotlib, pyplot [2.5 hrs]

Types of charts

Line Plot

Scatter Plot

Bar Charts

Histograms

Pie charts

Box plots

Candlestick plot for financial data

Chart attributes: axes, grid, legends, title

Colors, gradients

Multiple plots and figures

Axes of plots

NumPy integration with pyplot

Pandas integration with pyplot

SciPy [1.5 hrs]

Non-standard data-types and SciPy

SciPy and NumPy ndarray

scipy.stats

scipy.interpolate

Statistics concepts

Central Tendency

Spread

Mean, median, mode

Quartiles

Rolling averages

Interpolation

Distributions

Curve Fitting

Root Mean Squares
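
As a rough preview of the statistics concepts listed above, the sketch below computes central tendency, spread, quartiles, a rolling average and a root mean square. It deliberately uses only Python's standard library (the statistics and math modules, Python 3.8 or later) rather than SciPy, and the sample values and helper names (rolling_average, root_mean_square) are assumptions made up for illustration.

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
samples = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(samples)        # central tendency
median = statistics.median(samples)
mode = statistics.mode(samples)
spread = statistics.pstdev(samples)    # population standard deviation
quartiles = statistics.quantiles(samples, n=4)  # three cut points


def rolling_average(values, window):
    """Return the mean of each consecutive window of the given size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def root_mean_square(values):
    """Square each value, average the squares, then take the square root."""
    return math.sqrt(sum(v * v for v in values) / len(values))


rolling = rolling_average(samples, window=3)
rms = root_mean_square(samples)

print(mean, median, mode, quartiles)
print(rolling, rms)
```

The same quantities are available in NumPy, pandas and scipy.stats, which are likely the better choice for larger arrays.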

Day 3

Scipy.weave [1.5 hrs]

C/C++ integration: weave

weave.inline()

weave.blitz()

SWIG

weave.ext_tools()

C code as python strings

Blitz_type_factories

Scalar_spec

Weave parser and translate_symbols()

Benchmarking

Machine Learning

Intro & Setup [0.5 hrs]

Unsupervised and supervised learning

Scikit

scikit-learn (sklearn)

High-level patterns in the classes and APIs

fit()

transform()

predict()

score()

Classification [2 hrs]

Introduction to the idea of observation-based learning

Distances and similarities

k Nearest Neighbours (kNN) for classification

Regression with kNN & SVM

Focus on Support Vector Machine (SVM) kernels and their use
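
To make the link between distances and k Nearest Neighbours concrete, here is a minimal sketch of a kNN classifier in plain Python. It is not scikit-learn code: the class name TinyKNNClassifier and the training data are hypothetical, and the fit(), predict() and score() methods only imitate the naming pattern mentioned in the Intro & Setup topics. It assumes Python 3.8 or later for math.dist().

```python
import math
from collections import Counter


class TinyKNNClassifier:
    """Toy k-nearest-neighbours classifier (illustrative, not scikit-learn)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # kNN is a lazy learner: fitting only stores the observations.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict(self, X):
        return [self._predict_one(point) for point in X]

    def score(self, X, y):
        # Fraction of points whose predicted label matches the true label.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)

    def _predict_one(self, point):
        # Rank stored observations by Euclidean distance to the query point.
        by_distance = sorted(
            zip(self.X_, self.y_),
            key=lambda pair: math.dist(pair[0], point),
        )
        nearest_labels = [label for _, label in by_distance[: self.k]]
        # Majority vote among the k nearest neighbours.
        return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training data: two well-separated groups.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
y_train = ["low", "low", "low", "high", "high", "high"]

model = TinyKNNClassifier(k=3).fit(X_train, y_train)
predictions = model.predict([(1.2, 1.1), (8.7, 8.4)])
accuracy = model.score(X_train, y_train)
print(predictions, accuracy)
```

Sorting every stored point for each query keeps the sketch short; library implementations typically use tree or neighbour-index structures instead.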

Regression [1 hr]

Linear Regression

Regularization of Generalized Linear Models

Logistic Regression

Methods of threshold determination and performance measures for classification score models

Unsupervised learning [2 hrs]

Need for dimensionality reduction

Principal Component Analysis (PCA)

Difference between principal components and latent factors

Factor Analysis

Hierarchical, K-means & DBSCAN Clustering, Gaussian Mixture Models

SVD

Clustering Use Cases

Day 4

Tree Models [2 hrs]

Introduction to decision trees

Tuning tree size with cross validation

Introduction to bagging algorithm

Random Forests

Grid search and randomized grid search

ExtraTrees (Extremely Randomised Trees)

Partial dependence plots

Intro to Boosting Algorithms [1.5 hrs]

Ensemble Learning

Concept of weak learners

Introduction to boosting algorithms

Adaptive Boosting

Natural Language Processing

Tokenization [1 hr]

Regular Expressions with re module

re.search() and re.findall()

re.split()

nltk.tokenize

word_tokenize()

sent_tokenize()

non-ASCII tokenization
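
The three re functions named above can be previewed with a short standard-library sketch. The sample sentence and the patterns are assumptions chosen for illustration; they are deliberately naive and are not what nltk.tokenize does internally.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Dr. Smith arrived at 9:30. The class began! Everyone agreed."

# re.search() returns the first match object, or None if nothing matches.
first_number = re.search(r"\d+", text)

# re.findall() returns every non-overlapping match as a list of strings.
words = re.findall(r"[A-Za-z]+", text)

# re.split() cuts the string wherever the pattern matches.
# Naive sentence split: whitespace that follows '.', '!' or '?'.
chunks = re.split(r"(?<=[.!?])\s+", text)

print(first_number.group())
print(words)
print(chunks)
```

Note that the naive split treats the abbreviation "Dr." as the end of a sentence. Limitations like this are one reason dedicated tokenizers such as word_tokenize() and sent_tokenize() are covered alongside regular expressions.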

Topic Identification [1 hr]

Word counting

Introducing corpora

Gensim

Bag-of-words

Introducing TF-IDF

TF-IDF with gensim
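
Word counting, bag-of-words and TF-IDF can be sketched in a few lines of plain Python before turning to gensim. The corpus below is hypothetical, and the tf_idf helper implements one common textbook variant (relative term frequency times the natural log of inverse document frequency); gensim's own weighting and normalization defaults may differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already lowercased.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag-of-words: per-document token counts, ignoring word order.
bags = [Counter(doc.split()) for doc in corpus]

# Document frequency: number of documents that contain each token.
doc_freq = Counter(token for bag in bags for token in bag)


def tf_idf(token, bag):
    """Relative term frequency times log inverse document frequency."""
    tf = bag[token] / sum(bag.values())
    idf = math.log(len(bags) / doc_freq[token])
    return tf * idf


# Score every token of the first document.
scores = {token: tf_idf(token, bags[0]) for token in bags[0]}
print(bags[0])
print(scores)
```

In the first document, "the" occurs twice but also appears in another document, so it scores lower than "cat", which occurs once and nowhere else. That down-weighting of widespread words is the point of TF-IDF.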

Day 5

Named Entity Recognition [2 hrs]

NER with NLTK

Stanford NER library with NLTK

spaCy

spaCy vs NLTK

spaCy NER categories

polyglot: multilingual NER

Exercise: French and Spanish NER

NLTK for classification [1.5 hrs]

Feature extraction

Train and test sets

CountVectorizer

TfidfVectorizer

Exercise: fake news detector

Web Scraping [2 hrs]

BeautifulSoup module

bs4 module

prettify()

HTML tags overview: <head>, <body>, <h1>, <a href>, <title>, <p>, ...

Tag properties

DOM

Object attributes

.title

.p

.parent

.children

.name

.contents

.strings

Dict based lookup

tag['id']

Multi-valued attributes

.find()

.find_all()

.get()

Day 6 (Optional)

Distributed Applications: Hadoop & Spark

Working knowledge of the following is assumed:

Hadoop Architecture, HDFS, Map-Reduce

PySpark sub-modules: sql, streaming, ml, mllib

RDDs, DataFrames, Datasets

Pydoop [1.5 hrs]

pydoop.hdfs API

Mappers, reducers and combiners

Pipes

Record readers and writers

Partitioners and Combiners

Pydoop command line

Simulator API

Spark Data Processing Use Cases [1.5 hrs]

Graph Processing and Analysis

pyspark.ml and pyspark.mllib

Example: k-means

Spark Applications over Hadoop [2 hrs]

Spark Applications vs. Spark Shell

Importing modules on executor nodes

Complex dependencies: native code in eggs

Heterogeneous cluster complexities and solutions

--py-files and addPyFile()

Virtualenvs

ClusterSSH & ParallelSSH

Anaconda cluster

Preview: Spark SQL [1 hr]

Spark SQL and the SQLContext

Creating DataFrames

Transforming and Querying DataFrames

Saving and restoring DataFrames

De-brief [1 hr]

Suggested approaches for digging deeper

Avoiding confusions

Conquering complexity with isolation

Future references

Summary, wrap-up, Q&A [1 hr]