
# Spark Machine Learning

Amir H. Payberah (amir@sics.se)

SICS Swedish ICT, June 30, 2016

## Data → Actionable Knowledge

That is roughly the problem that Machine Learning addresses!


## Data and Knowledge

- Is this email spam or not spam?
- Is there a face in this picture?
- Should I lend money to this customer, given his spending behavior?

## Data and Knowledge

- Knowledge is not concrete:
  - Spam is an abstraction.
  - A face is an abstraction.
  - Who to lend to is an abstraction.

You do not find spam, faces, or financial advice in datasets; you just find bits!

## Knowledge Discovery from Data (KDD)

- Preprocessing
- Data mining
- Result validation

## KDD - Preprocessing

- Data cleaning
- Data integration
- Data reduction, e.g., sampling
- Data transformation, e.g., normalization

## KDD - Mining Functionalities

- Classification and regression (supervised learning)
- Clustering (unsupervised learning)
- Frequent pattern mining
- Outlier detection

## KDD - Result Validation

- The performance of the model needs to be evaluated against some criteria.
- The criteria depend on the application and its requirements.

## MLlib - Data Types

## Data Types - Local Vector

- Stored on a single machine.
- Dense and sparse representations:
  - Dense (1.0, 0.0, 3.0): [1.0, 0.0, 3.0]
  - Sparse (1.0, 0.0, 3.0): (3, [0, 2], [1.0, 3.0])

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector: every value is stored explicitly.
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector: size, indices of the non-zero entries, and their values.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Sparse vector built from a sequence of (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
```
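The sparse encoding can be checked with a small sketch in plain Scala (outside Spark; `sparseToDense` is a hypothetical helper for illustration):

```scala
// Expand a sparse representation (size, indices, values) into a dense array.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)            // start with all zeros
  for ((i, v) <- indices.zip(values)) dense(i) = v
  dense
}

// (3, [0, 2], [1.0, 3.0]) expands to [1.0, 0.0, 3.0], matching the dense form.
val dense = sparseToDense(3, Array(0, 2), Array(1.0, 3.0))
```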

## Data Types - Labeled Point

- A local vector (dense or sparse) associated with a label.
- label: the label of this data point.
- features: the feature vector of this data point.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// In MLlib: case class LabeledPoint(label: Double, features: Vector)

// A positive example with a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// A negative example with a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
```

## MLlib - Preprocessing

## Data Transformation - Normalizing Features

- To standardize features to zero mean and unit variance: (x − mean) / sqrt(variance)

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

val features = labelData.map(_.features)

// Fit a scaler that subtracts the mean and divides by the standard deviation.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)

// Transform each feature vector, keeping the labels unchanged.
val scaledData = labelData.map(lp =>
  LabeledPoint(lp.label, scaler.transform(lp.features)))
```
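The standardization formula can be sketched in plain Scala for a single feature column (illustrative only; this sketch uses the population variance, whereas Spark's StandardScaler scales by a corrected sample standard deviation):

```scala
// Standardize a feature column: subtract the mean, divide by the standard deviation.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  val std = math.sqrt(variance)
  xs.map(x => (x - mean) / std)
}

// The result has zero mean and unit variance.
val scaled = standardize(Seq(1.0, 2.0, 3.0))
```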

## MLlib - Data Mining

## Data Mining Functionalities

- Classification and regression (supervised learning)
- Clustering (unsupervised learning)
- Frequent pattern mining
- Outlier detection

## Classification and Regression (Supervised Learning)

## Supervised Learning (1/3)

- The right answers are given:
  - The training data (input data) is labeled, e.g., spam/not-spam, or the price of a stock at a given time.
- A model is prepared through a training process.
- The training process continues until the model achieves a desired level of accuracy on the training data.

## Supervised Learning (2/3)

- Face recognition: a model is trained on labeled face images (training data) and then evaluated on unseen face images (testing data).

[Figure: sample face images from the ORL dataset, AT&T Laboratories, Cambridge UK]

## Supervised Learning (3/3)

- Set of n training examples: (x1, y1), ..., (xn, yn).
- xi = ⟨xi1, xi2, ..., xim⟩ is the feature vector of the ith example.
- yi is the label of the ith example.
- A learning algorithm seeks a function f such that yi = f(xi).

## Classification vs. Regression

- Classification: the output variable takes discrete class labels.
- Regression: the output variable takes continuous values.

## Types of Classification/Regression Models in Spark

- Linear models
- Decision trees
- Naive Bayes models


## Linear Models

- Training dataset: (x1, y1), ..., (xn, yn).
- xi = ⟨xi1, xi2, ..., xim⟩
- Model the target as a function of a linear predictor applied to the input variables: yi = g(wᵀxi).
  - E.g., yi = w1xi1 + w2xi2 + ... + wmxim
- Loss function: f(w) := Σ_{i=1..n} L(g(wᵀxi), yi)
- An optimization problem: min_{w ∈ R^m} f(w)

## Linear Models - Regression (1/2)

- g(wᵀxi) = w1xi1 + w2xi2 + ... + wmxim
- Loss function: minimize the squared difference between the predicted and the actual value: L(g(wᵀxi), yi) := ½(wᵀxi − yi)²
- Minimized with gradient descent.
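Gradient descent for this squared loss can be sketched in plain Scala (outside Spark; the toy data, step size, and iteration count are made up for illustration). For L = ½(wᵀxi − yi)², the gradient with respect to w is (wᵀxi − yi)·xi:

```scala
// Gradient descent for least-squares linear regression (illustrative sketch).

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

def gradientDescent(xs: Array[Array[Double]], ys: Array[Double],
                    stepSize: Double, numIterations: Int): Array[Double] = {
  val m = xs.head.length
  var w = Array.fill(m)(0.0)
  for (_ <- 1 to numIterations) {
    // Sum the per-example gradients (w.x_i - y_i) * x_i over the dataset.
    val grad = Array.fill(m)(0.0)
    for ((x, y) <- xs.zip(ys)) {
      val err = dot(w, x) - y
      for (j <- 0 until m) grad(j) += err * x(j)
    }
    // Step against the gradient.
    w = w.zip(grad).map { case (wj, gj) => wj - stepSize * gj }
  }
  w
}

// Toy data generated by y = 2x, so the learned weight should approach 2.
val xs = Array(Array(1.0), Array(2.0), Array(3.0))
val ys = Array(2.0, 4.0, 6.0)
val w = gradientDescent(xs, ys, stepSize = 0.05, numIterations = 200)
```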

## Linear Models - Regression (2/2)

```scala
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

// Split the data into a training set (70%) and a test set (30%).
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a linear regression model with stochastic gradient descent.
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(trainingData, numIterations, stepSize)

// Pair each test label with the model's prediction for evaluation.
val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
```
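The (label, prediction) pairs can then be reduced to an error metric such as the mean squared error. A sketch in plain Scala over an in-memory sequence (on an RDD the same map-then-average pattern applies; the pairs below are made up):

```scala
// Mean squared error over (label, prediction) pairs.
def meanSquaredError(valuesAndPreds: Seq[(Double, Double)]): Double = {
  val squaredErrors = valuesAndPreds.map { case (label, pred) =>
    val err = label - pred
    err * err
  }
  squaredErrors.sum / squaredErrors.size
}

// Hypothetical pairs for illustration: errors of 0.5 each.
val pairs = Seq((2.0, 1.5), (4.0, 4.5))
val mse = meanSquaredError(pairs)  // (0.25 + 0.25) / 2 = 0.25
```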

## Linear Models - Classification (Logistic Regression) (1/2)

- Binary classification: output values between 0 and 1.
- g(wᵀx) := 1 / (1 + e^(−wᵀx)) (the sigmoid function)
- If g(wᵀxi) > 0.5, then yi = 1; otherwise yi = 0.
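The decision rule can be sketched in plain Scala (the weights and inputs below are made up for illustration; note that g(wᵀx) > 0.5 is equivalent to wᵀx > 0):

```scala
// Sigmoid maps any real score to (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def dot(w: Array[Double], x: Array[Double]): Double =
  w.zip(x).map { case (a, b) => a * b }.sum

// Predict 1 if the sigmoid of the linear score exceeds 0.5.
def predict(w: Array[Double], x: Array[Double]): Double =
  if (sigmoid(dot(w, x)) > 0.5) 1.0 else 0.0

val w = Array(1.0, -2.0)                 // hypothetical learned weights
val yPos = predict(w, Array(3.0, 0.5))   // score 2.0, sigmoid > 0.5
val yNeg = predict(w, Array(0.0, 1.0))   // score -2.0, sigmoid < 0.5
```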

## Linear Models - Classification (Logistic Regression) (2/2)

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...

// Split the data into a training set (70%) and a test set (30%).
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
```