
[2017.09.06 17:44]

 

I am building a machine learning Python package and its modules on top of TensorFlow and NumPy.

 

The goal is to test and validate input data by following the machine learning pipeline below.

Frequently used parts are factored out so that they can be reused.

 

<Custom Python package/module structure based on TensorFlow/NumPy>

- Package Name: asyncml

 

<Machine Learning Pipeline>

Raw Data -> Pre-Processing -> Training -> Prediction

                                                      |

                                                      |

                           Diagnostic(Hyper Parameter Tuning, Learning Curve, Error Metrics, etc.)

 

1. Pre-Processing

- Adding bias term

- Feature Scaling(Mean Normalization)

- Adding Polynomial Features(Fix high bias problem)

- Extract train, cross validation, test set

 

2. Data Input

- Call Tensorflow Input Reader 

 

3. Training

- Linear Regression, Multi-class Classification, ...

 

4. Diagnostic/Prediction

- Hyper parameter Tuning

- Cost/Iteration Graph

- Data set

- Polynomial Regression

- Learning Curve

- Error Metrics(Precision/Recall/Accuracy/F Score for binary and multi-class classification)

 

<asyncml, pre_process.py>

"""
 * This file is a part of asyncml package.
 *
 * (c) Jinmyung Joo <jm84.joo@gmail.com>
 *
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
"""

import numpy as np

def map_dataset(x, train_ratio=0.6, cv_ratio=0.2, test_ratio=0.2):
    """Returns the training, cross validation and test sets as arrays.

       Args: 
         x: (Mandatory) an array of input data. 
         train_ratio: (Optional) a ratio of training set in all of dataset.
         cv_ratio: (Optional) a ratio of cross validation set in all of dataset
         test_ratio: (Optional) a ratio of test set in all of dataset

       Returns: 
         train_set: an array of training set.(60%)
         cv_set: an array of cross validation set.(20%)
         test_set: an array of test set.(20%)
       """


    # Number of rows (examples) in the array
    m = np.size(x, 0)

    training_offset = int(m * train_ratio)
    cv_offset = int(m * cv_ratio)
    test_offset = int(m * test_ratio)

    if m - (training_offset+cv_offset+test_offset) > 0:
        test_offset += m - (training_offset+cv_offset+test_offset)

    # print(m,training_offset,cv_offset,test_offset)

    x_train = x[0:training_offset, :]
    x_cv = x[training_offset:training_offset+cv_offset, :]
    x_test = x[training_offset+cv_offset:training_offset+cv_offset+test_offset, :]

    return x_train, x_cv, x_test

def add_bias_term(x):
    """Returns an array which includes the bias term. 

    Args: 
      x: (Mandatory) an array of input data. 

    Returns: 
      An array including the bias term. 
    """

    # Number of rows (examples) in the array
    m = np.size(x, 0)

    # Create a column of ones with shape (m, 1).
    bias = np.ones([m, 1])

    x_bias = np.concatenate((bias, x), axis=1)

    return x_bias

def feature_scaling(x):
    """Returns a normalized x, mu and sigma.

       Args:
         x: (Mandatory) an array of input data. 

       Returns: 
         x_norm: A normalized copy of x
         mu: Mean of each feature (column)
         sigma: Standard deviation of each feature (column)
         
       """

    # Get row/column size of the array
    m = np.size(x, 0)
    n = np.size(x, 1)

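    mu = np.zeros([n])  # one entry per feature column
    sigma = np.zeros([n])
    # Work on a float copy so the caller's array is not modified in place
    # and integer inputs are not truncated.
    x_norm = np.array(x, dtype=float)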

    # tolerance
    tol = 1e-10

    for j in range(n):
        mu[j] = np.mean(x_norm[:, j])
        sigma[j] = np.std(x_norm[:, j])

        # Avoid zero variance issue
        if sigma[j] <= tol:
            sigma[j] = 1

        # Calculate normalized x using mean normalization
        for i in range(m):
            x_norm[i, j] = (x_norm[i, j] - mu[j]) / sigma[j]

    return x_norm, mu, sigma



def map_polynomial_features(x1, x2, degree):
    """Returns a new feature array with all polynomial terms up to the given degree:
       1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, ...

           Args:
             x1: (Mandatory) feature x1
             x2: (Mandatory) feature x2
             degree: (Mandatory) Degree of polynomials

           Returns: 
             x: Returns a new feature matrix with more features
             
           Note:
             Adds a column of ones to the new feature array.

    """

    # Get row size of the array
    m = np.size(x1, 0)

    # Create a column of ones.
    x_polynomials = np.ones([m, 1])

    for i  in range(1,degree+1):
        for j in range(0, i+1):
            temp = (x1**(i-j))*((x2**j))
            temp = temp.reshape(-1, 1)

            x_polynomials = np.concatenate((x_polynomials, temp), axis=1)

    return x_polynomials
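
The feature_scaling function above returns mu and sigma, which makes it possible to reuse the training-set statistics on other splits. The following is a minimal vectorized sketch of that idea, not part of asyncml: the names fit_scaler and apply_scaler and the sample arrays are assumptions made for illustration.

```python
import numpy as np

def fit_scaler(x_train, tol=1e-10):
    """Compute the per-column mean and standard deviation of the training set."""
    x_train = np.asarray(x_train, dtype=float)
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0)
    # Constant columns (for example a bias column of ones) would divide by zero.
    sigma = np.where(sigma <= tol, 1.0, sigma)
    return mu, sigma

def apply_scaler(x, mu, sigma):
    """Normalize any split with statistics learned from the training set."""
    return (np.asarray(x, dtype=float) - mu) / sigma

# Hypothetical sample data: two features, three training rows, one CV row.
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_cv = np.array([[2.0, 40.0]])

mu, sigma = fit_scaler(x_train)
x_train_norm = apply_scaler(x_train, mu, sigma)
x_cv_norm = apply_scaler(x_cv, mu, sigma)
print(x_train_norm)
print(x_cv_norm)
```

Scaling the cross validation and test sets with the training mean and standard deviation keeps all splits on the same scale. test.py below instead calls feature_scaling separately on each split, which is simpler but gives each split its own scale.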

 

<asyncml, core.py>

 
"""
 * This file is a part of asyncml package.
 *
 * (c) Jinmyung Joo <jm84.joo@gmail.com>
 *
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
"""

import tensorflow as tf

class asyncml_core:
    file_names = 0
    source_file_names = 0
    iterator = 0
    next_element = 0

    epochs = 0
    batch_size = 0
    is_shuffle = False

    num_of_columns = 0

    def prepare_parameters(self, epochs=1, batch_size=10000, is_shuffle=False):
        self.epochs = epochs
        self.batch_size = batch_size
        self.is_shuffle = is_shuffle

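    def prepare_csv_dataset(self, source_file_names, file_names=None, skip_num_of_lines=0, num_of_columns=0, shuffle_buffer_size=10000):
        # Build the placeholder at call time; a default argument would be
        # evaluated only once, when the class body is executed.
        if file_names is None:
            file_names = tf.placeholder(tf.string, shape=[None])

        self.file_names = file_names
        self.source_file_names = source_file_names
        self.num_of_columns = num_of_columns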
        dataset = tf.contrib.data.TextLineDataset(self.file_names)

        if skip_num_of_lines > 0:
            dataset = dataset.skip(skip_num_of_lines)

        if self.is_shuffle:
            dataset = dataset.shuffle(shuffle_buffer_size)

        dataset = dataset.batch(self.batch_size)

        self.iterator = dataset.make_initializable_iterator()

        record_defaults = [tf.constant([0], dtype=tf.float32)] * self.num_of_columns

        self.next_element = tf.stack(tf.decode_csv(self.iterator.get_next(), record_defaults), axis=1)

    def run(self, _func_train, _func_batch_done, _func_epoch_done, _func_done):
        current_epoch = 0
        current_iteration = 0

        init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())

        with tf.Session() as sess:
            sess.run(init_op)

            for i in range(self.epochs):
                sess.run(self.iterator.initializer, feed_dict={self.file_names: self.source_file_names})

                while True:
                    try:

                        _func_train(sess, sess.run(self.next_element), current_epoch, current_iteration, self.batch_size)

                        current_iteration += 1
                    except  tf.errors.OutOfRangeError:
                        _func_batch_done(sess, current_iteration, self.batch_size)
                        break

                current_epoch += 1

                _func_epoch_done(sess, current_epoch, current_iteration, self.batch_size)
                current_iteration = 0

            sess.run(self.iterator.initializer, feed_dict={self.file_names: self.source_file_names})
            _func_done(sess, sess.run(self.next_element), current_epoch)

        return 0
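
The control flow of run() can be looked at separately from TensorFlow. The sketch below is a pure-Python approximation under stated assumptions: run_epochs, make_batches and chunk are hypothetical names, and make_batches stands in for re-initializing the dataset iterator at the start of each epoch. Unlike run(), it does not pass a final batch to the last callback.

```python
def run_epochs(make_batches, epochs, on_batch, on_epoch_done, on_done):
    """Drive a training loop through user-supplied callbacks.

    make_batches is a zero-argument callable returning a fresh iterator of
    batches, so that every epoch starts again from the first batch.
    """
    for epoch in range(epochs):
        iteration = 0
        for batch in make_batches():
            on_batch(batch, epoch, iteration)
            iteration += 1
        # Report the 1-based epoch number and how many batches it contained.
        on_epoch_done(epoch + 1, iteration)
    on_done(epochs)

def chunk(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(10))
events = []
run_epochs(
    make_batches=lambda: chunk(rows, 4),
    epochs=2,
    on_batch=lambda batch, epoch, it: events.append(("batch", epoch, it, len(batch))),
    on_epoch_done=lambda epoch, iters: events.append(("epoch_done", epoch, iters)),
    on_done=lambda total: events.append(("done", total)),
)
for event in events:
    print(event)
```

Passing the per-batch, per-epoch and final steps as callbacks keeps the input pipeline separate from the model code, which is the same separation that test.py relies on.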

 

<asyncml, test.py> 

 
import asyncml as ml
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# number of classes
num_of_classes = 7

#feature_size = np.size(X_data, 1)
feature_size = 17
#feature_size = 28 # add polynomials

#max_epochs = 682
max_epochs = 500
batch = 10000
learning_rate = 0.1
hyper_lambda = 0



# for matplot
iter_history = np.zeros([max_epochs])
cost_history =  np.zeros([max_epochs])
accuracy_history =  np.zeros([max_epochs])

'''
Define a custom TensorFlow logic function
'''
def my_func_train(sess, data, current_epoch, current_iteration, batch_size):
    # To-Do
    '''
    train_data, cv_data, test_data = ml.map_dataset(data)

    X_data = ml.add_bias_term(train_data[:, 0:-1])
    y_data = train_data[:, [-1]]

    sess.run(optimizer, feed_dict={X: X_data, y: y_data, _lambda:hyper_lambda})

    loss, acc = sess.run([cost, accuracy], feed_dict={X: X_data, y: y_data, _lambda:hyper_lambda})

    #if current_epoch % 100 == 0:
    print("Epoch: {:5}\tIteration {:5}\tLoss: {:.3f}\tAccuracy: {:.2%}".format(current_epoch+1, current_iteration+1, loss, acc))

    # for matplot
    iter_history[current_epoch] = current_epoch+1
    cost_history[current_epoch] = loss
    accuracy_history[current_epoch] = acc
'''


    return 0

def my_func_batch_done(sess, current_iteration, batch_size):
    # To-Do

    # print("my_func_batch_done function is called.")
    return 0

def my_func_epoch_done(sess, current_epoch, current_iteration, batch_size):
    # To-Do

    # print("Done -- epoch limit reached.", "epoch=", current_epoch)
    return 0

def my_func_done(sess, data, current_epoch):
    # To-Do

#    X_data = ml.add_bias_term(data[:, 0:-1])
#    y_data = data[:, [-1]]

#    pred = sess.run(prediction, feed_dict={X: X_data})
#   for p, _y in zip(pred, y_data.flatten()):
#       print("[{}] Prediction: {}, True Y: {}".format(p == int(_y), p, int(_y)))

    '''
    hypo = sess.run(hypothesis, feed_dict={X: X_data})  # for printing probabilities in each training data

    np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

    for h in hypo:
        print("Probability: {}, Inferred Label: {}".format(h, sess.run(tf.argmax(h, 0))))
    '''
    X = tf.placeholder(tf.float32, [None, feature_size])  # add bias +1
    y = tf.placeholder(tf.int32, [None, 1])
    _lambda = tf.placeholder(tf.float32)

    # one-hot encoding
    y_one_hot = tf.one_hot(y, num_of_classes)
    y_one_hot = tf.reshape(y_one_hot, [-1, num_of_classes])


    # theta = tf.Variable(tf.random_normal([feature_size, num_of_classes]), name='weight')
    theta = tf.Variable(tf.zeros([feature_size, num_of_classes]), name='weight')

    logits = tf.matmul(X, theta)
    hypothesis = tf.nn.softmax(logits)  # Use softmax

    cost_logits = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_one_hot)
    cost = tf.reduce_mean(cost_logits) + _lambda * tf.reduce_mean(tf.square(theta))
    # optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
    optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate).minimize(cost)
    predicted = tf.argmax(hypothesis, 1)
    actual = tf.argmax(y_one_hot, 1)

    acc, acc_op = tf.metrics.accuracy(labels=actual, predictions=predicted)
    rec, rec_op = tf.metrics.recall(labels=actual, predictions=predicted)
    pre, pre_op = tf.metrics.precision(labels=actual, predictions=predicted)

    confusion_matrix_op = tf.contrib.metrics.confusion_matrix(labels=actual, predictions=predicted, num_classes=num_of_classes)

   # correct_prediction = tf.equal(predicted, actual)
   # _accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))




    train_data, cv_data, test_data = ml.map_dataset(data)

    num_of_training_data = np.size(train_data, 0)

    num_of_train = np.zeros([num_of_training_data])
    train_error = np.zeros([num_of_training_data])
    cv_error = np.zeros([num_of_training_data])

    X_data = ml.add_bias_term(train_data[:, 0:-1])
    y_data = train_data[:, [-1]]

    X_cv_data = ml.add_bias_term(cv_data[:, 0:-1])
    y_cv_data = cv_data[:, [-1]]


    # Calculate Learning Curve(Error / Number of Training Data)
    init_op = tf.global_variables_initializer()

    for i in range(num_of_training_data):
        num_of_train[i] = i+1

        # Retrain from scratch for each training-set size.
        sess.run(init_op)

        # Use the first i+1 examples; a 0:i slice would be empty when i == 0.
        for _ in range(max_epochs):
            sess.run(optimizer, feed_dict={X: X_data[0:i+1,:], y: y_data[0:i+1,:], _lambda: 0})

        train_error[i] = sess.run(cost, feed_dict={X: X_data[0:i+1,:], y: y_data[0:i+1,:], _lambda: 0})
        cv_error[i] = sess.run(cost, feed_dict={X: X_cv_data, y: y_cv_data, _lambda: 0})

    # Calculate Learning Curve(Error / Lambda)
    init_op = tf.global_variables_initializer()

    lambda_array = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
    lambda_array_size = np.size(lambda_array, 0)
    train_error_lambda = np.zeros([lambda_array_size])
    cv_error_lambda = np.zeros([lambda_array_size])

    for i in range(lambda_array_size):
        # Retrain from scratch so that each lambda is evaluated independently.
        sess.run(init_op)

        for _ in range(max_epochs):
            sess.run(optimizer, feed_dict={X: X_data, y: y_data, _lambda: lambda_array[i]})

        # Errors are measured without the regularization term (_lambda = 0).
        train_error_lambda[i] = sess.run(cost, feed_dict={X: X_data, y: y_data, _lambda: 0})
        cv_error_lambda[i] = sess.run(cost, feed_dict={X: X_cv_data, y: y_cv_data, _lambda: 0})

    # Calculate Learning Curve(Error / Degree of Polynomials)
    degree_of_poly = 8
    select_poly_feature1_index = 9
    select_poly_feature2_index = 12

    num_of_poly = np.zeros([degree_of_poly])
    train_error_degree = np.zeros([degree_of_poly])
    cv_error_degree = np.zeros([degree_of_poly])

    for i in range(degree_of_poly):
        num_of_poly[i] = i+1
        X_poly_data = ml.map_polynomial_features(X_data[:, select_poly_feature1_index],
                                                 X_data[:, select_poly_feature2_index], np.int32(num_of_poly[i]))
        X_poly_data, mu, sigma = ml.feature_scaling(X_poly_data)

        X_poly_cv_data = ml.map_polynomial_features(X_cv_data[:, select_poly_feature1_index],
                                                    X_cv_data[:, select_poly_feature2_index], np.int32(num_of_poly[i]))
        X_poly_cv_data, cv_mu, cv_sigma = ml.feature_scaling(X_poly_cv_data)

        X2 = tf.placeholder(tf.float32, [None, np.size(X_poly_data, 1)])  # add bias +1
        y2 = tf.placeholder(tf.int32, [None, 1])
        _lambda2 = tf.placeholder(tf.float32)

        # one-hot encoding
        y_one_hot2 = tf.one_hot(y2, num_of_classes)
        y_one_hot2 = tf.reshape(y_one_hot2, [-1, num_of_classes])

        # theta = tf.Variable(tf.random_normal([feature_size, num_of_classes]), name='weight')
        theta2 = tf.Variable(tf.zeros([np.size(X_poly_data, 1), num_of_classes]), name='weight2')

        logits2 = tf.matmul(X2, theta2)
        hypothesis2 = tf.nn.softmax(logits2)  # Use softmax

        cost_logits2 = tf.nn.softmax_cross_entropy_with_logits(logits=logits2, labels=y_one_hot2)
        cost2 = tf.reduce_mean(cost_logits2) + _lambda2 * tf.reduce_mean(tf.square(theta2))
        # optimizer2 = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost2)
        optimizer2 = tf.train.AdagradOptimizer(learning_rate=learning_rate).minimize(cost2)

        sess.run(tf.global_variables_initializer())

        for _ in range(max_epochs):
            sess.run(optimizer2, feed_dict={X2: X_poly_data, y2: y_data, _lambda2: 0})

        train_error_degree[i] = sess.run(cost2, feed_dict={X2: X_poly_data, y2: y_data, _lambda2: 0})
        cv_error_degree[i] = sess.run(cost2, feed_dict={X2: X_poly_cv_data, y2: y_cv_data, _lambda2: 0})


    # Calculate Precision/Recall/Accuracy
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    for _i in range(max_epochs):
       sess.run(optimizer, feed_dict={X: X_data, y: y_data, _lambda: hyper_lambda})
       #loss, acc = sess.run([cost, accuracy], feed_dict={X: X_data, y: y_data, _lambda: hyper_lambda})

    #print("Training Set > Epoch: {:5}\tLoss: {:.3f}\tAccuracy: {:.2%}".format(max_epochs, loss, acc))
    #accuracy, recall, precision, confusion_matrix = sess.run([acc_op, rec_op, pre_op, confusion_matrix_op], feed_dict={X: X_cv_data, y: y_cv_data, _lambda: hyper_lambda})
    confusion_matrix = sess.run(confusion_matrix_op,
                                                             feed_dict={X: X_cv_data, y: y_cv_data,
                                                                        _lambda: hyper_lambda})

    print("predic=", sess.run(predicted, feed_dict={X: X_cv_data, y: y_cv_data, _lambda: hyper_lambda}))
    print("actual=", sess.run(actual, feed_dict={X: X_cv_data, y: y_cv_data, _lambda: hyper_lambda}))
    print("confusion matrix=")
    print(confusion_matrix)
   # f1_score = 2*((precision*recall)/(precision+recall))
    #print("Cross Validation Set > Accuracy: {:.2%}\tPrecision: {:.2%}\tRecall: {:.2%}\tF1 Score: {:.2%}".format(accuracy, precision,recall,f1_score))

    for i in range(num_of_classes):
        TP = float(confusion_matrix[i,i])
        TP_FN = float(np.sum(confusion_matrix[i,:]))

        try:
            _recall = TP / TP_FN
        except ZeroDivisionError:
            _recall = 0

        TP_FP = float(np.sum(confusion_matrix[:, i]))

        try:
            _precision = TP / TP_FP
        except ZeroDivisionError:
            _precision = 0

        try:
            _f1_score = 2 * ((_precision * _recall) / (_precision + _recall))
        except ZeroDivisionError:
            _f1_score = 0

        print("Cross Validation Set > Class: {}\tPrecision: {:.2%}\tRecall: {:.2%}\tF1 Score: {:.2%}".format(i,_precision, _recall, _f1_score))


    #print(_tp)
    #print("TP: {:.2%}".format(_tp))
   # loss, acc = sess.run([cost, accuracy], feed_dict={X: X_cv_data, y: y_cv_data, _lambda: hyper_lambda})
    #print("Cross Validation Set > Loss: {:.3f}\tAccuracy: {:.2%}\tPrecision: {:.2%}\tRecall: {:.2%}".format(loss, acc, 1.0, 0.0))



    # for matplot
    # Calculate Learning Curve(Error / Degree of Polynomials)

    plt.figure(1)
    plt.plot(num_of_poly[0:degree_of_poly], train_error_degree[0:degree_of_poly], label='Train')
    plt.plot(num_of_poly[0:degree_of_poly], cv_error_degree[0:degree_of_poly], label='Cross Validation')
    plt.title("Learning Curve for Multinomial Classification")
    plt.xlabel("Degree of Polynomials")
    plt.ylabel("Error")
    plt.grid(True)
    plt.legend()


    # Calculate Learning Curve(Error / Number of Training Data)
    plt.figure(2)
    plt.plot(num_of_train[0:num_of_training_data], train_error[0:num_of_training_data], label='Train')
    plt.plot(num_of_train[0:num_of_training_data], cv_error[0:num_of_training_data], label='Cross Validation')
    plt.title("Learning Curve for Multinomial Classification")
    plt.xlabel("Number of Training Data")
    plt.ylabel("Error")
    plt.grid(True)
    plt.legend()

    # Learning Curve(Error / Lambda)
    plt.figure(3)
    plt.plot(lambda_array[0:lambda_array_size], train_error_lambda[0:lambda_array_size], label='Train')
    plt.plot(lambda_array[0:lambda_array_size], cv_error_lambda[0:lambda_array_size], label='Cross Validation')
    plt.title("Learning Curve for Multinomial Classification")
    plt.xlabel("Lambda")
    plt.ylabel("Error")
    plt.grid(True)
    plt.legend()


    '''
    # Accuracy / Cost Graph
    plt.figure(4)
    plt.plot(iter_history[0:current_epoch], cost_history[0:current_epoch], label='Cost')
    plt.plot(iter_history[0:current_epoch], accuracy_history[0:current_epoch], label='Accuracy')
    plt.annotate("Accuracy={:.2f}%".format(accuracy_history[current_epoch-1]*100), xy=(current_epoch, accuracy_history[current_epoch-1]), xytext=(current_epoch-(current_epoch*50/100), accuracy_history[current_epoch-1]+(accuracy_history[current_epoch-1]*25/100)), arrowprops=dict(facecolor='black', shrink=0.06), )
    plt.title("Multinomial Classification")
    plt.xlabel("Number of iterations")
    plt.ylabel("Cost / Accuracy")
    plt.grid(True)
    plt.legend()
    

    
    '''
    plt.show()

    #print("Done -- epoch limit reached.", "epoch=", current_epoch)
    return 0

'''
Create an asyncml instance and run the TensorFlow logic which is defined as a custom function. 
'''
ml_instance = ml.asyncml_core()

# num_of_columns includes the label column.
ml_instance.prepare_parameters(epochs=1, batch_size=batch, is_shuffle=False)
ml_instance.prepare_csv_dataset(["data-04-zoo.csv"], skip_num_of_lines=19, num_of_columns=17)
ml_instance.run(my_func_train, my_func_batch_done, my_func_epoch_done, my_func_done)


 

<Test Log>

Epoch:    10	Iteration     1	Loss: 1.108	Accuracy: 70.00%
Epoch:    20	Iteration     1	Loss: 0.833	Accuracy: 86.67%
Epoch:    30	Iteration     1	Loss: 0.665	Accuracy: 88.33%
Epoch:    40	Iteration     1	Loss: 0.555	Accuracy: 91.67%
Epoch:    50	Iteration     1	Loss: 0.477	Accuracy: 93.33%
Epoch:    60	Iteration     1	Loss: 0.420	Accuracy: 93.33%
Epoch:    70	Iteration     1	Loss: 0.376	Accuracy: 95.00%
Epoch:    80	Iteration     1	Loss: 0.341	Accuracy: 96.67%
Epoch:    90	Iteration     1	Loss: 0.312	Accuracy: 96.67%
Epoch:   100	Iteration     1	Loss: 0.288	Accuracy: 96.67%

predic= [3 3 3 0 0 0 0 0 0 0 0 1 5 3 0 0 3 6 1 1]
actual= [3 3 2 0 0 0 0 0 0 0 0 1 6 3 0 0 2 6 1 1]
confusion matrix=
[[10  0  0  0  0  0  0]
 [ 0  3  0  0  0  0  0]
 [ 0  0  0  2  0  0  0]
 [ 0  0  0  3  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0]
 [ 0  0  0  0  0  1  1]]
Cross Validation Set > Class: 0	Precision: 100.00%	Recall: 100.00%	F1 Score: 100.00%
Cross Validation Set > Class: 1	Precision: 100.00%	Recall: 100.00%	F1 Score: 100.00%
Cross Validation Set > Class: 2	Precision: 0.00%	Recall: 0.00%	F1 Score: 0.00%
Cross Validation Set > Class: 3	Precision: 60.00%	Recall: 100.00%	F1 Score: 75.00%
Cross Validation Set > Class: 4	Precision: 0.00%	Recall: 0.00%	F1 Score: 0.00%
Cross Validation Set > Class: 5	Precision: 0.00%	Recall: 0.00%	F1 Score: 0.00%
Cross Validation Set > Class: 6	Precision: 100.00%	Recall: 50.00%	F1 Score: 66.67%
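
The per-class figures in this log can be reproduced from the printed confusion matrix alone. The sketch below is a vectorized NumPy version of the loop in test.py; the function name per_class_metrics is an assumption, and rows are taken to be actual labels and columns predicted labels, which is how the loop reads the matrix.

```python
import numpy as np

def per_class_metrics(cm):
    """Return precision, recall and F1 per class for a square confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted_total = cm.sum(axis=0)  # TP + FP for each class
    actual_total = cm.sum(axis=1)     # TP + FN for each class

    def safe_div(num, den):
        # Treat 0/0 as 0 so classes missing from the split do not produce NaN.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, predicted_total)
    recall = safe_div(tp, actual_total)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1

# Confusion matrix copied from the test log above.
cm = [
    [10, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, 0, 0],
    [0, 0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
]

precision, recall, f1 = per_class_metrics(cm)
for k, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print("Class: {}\tPrecision: {:.2%}\tRecall: {:.2%}\tF1 Score: {:.2%}".format(k, p, r, f))
```

Classes 4 and 5 never occur in this cross validation split, so their scores are 0 by the 0/0 convention rather than because of misclassification.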

 

<Cost/Accuracy Graph>


 

 

<Learning Curve>


 

 

Work in progress...

 

---------------------------------------------------------------------------------------------------------------------------------

<Machine Learning Notes>

 

1. Diagnostic
1) Get more training examples -> fix high variance


2) Try smaller sets of features(if overfitting) -> fix high variance


3) Try getting additional features(if underfitting) -> fix high bias


4) Try adding polynomial features -> fix high bias
- Polynomial Regression

 

5) Try decreasing Lambda -> fix high bias
-  Regularization(if Underfitting-High Bias)

 

6) Try increasing Lambda -> fix high variance
-  Regularization(if Overfitting-High Variance)

 

7) Model Selection
- Select degree of polynomial
d=1, h(x) = theta0 + theta1 * x
d=2, h(x) = theta0 + theta1 * x + theta2 * x^2

 

- Dataset
Training Set: 60%
Cross Validation Set: 20%
Test Set: 20%

 

- Bias/Variance
. Underfitting-High Bias
. Overfitting-High Variance

 

x-axis: degree of polynomial
y-axis: training error, cv error
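
As a small illustration of choosing the degree with the cross validation set, the sketch below fits polynomials of increasing degree with np.polyfit and keeps the degree with the lowest cv error. The synthetic quadratic data, the seed and the helper name half_mse are assumptions made for the example; it uses one-variable regression rather than the classifier in test.py.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy quadratic.
x = rng.uniform(-3.0, 3.0, size=100)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.5, size=100)

# 60% training, 20% cross validation; the last 20% is held back as a test set.
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]

def half_mse(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    residual = np.polyval(coeffs, x) - y
    return float(np.mean(residual**2) / 2.0)

degrees = range(1, 7)
train_error, cv_error = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)
    train_error.append(half_mse(coeffs, x_train, y_train))
    cv_error.append(half_mse(coeffs, x_cv, y_cv))

best_degree = degrees[int(np.argmin(cv_error))]
for d, tr, cv in zip(degrees, train_error, cv_error):
    print("degree={}  train error={:.3f}  cv error={:.3f}".format(d, tr, cv))
print("selected degree:", best_degree)
```

Training error can only decrease as the degree grows, so the degree is chosen from the cv error, and the test set is kept for a final estimate after that choice.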

 

- Regularization
x-axis: Lambda
y-axis: training error, cv error

 

8) Learning Curve
x-axis: Training Size
y-axis: training error, cv error

 

High Bias -> getting more data will not help much
High Variance -> getting more data is likely to help
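
The high bias case can be seen in a small experiment. The sketch below computes learning curves for a straight line and for a quadratic fitted to synthetic quadratic data; the data, the seed and the helper names are assumptions made for the example, and it uses regression rather than the classifier in test.py.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): a noisy quadratic.
x = rng.uniform(-3.0, 3.0, size=120)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.5, size=120)
x_train, y_train = x[:80], y[:80]
x_cv, y_cv = x[80:], y[80:]

def half_mse(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    residual = np.polyval(coeffs, x) - y
    return float(np.mean(residual**2) / 2.0)

def learning_curve(degree, sizes):
    """Train on the first m examples for each m and record both errors."""
    train_err, cv_err = [], []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], degree)
        train_err.append(half_mse(coeffs, x_train[:m], y_train[:m]))
        cv_err.append(half_mse(coeffs, x_cv, y_cv))
    return np.array(train_err), np.array(cv_err)

sizes = list(range(10, 81, 10))
bias_train, bias_cv = learning_curve(degree=1, sizes=sizes)  # underfits the data
fit_train, fit_cv = learning_curve(degree=2, sizes=sizes)    # matches the data

for m, tr, cv in zip(sizes, bias_train, bias_cv):
    print("degree=1  m={:3d}  train error={:.2f}  cv error={:.2f}".format(m, tr, cv))
```

For the straight line, both errors stay high even with all 80 training examples, so adding data does not close the gap to the quadratic model.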

 

9) Cost/Iteration Graph, Learning Rate

x-axis: number of iterations

y-axis: cost function

 

10) Error Metrics for skewed classes
- Precision, Recall, Accuracy
- F Score

 

2. Pre-processing
1) Feature Scaling(Mean Normalization)

 

3. Support Vector Machine


4. Unsupervised Learning - K-Means Clustering


5. Principal Component Analysis(PCA)


6. Anomaly Detection


7. Recommender System


8. Learning With Large Dataset
- Batch, Mini-Batch, Stochastic Gradient Descent

 

1 epoch

- all m examples

- n batches

- k iterations

 

ex) examples = 100,000, batch size = 100, epochs = 5, iterations = 1,000 in each epoch

 

<1st epoch>

1 iteration          - 100 batch size [   ]

1 iteration          - 100 batch size [   ]

1 iteration          - 100 batch size [   ]

        .

        .

        .

1 iteration          - 100 batch size [   ]

= 1,000 iterations    = 100,000 examples

 

<2nd epoch >  

        .

        .

        .

<5th epoch>
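
The relationship between examples, batch size, iterations and epochs is plain arithmetic. The sketch below works it out for 100,000 examples, a batch size of 100 and 5 epochs; the function name iterations_per_epoch is an assumption made for the example.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of parameter updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)

num_examples = 100_000
batch_size = 100
epochs = 5

per_epoch = iterations_per_epoch(num_examples, batch_size)
total = per_epoch * epochs
print("iterations per epoch:", per_epoch)
print("total iterations:", total)

# The two extremes of the same formula:
# batch gradient descent uses all examples in one update per epoch,
# stochastic gradient descent uses one example per update.
print("batch GD:", iterations_per_epoch(num_examples, num_examples))
print("SGD:", iterations_per_epoch(num_examples, 1))
```

Rounding up with math.ceil accounts for a final, smaller batch when the number of examples is not a multiple of the batch size.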




 

* Reference: http://www.coursera.org/ > Coursera Machine Learning course

* Reference: https://www.tensorflow.org/programmers_guide/datasets

* Reference: https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html

* GitHub: https://github.com/asyncbridge/asyncml


 


 
