Machine Learning Peep
Sun 18 May 2014 by wbn

Ever since a few years ago, big data has abruptly stepped into our daily life as a new concept. Everybody, especially those starting a startup, began to talk about how to use big data to improve nearly everything we have access to. Whenever you open a web page, rate a book, or have a meal, what you do can be recorded and then contribute to a machine learning or data mining process. It is like a big black machine that collects records from every corner of the world and runs day and night processing them. When you ask it for an answer, it will say: "Hey, based on what you and others like you have done, here is a result or a list of results for you." This is the perspective of machine learning from a client's point of view.
Machine learning is such a hot topic that when there is a great introductory class, you will inevitably have to take it. (I had come across this class on Coursera many times, and after procrastinating for two years, I finally joined it this year, so you can see it really is inevitable.) It can definitely do you great good. So, after taking this class for two months, I think I can peep at some basics of machine learning and record a naive understanding here.
## What is Machine Learning
Machine learning is a branch of artificial intelligence. What machine learning does is draw a conclusion from a set of data by building a model from that data. There is a formal definition by Tom M. Mitchell:
> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
We can explain this with the example of email. Say we have a lot of emails labeled as spam or non-spam, and what we want to do is judge whether a new email is spam or not. Matching this to the definition above, we have

- T: The task is telling whether an email is spam.
- E: The experience is learning from the historical data.
- P: The performance measure is the accuracy of telling whether a new email is spam.
So unlike the traditional approach of seeking the reasons for what happened, what machine learning does is explore the correlations within the data, and then assign fresh data to the most closely related data set.
## How to perform a Machine Learning process
Machine learning is a very complicated process. Though I don't have experience with large-scale data processing, I had a lot of fun doing the exercises in the ML class, and I hope the conclusions I drew from those lessons can be generalized.
The steps of machine learning are:

0. Collect raw data

Repeat:

1. Decide the features of the data
2. Make a data model, i.e. define the hypothesis function
3. Define the cost function (straightforward, based on step 2)
4. Determine the learning algorithm
5. Learn until the cost function value is close to its minimum (maybe visualize the data to help make adjustments)
6. Get the result
If you want to do machine learning well, the most important thing is to have a large amount of data. This is the basis of your learning process. Without an adequate data set, nothing can be achieved even if you have a supposedly strong algorithm.
After you get the basic data set, you can get your hands dirty. Maybe select some features that can be easily inspected manually first, and formalize the data set based on the features you define. From then on, the data set you have is just a collection of points in an n-dimensional space, and your task is to find a good model, or function, that puts all the points on or near the model. From a mathematical point of view, the task is to train on the data to choose the parameters of your model. What the learning process does is modify the parameters to make the cost function value smaller, and after a number of iterations, you can have a decent model. Therefore, a machine learning algorithm is mostly about helping us adjust the parameters.
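The steps above can be sketched with a toy example: univariate linear regression trained by gradient descent. The data points, learning rate, and iteration count below are invented for illustration, not taken from any real data set.

```python
# Step 0-1: raw data with one feature x and target y (roughly y = 2x)
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0)]
m = len(data)

# Step 2: hypothesis h(x) = theta0 + theta1 * x
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

# Step 3: cost function J = (1/2m) * sum((h(x) - y)^2)
def cost(theta0, theta1):
    return sum((hypothesis(theta0, theta1, x) - y) ** 2
               for x, y in data) / (2 * m)

# Steps 4-5: gradient descent, repeatedly nudging the parameters
# in the direction that shrinks the cost
theta0, theta1 = 0.0, 0.0
alpha = 0.05  # learning rate (chosen by hand for this toy data)
for _ in range(2000):
    grad0 = sum(hypothesis(theta0, theta1, x) - y for x, y in data) / m
    grad1 = sum((hypothesis(theta0, theta1, x) - y) * x for x, y in data) / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

# Step 6: the learned parameters define the model
print(theta0, theta1, cost(theta0, theta1))
```

On this data the fit converges close to theta1 = 2, which matches the pattern the points were generated from.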
What machine learning focuses on are accuracy and efficiency (my understanding, which may not be complete). Efficiency can be measured by how much time the learning algorithm takes. And accuracy is checked by the F score.

The F score is calculated from the error metrics:
Precision/Recall/Accuracy

    predicted: 1, actual: 1 -- True positive (TP)
    predicted: 0, actual: 0 -- True negative (TN)
    predicted: 0, actual: 1 -- False negative (FN)
    predicted: 1, actual: 0 -- False positive (FP)

    Precision: TP / (TP + FP)
    Recall: TP / (TP + FN)
    Accuracy: (TP + TN) / (TP + TN + FP + FN)
    TP + TN + FP + FN: the whole data set

F score:

    F = 2 * (P * R) / (P + R)
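As a sketch, all four metrics can be computed directly from lists of predicted and actual labels; the labels below are invented for illustration.

```python
# Invented example labels: 1 = spam, 0 = non-spam
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
actual    = [1, 0, 0, 1, 1, 0, 1, 0]

# Count the four cells of the error metrics
tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f_score = 2 * (precision * recall) / (precision + recall)

print(precision, recall, accuracy, f_score)
```

Note that the F score combines precision and recall into one number, which is useful when the classes are skewed (e.g. very few spam emails) and accuracy alone would be misleading.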
## Learning Algorithms Learned
Methods to make learning work better:
- feature scaling: standardize the range of the data
- mean normalization: make the mean value 0
- learning rate: adjust it while learning to make the learning converge quickly
- regularization: keep the result from overfitting
- principal component analysis: reduce the size of the data, accelerate the learning process, and make the data visualizable, at the cost of some information loss
- anomaly detection: eliminate the outliers
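The first two items can be sketched together: with mean normalization, each value of a feature is mapped to (x - mean) / range, giving a column with mean 0 and values roughly within [-0.5, 0.5]. The feature values below are invented for illustration.

```python
# One invented feature column, e.g. house sizes in square meters
values = [100.0, 150.0, 200.0, 250.0, 300.0]

mean = sum(values) / len(values)           # the column mean
value_range = max(values) - min(values)    # a simple scale estimate

# Feature scaling with mean normalization: (x - mean) / range
scaled = [(x - mean) / value_range for x in values]

print(scaled)  # centered around 0, spread within about [-0.5, 0.5]
```

Scaling all features to a comparable range like this keeps one large-valued feature from dominating the cost function, which in turn lets gradient descent converge in fewer iterations.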