Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. With the rise of big data, machine learning has become a key technique for solving problems.
Machine learning uses two types of techniques:
Unsupervised learning:
- Used for unlabeled data, where we don’t know the output.
- A self-guided learning algorithm.
- Clustering technique: the aim is to use exploratory data analysis to find hidden patterns or groupings in the data.
Supervised learning:
- Used for labeled data, where the desired output is known.
- The algorithm is given training data to learn from.
- Techniques available: classification and regression.
Amazon Machine Learning:
Amazon ML is a robust machine learning platform that allows developers to train predictive models. Amazon ML creates models from supervised data sets. A model is created from a set of known observations, called training data. When setting up a new model in Amazon ML, we first need to upload our data. The data needs to be CSV-formatted, with the first row containing the name of each data field and each following row containing one data sample. Training data sets can be huge, so they need to be loaded from either Amazon S3 or Amazon Redshift.
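To make the expected layout concrete, here is a minimal sketch that builds CSV text with the field names in the first row and one sample per following row. The column names and values are hypothetical, made up only for illustration; they are not my actual dataset.

```python
import csv
import io

# Hypothetical field names and samples, used only to show the layout.
header = ["age_group", "budget", "cuisine", "rating"]
rows = [
    ["teen", "low", "continental", "dislike"],
    ["adult", "high", "italian", "like"],
]


def to_csv_text(header, rows):
    """Return CSV text whose first row holds the field names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)   # first row: name of each data field
    writer.writerows(rows)    # following rows: one data sample each
    return buffer.getvalue()


csv_text = to_csv_text(header, rows)
print(csv_text)
```

The resulting text can be saved to a file and placed in S3 with whatever upload tool you normally use.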
To test Amazon ML, I uploaded two datasets to S3. The first is customer review data, used to predict whether a customer will like a restaurant or not. The second is used to predict house prices based on previous sales.
Machine Learning Models:
Based on the uploaded data, Amazon ML automatically infers the recommended model as one of three possibilities:
- Binary classification model (logistic regression): for classifying data into two categories; for example, positive/negative movie reviews.
- Multiclass classification model (multinomial logistic regression): for classifying data into more than two categories; for example, a restaurant rating based on age, gender, cuisine type, budget, etc.
- Linear regression model: for predicting a numeric value from past observations; for example, housing prices based on previous sales.
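The choice among the three model types follows from the kind of target you ask the model to predict. The sketch below is my own simplified heuristic, not Amazon ML's actual inference logic: the function names and the rules (two distinct values means binary, all-numeric means regression, anything else means multiclass) are assumptions for illustration.

```python
def _is_number(value):
    """Return True if the value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def suggest_model_type(target_values):
    """Suggest a model type from the target column (simplified heuristic)."""
    distinct = set(target_values)
    if len(distinct) == 2:
        return "binary classification"
    if all(_is_number(v) for v in distinct):
        return "regression"
    return "multiclass classification"


# Hypothetical target columns.
print(suggest_model_type(["positive", "negative", "positive"]))
print(suggest_model_type(["like", "dislike", "neutral"]))
print(suggest_model_type([250000, 310000.5, 199000]))
```

In the console you can still review the inferred data types before the model is created, so a heuristic like this is only a mental model of what the service decides for you.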
Create Data Source:
To start Amazon ML, go to the AWS services menu, click Machine Learning, and click Get started.
- Click Dashboard > Create New > Datasource and ML model.
- Select S3 and provide the S3 location of the datasets.
- Click Verify and then Yes.
- Once verified, click Continue.
- Click Yes on “Does the first line in your CSV contain the column names?”
- Click Continue.
- Select the target variable, which is the one the model will be trained to predict.
- Select rating as the target variable for the restaurant dataset and sale price for the housing data. (I am running two separate machine learning models.)
- Leave the row identifier at its default. A row identifier helps you understand how prediction rows correspond to input data.
Create ML Model:
- Now review the selection. Based on the dataset, the AWS machine learning service automatically selects the model. Click Create ML model.
- For the customer dataset the model is multiclass, and for the second dataset the model is regression.
Evaluate an ML Model:
Creating an ML model in the previous step also involves evaluating the generated model. Amazon ML does this automatically as part of the model creation process. It splits the dataset into two parts: the first 70% of the data is used to train the model and the remaining 30% is used to evaluate it. This evaluation is critical, as it shows how well the model performs. You can use the advanced options to split the data sequentially or randomly.
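A 70/30 split can be sketched in a few lines. This is only an illustration of the idea, not the service's implementation; the function name, the `shuffle` flag, and the fixed seed are my own assumptions.

```python
import random


def split_dataset(rows, train_fraction=0.7, shuffle=False, seed=0):
    """Split rows into (training, evaluation) parts.

    shuffle=False keeps the original order (sequential split);
    shuffle=True shuffles first (random split).
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Sequential split of ten hypothetical rows.
train, evaluation = split_dataset(range(10))
print(train)
print(evaluation)
```

A random split is usually the safer choice when the rows are sorted in some meaningful way, for example by date or by price, because a sequential split would then train and evaluate on different kinds of rows.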
Let the evaluation complete.
A multiclass classification model predicts one value from a set of possible values, or classes.
Once completed, click Evaluation: ML model and explore the model performance. The performance view is called a confusion matrix. The rows represent the true values and the columns represent the predicted values. It is called a confusion matrix because it makes it easy to see whether the system is confusing two classes. The table shows the number and percentage of correct and incorrect predictions. Color also marks the correct (blue) and incorrect (yellow) values.
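To see how such a table is built, here is a small sketch that counts (true, predicted) pairs, with rows for the true class and columns for the predicted class. The labels and predictions are hypothetical, not the output of my model.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a matrix where rows are true classes and columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical evaluation results.
labels = ["like", "dislike", "neutral"]
actual = ["like", "like", "dislike", "neutral", "dislike", "like"]
predicted = ["like", "dislike", "dislike", "neutral", "like", "like"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions; every off-diagonal cell is a pair of classes the model confused.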
The F1 score measures the quality of the machine learning model. The higher the F1 score, the better the model quality.
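For reference, the F1 score of a class is the harmonic mean of its precision and recall. The sketch below averages the per-class scores (a macro average), which is how I understand the multiclass score to be reported; treat that as an assumption. The labels and predictions are hypothetical.

```python
def f1_for_class(actual, predicted, label):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(actual, predicted, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(actual, predicted, l) for l in labels) / len(labels)


# Hypothetical evaluation results.
labels = ["like", "dislike", "neutral"]
actual = ["like", "like", "dislike", "neutral", "dislike", "like"]
predicted = ["like", "dislike", "dislike", "neutral", "like", "like"]

score = macro_f1(actual, predicted, labels)
print(round(score, 4))
```

The score ranges from 0 to 1, and a model that predicts every row correctly scores exactly 1.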
My second machine learning model is a regression model, so now look at the performance of the regression model. It is measured by the RMSE (root mean square error); the lower the RMSE, the better the model quality.
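RMSE is the square root of the average squared difference between the actual and predicted values. A minimal sketch, with hypothetical prices:

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sale prices and predictions.
actual_prices = [200000, 150000, 300000]
predicted_prices = [185000, 155000, 295000]
print(rmse(actual_prices, predicted_prices))
```

Because the errors are squared before averaging, a few large misses raise the RMSE much more than many small ones, and the result is in the same unit as the target (here, the sale price).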
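Now explore the model performance. The histogram shows the distribution of the residuals, that is, the differences between the actual and predicted values; one side of the histogram holds the under-predictions and the other side the over-predictions. The diagram below shows that, with a selected bin width of 10000, the model under-predicts more often than it over-predicts.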
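The residual histogram can be reproduced by hand. In this sketch a residual is the actual value minus the predicted value, so a positive residual is an under-prediction; the function name and the sample prices are hypothetical.

```python
import math
from collections import Counter


def residual_histogram(actual, predicted, bin_width=10000):
    """Count residuals (actual - predicted) per bin, keyed by the bin's lower edge."""
    bins = Counter()
    for a, p in zip(actual, predicted):
        residual = a - p
        lower_edge = math.floor(residual / bin_width) * bin_width
        bins[lower_edge] += 1
    return dict(sorted(bins.items()))


# Hypothetical sale prices and predictions.
actual_prices = [200000, 150000, 300000]
predicted_prices = [185000, 155000, 295000]

histogram = residual_histogram(actual_prices, predicted_prices)
under = sum(1 for a, p in zip(actual_prices, predicted_prices) if a > p)
over = sum(1 for a, p in zip(actual_prices, predicted_prices) if a < p)
print(histogram)
print("under-predictions:", under, "over-predictions:", over)
```

A histogram that is roughly centered on zero suggests the errors are unbiased; one that leans to a side means the model systematically predicts too low or too high.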
Real Time Prediction:
Now let’s try real-time predictions. Enter the data for the first dataset. The predicted value is dislike. So, based on our data, a teenager with a low-budget preference would not like a high-priced continental restaurant.
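The same real-time prediction can also be requested from code rather than the console. The sketch below assumes a boto3 `machinelearning` client and its `predict` call; the model ID, endpoint URL, and record field names are placeholders I made up, so substitute your own. The client is passed in as an argument so that the function itself has no AWS dependency.

```python
def predict_realtime(client, model_id, endpoint_url, record):
    """Send one record to a real-time endpoint and return its prediction."""
    # The record is sent as a string-to-string mapping, so convert every value.
    string_record = {name: str(value) for name, value in record.items()}
    response = client.predict(
        MLModelId=model_id,
        Record=string_record,
        PredictEndpoint=endpoint_url,
    )
    return response["Prediction"]


# Usage sketch (requires boto3, AWS credentials, and a real-time endpoint):
# import boto3
# client = boto3.client("machinelearning")
# prediction = predict_realtime(
#     client,
#     "<your-ml-model-id>",          # placeholder
#     "<your-realtime-endpoint>",    # placeholder
#     {"age_group": "teen", "budget": "low"},  # hypothetical field names
# )
# print(prediction.get("predictedLabel"))
```

For a classification model the label is what you want to read; for a regression model the numeric prediction is returned in a separate field of the same response.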
Second example: The predicted value is low here because I filled in only the plot size and not the actual house dimensions. I still need to work on this dataset.
Here is the recipe for the second example. You can also supply your own recipes.
In my personal experience, the most crucial and time-consuming part of the job is defining the problem and building a meaningful dataset. Once the dataset is ready, you can easily work on machine learning models and evaluation.