From the moment ML gained its hype, many of us jumped in and started learning it. Traversing from Andrew Ng to Siraj Raval, linear algebra to deep learning and neural nets, I have seen it all, but it didn't get me anywhere. I haven't hit my eureka moment even after going through a series of tutorials and folders full of ML projects.
To hit the magic moment where it all makes sense, you have to see it working its magic in front of you, which never really happened, either to the ML models I created or to me.
Stepping stones to ML
- Get the dataset
- Preprocess: cleaning, fillNaN, LabelEncoding, OneHotEncoding
- Find whether it is a regression/classification/clustering problem
- Train/test split
- Fit: predict/classify
- Accuracy score, Confusion Matrix
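The steps above can be sketched end to end in a few lines. This is a minimal illustration using only the Python standard library, not the code from the original experiment: the toy data, the helper names (`train_test_split`, `fit`, `predict`, `confusion_matrix`) and the one-feature threshold classifier are all assumptions made for the example.

```python
import random

# Hypothetical toy data: (average rating, label) pairs, 1 = good, 0 = bad.
# These values are made up for illustration only.
data = [(4.5, 1), (4.0, 1), (3.8, 1), (4.8, 1), (4.2, 1), (3.6, 1),
        (2.0, 0), (1.5, 0), (2.5, 0), (3.0, 0), (1.0, 0), (2.8, 0)]

def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def fit(train):
    """'Train' a one-feature classifier: threshold halfway between class means."""
    good = [x for x, y in train if y == 1]
    bad = [x for x, y in train if y == 0]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2

def predict(threshold, x):
    """Classify a rating as good (1) or bad (0)."""
    return 1 if x >= threshold else 0

def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

train, test = train_test_split(data, test_fraction=0.25, seed=0)
threshold = fit(train)
y_true = [y for _, y in test]
y_pred = [predict(threshold, x) for x, _ in test]
matrix = confusion_matrix(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("threshold:", threshold)
print("confusion matrix:", matrix)
print("accuracy:", accuracy)
```

In practice a library such as scikit-learn would replace each of these helpers; they are written out by hand here only to show what every step does.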
I have done all of this over and over again on different datasets. The result? A huge list of folders of ML projects that got me nowhere.
Experts would have told me to call myself an expert, but I had other ideas.
The main problem was that the datasets were not intuitive enough, because they were not from a domain I work with closely. I was waiting for the right real-world dataset, one that I understand entirely, and I got one.
Every step from here on is a stepping stone towards my epic failure. It was all those little decisions I made along the way that led me down the path to a dead end, and to my biggest ML enlightenment.
Getting the Dataset
I had conducted a two-day Python workshop at Forge Accelerator, Coimbatore. At the end of the two days, I collected feedback from the participants. The moment I saw the collected data, I was mentally classifying the responses as good job/bad job, and I knew I had my first intuitive ML project to work on.
Mistake 1
I had only 26 students in my workshop. That means my dataset was far too small for even the most minimal ML algorithm.
Realization
Any dataset for an ML experiment should have at least 50 samples.
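One way to see why 26 rows is too few is to look at what is left after a train/test split. The sketch below assumes a 20% hold-out, which is a common default and not something stated in the original experiment; `holdout_summary` is a hypothetical helper.

```python
def holdout_summary(n_samples, test_fraction=0.2):
    """Return (train size, test size, accuracy change per single test row)."""
    n_test = max(1, round(n_samples * test_fraction))
    n_train = n_samples - n_test
    accuracy_step = 1 / n_test
    return n_train, n_test, accuracy_step

for n in (26, 50, 500):
    n_train, n_test, step = holdout_summary(n)
    print(f"{n} samples -> train {n_train}, test {n_test}, "
          f"one test row = {step:.0%} of accuracy")
```

With 26 samples the test set has only 5 rows, so a single misclassified row moves the accuracy score by 20 percentage points.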
Understanding the dataset
The next day I was sitting with the CSV file open, scrolling from left to right, and I could categorize each response as good, bad, or average. Now I wanted to see whether an ML model could do that.
Mistake 2
I was thinking of it as a clustering problem. What I completely missed was that I had already gone through enough samples of good and bad reviews to label them myself, which makes it a classification problem.
Realization
Deciding whether a use case is clustering, classification, or regression is a huge hurdle while learning ML. I now understand why all the ML courses out there put so much effort into explaining these concepts.
Preprocessing
The feedback data has two parts: numerical ratings and text reviews. For any text-based dataset, the preprocessing step involves converting the text into numerical data.
Mistake 3
I fed all the numerical ratings into a LabelEncoder and then into a OneHotEncoder, thinking that they represented a range of categories from Very Bad to Very Good.
Realization
A rating should remain a numerical value because it does not define a category; rather, it is a weight attached to the review. There is no need to one-hot encode these values.
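A small sketch of what one-hot encoding does to an ordered rating. The 1 to 5 scale and the helper names `one_hot` and `distance` are assumptions for illustration; the original feedback form's exact scale is not given.

```python
import math

# Assumed scale: 1 = Very Bad ... 5 = Very Good.
RATING_LEVELS = [1, 2, 3, 4, 5]

def one_hot(rating):
    """Encode a rating as a 0/1 vector with a single 1 at its level."""
    return [1 if rating == level else 0 for level in RATING_LEVELS]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Kept as numbers, a rating of 1 is much further from 5 than from 2.
print("numeric 1 vs 2:", distance([1], [2]))
print("numeric 1 vs 5:", distance([1], [5]))

# One-hot encoded, every pair of different ratings is equally far apart.
print("one-hot 1 vs 2:", distance(one_hot(1), one_hot(2)))
print("one-hot 1 vs 5:", distance(one_hot(1), one_hot(5)))
```

The ordering from Very Bad to Very Good is exactly the information a distance-based algorithm needs, and one-hot encoding throws it away.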
Mistake 4
For the review text, I fed it into the TextBlob sentiment classifier and converted each review to a zero or a one based on whether it was a good or a bad review.
Realization
I knew that in order to work with text-based data you have to convert it into numerical data. In this context, the numerical data should be vectors representing the review statements rather than a single score of good or bad.
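As a rough illustration of "vectors representing the review statements", here is a minimal bag-of-words sketch in plain Python. The review sentences are invented, and `tokenize` and `vectorize` are hypothetical helpers; word counts are just one simple vectorization among several possible ones.

```python
import re
from collections import Counter

# Hypothetical review sentences, not taken from the actual feedback.
reviews = [
    "great workshop, great examples",
    "too fast, hard to follow",
    "good examples but too fast",
]

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

# The vocabulary is every distinct word seen across all reviews.
vocabulary = sorted({word for review in reviews for word in tokenize(review)})

def vectorize(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [vectorize(review) for review in reviews]

print(vocabulary)
for review, vector in zip(reviews, vectors):
    print(vector, "<-", review)
```

Each review becomes a vector with one dimension per word, which keeps far more of the text than collapsing it to a single 0 or 1.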
Clustering
You can never go wrong in this step. Now that the data was ready, I fed it into a KMeans clustering algorithm with 2 clusters. After clustering, I got an output of 0's and 1's. Comparing it with the existing sentiment-analyzed data, I couldn't conclude that it was working well. So I decided to move on and visualize it to see what was going on.
Visualization
This is the step where every mistake I mentioned above came to light. I plotted a few features and the centroids, and I got an image that looks something like this.
From this picture, you can see that the X values pile up at the ends of the graph, falling at either 0 or 1, so there is no point in clustering them.
Hence I came to a most intuitive understanding of a machine learning concept:
You can't perform KMeans clustering on categorical data
Originally published at www.thegeekette.me on April 22, 2018.
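The effect can be reproduced with a small hand-written k-means on 0/1 features. This is a sketch under stated assumptions, not the original experiment: the 26 rows below are invented, and `kmeans` is a plain-Python stand-in for a library implementation.

```python
import random

# Hypothetical binary feature rows (for example a 0/1 sentiment flag and one
# one-hot column). The counts are invented; only the 0/1 shape matters.
points = [(0, 0)] * 5 + [(0, 1)] * 4 + [(1, 0)] * 6 + [(1, 1)] * 11

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct data points so the initial centroids differ.
    centroids = [list(p) for p in rng.sample(sorted(set(points)), k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

centroids, labels = kmeans(points, k=2)

print("distinct points:", sorted(set(points)))
print("centroids:", centroids)
for corner in sorted(set(points)):
    print(corner, "-> cluster", labels[points.index(corner)])
```

All 26 rows sit on just four corners of the unit square, so the algorithm can only decide which corners to lump together. The centroid coordinates it reports are averages of 0's and 1's, which do not correspond to any value a categorical feature can actually take.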
ML Experiment & Epic Failure was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.