Background
Google BigQuery lets you create and run ML models with plain SQL queries, which bridges the gap between data analysts and data scientists. As a data analyst, you don’t have to learn Python, R, or yet another ML framework or library.
A basic understanding of ML concepts is enough; with the help of SQL, data analysts can step into the complex-looking world of machine learning.
BigQuery ML supports various types of ML models, such as:
- Linear Regression
- Binary Logistic Regression
- Multiclass Logistic Regression
- K-means clustering and many more.
In this blog, we will build a binary classification model using BigQuery ML to predict whether a customer will file a travel insurance claim.
Kaggle Dataset
We will use the Travel Insurance dataset from Kaggle for this tutorial.
Download the Travel Insurance Dataset.
Load Data into BigQuery
We can upload data into BigQuery in many ways, but to keep this tutorial simple I will use the Google Cloud Console to load the data into a BigQuery table named “travel_insurance”. Enable the Auto detect checkbox so that you don’t have to define a schema for the table.
Creating Logistic Regression Model
After loading the data into the table successfully, we are ready to create our first binary logistic regression model. The syntax is pretty simple and self-explanatory.
https://medium.com/media/50c6eef1c286c0fe9f2fc62bbf0b7dd4
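For readers who cannot open the embedded snippet, here is a minimal sketch of what such a statement can look like, wrapped in a small Python helper so it is easy to reuse. The dataset name `insurance` and the model name `claim_model` are my own placeholders, and I am assuming the label column is called `claim`; adjust them to whatever your table actually contains.

```python
def create_model_sql(dataset: str, model: str, table: str, label: str = "claim") -> str:
    """Build a BigQuery ML CREATE MODEL statement for logistic regression.

    All names passed in are placeholders chosen for this sketch.
    """
    return (
        f"CREATE OR REPLACE MODEL `{dataset}.{model}`\n"
        # model_type and input_label_cols are standard BigQuery ML options
        f"OPTIONS (model_type = 'logistic_reg', input_label_cols = ['{label}']) AS\n"
        # every column other than the label is used as a feature
        f"SELECT * FROM `{dataset}.{table}`"
    )


# Hypothetical dataset and model names; the table name comes from the load step.
create_query = create_model_sql("insurance", "claim_model", "travel_insurance")
print(create_query)
```

You can paste the printed statement straight into the BigQuery query editor. Selecting every column is the simplest choice for a demo; in practice you would probably list the feature columns explicitly.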
Model Evaluation
Once the model is created, we evaluate it to judge whether it is accurate and precise enough to make predictions on our input data.
https://medium.com/media/62e3dc154dd05c21420d296b17c2bebf
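As a rough sketch, an evaluation query can be built the same way with the `ML.EVALUATE` function. The dataset and model names below are the same hypothetical placeholders (`insurance`, `claim_model`), and for demo purposes I am evaluating against the training table itself.

```python
def evaluate_model_sql(dataset: str, model: str, table: str) -> str:
    """Build an ML.EVALUATE query that scores a model against a table."""
    return (
        "SELECT *\n"
        f"FROM ML.EVALUATE(MODEL `{dataset}.{model}`,\n"
        # the subquery supplies labelled rows to score the model on
        f"  (SELECT * FROM `{dataset}.{table}`))"
    )


# Hypothetical dataset and model names, as in the rest of this post's sketches.
evaluate_query = evaluate_model_sql("insurance", "claim_model", "travel_insurance")
print(evaluate_query)
```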
Running the query above returns a row with several evaluation metrics for the logistic regression model:
- precision
- recall
- accuracy
- f1_score
- log_loss
- roc_auc
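If these metric names are new to you, the small pure-Python sketch below shows how most of them are conventionally computed from true labels and predicted probabilities. The six labels and probabilities are made-up toy values, not results from the travel insurance data, and the 0.5 threshold is simply the choice I made for this sketch.

```python
import math


def binary_metrics(labels, probs, threshold=0.5):
    """Compute common binary classification metrics from labels (0/1) and probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(labels)
    denom = precision + recall
    f1_score = 2 * precision * recall / denom if denom else 0.0

    # Clip probabilities so log(0) can never occur.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, clipped)
    ) / len(labels)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1_score,
        "log_loss": log_loss,
    }


# Toy data invented for illustration: 1 = claim filed, 0 = no claim.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_probs = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
metrics = binary_metrics(toy_labels, toy_probs)
print(metrics)
```

Claims are usually rare in insurance data, so accuracy alone can look high even for a poor model; precision and recall are the more informative columns to watch.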
Model Prediction
Once we are happy with the evaluation results, we can run our test data against the model to classify customers based on whether they will file a claim or not. For demo purposes we use the same input data that we used for training, but in reality separate test data should be run against the trained model.
https://medium.com/media/d5a250b3d332db8c2b88bfd3221c93c6
Running the query above adds the predicted_claim, predicted_claim_probs.label and predicted_claim_probs.prob columns to the result table. These columns give the predicted label and the probability of each label, that is, how likely a customer is to file a claim.
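A prediction query can be sketched with the `ML.PREDICT` function, and the helper below shows one way to read the probability of a given label out of a result row. The dataset and model names are hypothetical placeholders, the sample row is hand-written rather than real output, and the label values "Yes" and "No" are an assumption about how the claim column was loaded.

```python
def predict_sql(dataset: str, model: str, table: str) -> str:
    """Build an ML.PREDICT query that scores every row of a table."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{dataset}.{model}`,\n"
        f"  (SELECT * FROM `{dataset}.{table}`))"
    )


def prob_for_label(row: dict, label, label_col: str = "claim"):
    """Return the probability attached to `label` in a prediction row, or None."""
    # The probs column is an array of {label, prob} entries, one per class.
    for entry in row[f"predicted_{label_col}_probs"]:
        if entry["label"] == label:
            return entry["prob"]
    return None


# Hypothetical dataset and model names.
predict_query = predict_sql("insurance", "claim_model", "travel_insurance")
print(predict_query)

# Hand-written example row (not real output); label values are assumed.
sample_row = {
    "predicted_claim": "No",
    "predicted_claim_probs": [
        {"label": "Yes", "prob": 0.12},
        {"label": "No", "prob": 0.88},
    ],
}
claim_prob = prob_for_label(sample_row, "Yes")
print(claim_prob)
```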
Conclusion
BigQuery ML is narrowing the gap between data analysts and data scientists. In my opinion, it’s a great effort from BigQuery to put the power of machine learning in the hands of data analysts, who often understand the data much better but, lacking ML expertise, cannot apply machine learning principles to it and have to depend on a data scientist.
Let me know what you think about BigQuery ML! ✌️
Happy Analyzing!