As a Data Scientist, I fall directly into the category of programmer who, at least once a week, uses the phrase ‘sure, but… I’m not a Data Engineer’. Back in the days when CSVs, Jupyter notebooks and ‘Titanic’-style datasets from Kaggle were enough to satisfy Data Science needs, this response was pretty valid. Sure enough, that basic stack proved entirely sufficient for some pretty exciting Machine Learning and analysis.
Yet here we are, in 2020, with access to colossal datasets, most of which, to my horror, don’t contain a CSV in sight. Of course, we Data Scientists have had to adapt, and we now regularly ingest from a wealth of data sources, such as database exports, API connections and web scrapes. However, we are still learning how to take this one step further. Modern-day Data Science value comes not only from advanced modelling, but also from usability, scalability and dynamic outputs: you guessed it, that means models in production.
Enter Spark, Platform-as-a-Service (PaaS) cloud providers, and now the newest tool in the stack combining the two: Databricks. Databricks (founded in 2013) provides a unified platform approach to data analytics through Spark and associated tools, solving the age-old problem of combining big data with traditional data science methods: for people like me, who now desperately need to work with big data, but perhaps lack the skillset for handling large-scale data engineering tasks.
I was fortunate to be able to attend a training course, organised by the absolutely wonderful Women in Data initiative and hosted by Databricks themselves, to give us an introduction to the user interface and a peek at what happens behind the scenes.
TL;DR SUMMARY
Databricks provides a seamless, accessible interface to the world of Spark for Data Scientists
It hooks up your PaaS provider backend to a sleek user environment, so that you can store and process large datasets in the cloud without having to fiddle with clusters and configurations behind the scenes
This in turn enables easy access to, exploration of, and machine learning model training on big datasets, in a way that doesn’t necessarily require knowledge of advanced concepts such as in-database wrangling or code parallelisation
The workspace replicates familiar data science concepts such as notebooks, dataframes and wrangling functions
An additional benefit for Data Scientists is access to a scheduling function, whereby notebooks can be run as jobs on a regular cadence
A huge thank you to the Women in Data initiative and Databricks for organising a hugely informative and digestible session. Find out more about WiD here and Databricks here.