Get started with machine learning on Azure Databricks

Building machine learning models at scale has not always been easy. Azure Databricks aims to make it easier by providing a single platform using Apache Spark on Azure.

Let us look at three typical ways to build models on Databricks.

  1. Run machine learning jobs on the driver node
  2. Scale out with MLlib
  3. Use the databricks Runtime ML

In this blog post I will use Python and run the code in Databricks’ own notebooks (which are like Jupyter notebooks). Databricks also supports SQL, Scala, and R. One great feature is that the notebooks can be multi language.

Run machine learning jobs on a single node

A Databricks cluster has one driver node and one or more worker nodes. The Databricks runtime includes common used Python libraries, such as scikit-learn. However, they do not distribute their algorithms.

Running a ML job only on the driver might not be what we are looking for. It is not distributed and we could as well run it on our computer or in a Data Science Virtual Machine. However, some machine learning tasks can still take advantage of distributed computation and it a good way to take an existing single-node workflow and transition it to a distributed workflow.

This great example notebooks that uses scikit-learn shows how this is done.

Scalable Machine Learning with MLlib

For distributed machine learning workflows we can use Spark’s machine learning library MLlib. The library includes the most common algorithms for regression, classification, clustering, and collaborative filtering. It also has algorithms for working with features.

MLlib uses a ML Pipelines API to build machine learning workflows. The pipeline concept in MLlib is like the one found in scikit-learn. A dataset in MLlib is a DataFrame from Spark SQL.

A pipeline is built of Transformers and Estimators. Transformers take a DataFrame and converts it into another. Feature transformation and scoring a learned model are Transformers. Estimators are algorithms that can be fit on a DataFrame to produce a Transformer. For detailed information see the pipeline components in the MLlib documentation.

Let us look at one of the example notebooks that uses ML Pipelines predict bike rental counts per hour and how we can build a pipeline.

The pipeline looks like this:

Source: GBT notebook

First, we need to assemble the features into a feature vector using the VectorAssembler Transformer. Thereafter a VectorIndexer Transformer identify which columns should be treated as categorical and index them. MLlib has both Transformers built in.

from import VectorAssembler, VectorIndexer 
featuresCols = df.columns 
# This concatenates all feature columns into a single feature 
# vector in a new column "rawFeatures". 
vectorAssembler = VectorAssembler(inputCols=featuresCols 
  , outputCol="rawFeatures") 
# This identifies categorical features and indexes them. 
vectorIndexer = VectorIndexer(inputCol="rawFeatures" 
  , outputCol="features", maxCategories=4) 

For training the model we use the Gradient-Boosted Trees (GBT) algorithm.

from import GBTRegressor 
# Takes the "features" column and learns to predict "cnt" 
gbt = GBTRegressor(labelCol="cnt") 

The next step is use Spark’s cross validation framework and use it for hyper-parameter tuning. We wrap the model training stage within a CrossValidator stage.

from import CrossValidator, ParamGridBuilder 
from import RegressionEvaluator 
# Define a grid of hyper-parameters to test: 
#  - maxDepth: max depth of each decision tree in the GBT 
#    ensemble 
#  - maxIter: iterations, i.e., number of trees in each GBT 
#    ensemble 
paramGrid = ParamGridBuilder()\ 
  .addGrid(gbt.maxDepth, [2, 5])\ 
  .addGrid(gbt.maxIter, [10, 100])\ 
# We define an evaluation metric.  This tells CrossValidator 
# how well we are doing by comparing the true labels with 
# predictions. 
evaluator = RegressionEvaluator(metricName="rmse" 
  , labelCol=gbt.getLabelCol()
  , predictionCol=gbt.getPredictionCol()) 
# Declare the CrossValidator, which runs model tuning for us. 
cv = CrossValidator(estimator=gbt, evaluator=evaluator 
  , estimatorParamMaps=paramGrid) 

We can now assemble the pipeline. The pipeline is an Estimator.

from import Pipeline 
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv]) 

The last step is to train the model by calling fit() on the pipeline with our train data set. After running fit() the pipeline produces pipelineModel which is a Transformer.

pipelineModel = 

We can now score pipelineModel and make predictions on our test data set and compute RMSE to tell us how well our model makes predictions.

predictions = pipelineModel.transform(test) 
rmse = evaluator.evaluate(predictions) 
print "RMSE on our test set: %g" % rmse 

The full notebook shows the details on how this pipeline was built.

Databricks Runtime ML

The latest Databricks Runtime 4.3 has a number of Python, R, Java, and Scala libraries installed. Another runtime is the Databricks Runtime ML which is built on the Databricks Runtime but also includes several libraries and frameworks useful for machine learning and deep learning, such as TensorFlow, Keras, and XGBoost. It also has support for distributed TensorFlow training using Horovod.

Databricks Runtime ML is in Beta.

Where to begin?

We have seen three ways of getting started with machine learning on Azure Databricks. For existing workflows, running them on the driver node is a good options. If you are building a new model, using MLlib is a better option as it can take advantage of the distributed compute Spark has. Databricks Runtime ML is still in Beta and the documentation is sparse, so unless you already are familiar with, for example, distributed TensorFlow training with Horovod it would not be the place to start.

Get started with machine learning on Azure Databricks

Microsoft AI Developer Workshops in South Africa

I will travel to South Africa in August and September to teach two hands-on AI Developer workshops in Johannesburg and Cape Town on AI, Azure Search, Bot Service, and Cognitive Services.

At the end of the workshop you will be able to:

  • Understand how to configure your apps to call Cognitive Services
  • Build an application that calls various Cognitive Services APIs (specifically Computer Vision)
  • Understand how to carry out Azure Search features to give a positive search experience inside applications
  • Configure an Azure Search service to extend your data to enable full-text, language-aware search
  • Build, train, and publish an LUIS model to help your bot communicate
  • Build an intelligent bot using Microsoft Bot Framework that leverages LUIS and Azure Search
  • Log chat conversations in your bot
  • do rapid development/testing with Ngrok and test your bots with unit tests and direct bot communication
  • Leverage the custom vision service to create image classification services

Sounds interesting? Read more about the workshops and sign up here:

I will also speak at the free SQLSaturday Johannesburg and SQLSaturday Cape Town on 1 September and 8 September, where I will give the talk Introduction to Machine Learning.

Hope to see you there!

Microsoft AI Developer Workshops in South Africa

Upcoming AI talks and workshops in Denmark

I will travel to Denmark to give a few talks and workshops on the topics of AI, machine learning, and deep learning.

  • 29-30 May – Intelligent Cloud Conference in Copenhagen, where I will give two talks: Introduction to deep learning and Microsoft Cognitive Framework and AI in the cloud: Machine learning, cognitive services, and bots.
  • 5 October – Emerging AI developer in Lyngby, Denmark.

Hope to see you there!

Upcoming AI talks and workshops in Denmark

Integration Services (SSIS) internals and performance tuning

Back in 2014, I created a 214-page deck for a pre-con at SQLSaturday Johannesburg and Cape Town. Since then I’ve presented it at a good number of large and small conferences.

I don’t often work with SSIS anymore, so I thought I’d publish the entire deck here.

The material is quite text-heavy, so it should be readable. I compiled the deck from several sources (see the last four slides). Also, please note it doesn’t cover SSIS 2016 and the buffer optimizations that came in that version.

Integration Services (SSIS) internals and performance tuning

SQL Server Performance Troubleshooting Free Scripts and Tools List

Back in the days, I used to collect a lot of different scripts, tools, and other goodies for troubleshooting SQL Server performance issues. These days, however, I use what is available for free and keep a list of those in my head. 

I’ve meant to write that list down for a while, and today Chrissy asked: 

So here it is… 

Disclaimer: While I do work as a Premier Field Engineer for Microsoft, this is my list – this is not an official list from my employer. 

Free scripts and tools from Microsoft

These are scripts and tools provided for free by Microsoft that I use. Some of them come “as-is” with no official support.

  • BPCheck – This is by far my favorite script and one I use all the time. I like to joke that there are one good thing and one bad thing about this script. The good thing is that it gives me a lot of information. The bad thing is that it gives me a lot of information. It can overwhelm the first time you run it, but once you have used it a few times, it is handy.
  • Pssdiag/Sqldiag Manager – This tool creates a customized pssdiag utility that uses sqldiag to collect diagnostics data, such as profiler trace, extended events, perfmon and DMV information. I use this all the time for data collection used to troubleshoot performance problems. There is also version for Linux available (see this blog post, for how to use the Linux version).
  • SQL Nexus – Loads and analyzes performance data collected by Pssdiag. It loads the data into a database and does reporting.
  • PAL – The Performance Analysis of Logs (PAL) tool reads a perfmon counter log and analyzes it using different thresholds. It generates an HTML based report (or XML, if you prefer), which shows which counters are interesting. It has templates for SQL Server workloads but works with a lot of other workloads too.
  • RML – The Replay Markup Language (RML) utilities can analyze profiler traces or replay the trace file against another instance of SQL Server. For how to use it, read this blog post.
  • WinDbg – Windows Debugger. Useful for debugging a memory dump.
  • Tigertoolbox – This is the repository for the SQL Server Tiger Team and contains a lot of good stuff. Maybe you need some SQL Performance Baseline reports, fix your VLFs, or view the waits and latches. It is also the home of BPCheck.
  • diskspd – Storage load generator / performance test tool. It replaces SQLIO. I use this if I want to benchmark the IO subsystem of a new machine / VM. There are two good blog posts about how to use diskspd here and here.
  • SQL Server Management Studio – SSMS is now a standalone application (it used to ship with the SQL Server installation) and got quite a few improvements in the newer versions.
  • sysinternals – Tools for advanced troubleshooting on Windows, which I use from time to time.
  • SQL Server Data Tools – Not a tool I use for performance troubleshooting, but it is in my toolbox for when I have to look at an SSIS package.
  • Visual Studio Code – My go-to editor for… everything.

Free scripts, tools, and resources from the community

The awesome SQL Server community has made a lot of things available.

  • SQL Server Diagnostic Information Queries – Before I discovered Glenn Berry‘s DMV queries, I used to collect scripts myself. Not anymore. Glenn updates his queries every month. While I prefer to use BPCheck to get information from the DMVs, I find that Glenn’s scripts are better structured and easier to use.
  • sp_whoisactive – Think sp_who2… but much better. This script by Adam Machanic shows which statements are being executed and can include the estimated execution plan, acquired locks, tempdb resource allocation, and more.
  • sp_helpindex rewrite – sp_helpindex that ships with SQL Server do not show newer functionality (such as included columns and filters), so Kimberly Tripp rewrote it to include this information. Also, be sure to read her blog post on indexing strategies.
  • Ola Hallengren’s SQL Server Maintenance Solution – Ola’s scripts are great, easy to use, and make life easier for a lot of DBAs around the world. The index and statistics maintenance part of the script is an easy way to reduce index fragmentation and make sure your statistics are up to date.
  • Plan Explorer – While SSMS can show the execution plan just fine, Plan Explorer makes it a little easier to read.
  • Statistics Parser – Parses the output of SET STATISTICS IO and SET STATISTICS TIME and makes it a lot easier to read and analyze.
  • Poor SQL – Ever got a SQL query that is just poor formatted or just one line? This site can format it to something readable.
  • dbatools – I don’t use these PowerShell modules for performance troubleshooting, but they are useful for checking best practices and other DBA tasks.
  • SQL Server Version List – I use this up-to-date list of SQL Server versions to check if any service packs or commutative updates are missing.
  • SQL Server Wait Types Library and Latches Classes Library – While I remember the most common wait types and latches classes I see all the time, I sometimes run into one I haven’t seen before or can’t remember what is. Paul Randal‘s libraries are handy resources to look up those.

That’s it! If you have a free tool or script you use (or wrote yourself), leave a comment – would love to check it out!

SQL Server Performance Troubleshooting Free Scripts and Tools List

Data movement when partitioning existing table in SQL Server

The other day, Jørgen Guldmann (@jorgenguldmann) asked if data would move if we would partition an existing table. The table was large, billions of rows, so moving data could take a long time. He wanted all the existing table in one partition, and then new data loaded to be placed in new partitions. I have seen this with some Data Warehouses, where the partitioning strategy for the fact tables was designed and applied after the database had been put into production and been running for a while. So, I thought to test it out to make sure I would provide a right answer. 

In one of my previous blog posts, I wrote about when a partition split move data. So the question is, would we see the same behavior where we see LOB_INSERT_ROWS and LOB_DELETE_ROWS entries in the transaction log? 

tl;dr: We don’t see data movement with LOB_INSERT_ROWS and LOB_DELETE_ROWS entries in the transaction log. However, the creation of the clustered index copy the data pages from the old structure (heap or existing clustered index) to the new structure – even if it is on the same file group. So yes, data will move and it can have a significant performance impact. 

In my test, I have a small-ish heap called Facts.Sales, with 1,000,000 rows, in a database called PartitonTest. I have then created a right range partition function SalesPF on the TransactionDate column and a partition scheme SalesPS, which are using the same file group as the heap. The partitioning function would have the boundary value right of the existing values in TransactionDate, making sure that all existing data will end up in the first partition. 

To partition the existing table, we need to create a clustered index on the heap. If the table were a clustered index, we would need to do the same thing using WITH (DROP_EXISTING = ON)

However, before we create the clustered index, and partition our table, let’s look at the data pages in our heap by using DBCC IND:

DBCC IND ('PartitionTest', 'Facts.Sales', 1);

The output shows that the data pages contain the data between page 8 and 3131. 

Now, let’s create the clustered index on the partition scheme SalesPS:

ON Facts.Sales(SalesID, TransactionDate)
ON SalesPS(TransactionDate);

After this, DBCC IND shows us that the data has been moved to the data pages between page 24,600 and page 27,599. So, data has been moved. If we look at the Books Online article for index disk space, it states that:

Whenever an index is created, rebuilt, or dropped, disk space for both the old (source) and new (target) structures is required in their appropriate files and file groups. The old structure is not deallocated until the index creation transaction commits. Additional temporary disk space for sorting operations may also be needed.

This aligns with what we saw in the above test. 

If we look at the transaction log using fn_dblog for the entries related to the transaction for the clustered index creation, we only get 3,123 entries. These entries are related to metadata operations and extent allocation. This is a minimal footprint on the transaction log, compared to the when data is moved due to a partition split. I reran the test where data is placed in several partitions, and the result was the same. That being said, large-scale index operations can generate large data loads that can cause the transaction log to fill quickly

In conclusion, when designing Data Warehouses and large fact tables, it is a good idea to design the partition strategy upfront rather than apply it after the database is already in production and the fact table contains a significant amount of data. If applied afterward, partitioning an existing table will cause data movement when creating the clustered index and have a significant performance impact.

Data movement when partitioning existing table in SQL Server

First time speaking at a large tech conference: How I prepared

This is the story of how I prepared for the talk at my first large tech conference and ended up getting a great speaker evaluation.

PASS Summit 2014 was the first time I was speaking at a large tech conference. While I have talked at over 30 conferences over the past two years, some of them with several hundreds of participants, this one was different; PASS Summit had 5,900 registrations. With 233 attendees the room felt packed. A few people were standing in the back and sitting on the floor. 

PASS Summit 2014 talk 

The title of the talk was Integration Services (SSIS) for the DBA. I had already delivered it at SQLBits XII earlier the year. While my speaker evaluation from SQLBits was excellent, I was not happy with the talk I already had. Therefore, I restructured it. 

And it paid off. 

My speaker evaluation score was well above average, and one of the best I have ever got. Some comments in the evaluation were:

AWESOME presentation 
The session was very well balanced. A great amount of slides and demos. 
Fantastic, well-prepared session, with great demos incl. Game of Thrones demo!! 
Explanation of version differences, in particular, was enlightening. Thanks. 
Some very useful management techniques learned David was extremely engaging and made the topic fun for everyone in the session.

Structuring the presentation

A good structure is everything. When I wrote my dissertation at the University of Bradford, my supervisor introduced me to the following structure:

  1. What?
  2. How?
  3. So what?

Another structure is the three-act structure, which creative writing, screenwriting, and storytelling use.

I used the structure I had learned during my studies and ended up with this:

  1. What is SQL Server Integration Services (SSIS)?
  2. How do I deploy, configure, and execute SSIS packages?
  3. How do I monitor and troubleshoot SSIS packages?

The last question is not a “so what?”. But I thought it would still fit into that category. At least close enough. 

I also planned a five minutes introduction, where I introduced myself, the topic, who the talk was for (I wanted to make sure that people knew it was a level 200 DBA focused session, and not a deep dive into the topic from a developer point of view), and what they would learn. 

Before I began the work on slides or demos, I wrote a detailed plan in Evernote. I find that I am much faster creating slides and demos this way. Writing things down in plain text makes it easier for me to think about the content and structure than if I was doing it in PowerPoint.


I am not a big fan of slides with a lot of text. I know people can pull it off, like Kimberly Tripp and Paul Randal, but most people can’t; myself included. I create tech-rich slides for my workshop material, because it is a whole day training, and it is nice for the audience to go back and go through the notes. However, not for short talks. 

Instead, I prefer a big font with just a few words. I am a big fan of Zack Holman’s slide decks and would like to be better at creating beautiful slide decks. 

The slides would follow the structure I had written in Evernote. 

PowerPoint is the de facto slide deck tool in the Windows world where I live. I have created slide decks on my Mac in Keynote and Deckset, which both are great options. When using PowerPoint, I change the default settings:

  • Disable the presenter view. When using presenter view the laptop goes from mirroring the screen to extending it. When I jump out of the presentation to do a demo, the computer goes back to mirroring the screen and the projector will need to readjust. Best case, this gives a few seconds of a black screen and a flickering display for another second or two. Worst case, it messes up the resolution. Don’t use it. Use speaker notes on a separate device (iPad, Surface, a piece of paper) instead.
  • Disable the Slide Show Popup Toolbar. I know others who like to draw on their slides and should keep this. I don’t and don’t want it to show up if I move my mouse.
  • Disable the mouse cursor during the presentation. I don’t need it, and it is an annoyance if I hit my mouse or track pad.


My demos took place in SQL Server Management Studio and SQL Server Data Tools. I tried to use geekiness and humor in the demos rather than just using a standard demo database. For example, I got hold of a Game of Thrones data set and had Dr. Who quotes in my data flow. It spiced things up and worked well. 

The demos should be sophisticated enough to prove your point, but simple enough for the audience to follow. I prefer to create more straightforward demos. Fewer things can go wrong. 

In my first demo, I began the design of a package from scratch. While Buck Woody (blog) says “never type in demos,” I like the life it brings to it. So, I drag-and-dropped my first demo; but only halfway. Then I brought in a finished package, so the audience did not have to watch me drag-and-drop for 15 minutes. It worked very well and much better than if I had only used the finished package to show how SSIS works. 

The rest of my demos were pre-scripted in T-SQL. 

Other things I did was:

  • Have a reset strategy. I could reset my demo in about 30 seconds in case something went wrong.
  • I had configured SQL Server Management Studio to use a larger font size, displaying the results in a separate tab, include the query in the result set, and added line numbers. Size 16 is what I normally use, but the resolution was higher than they had told me, and it was hard to read from the back of the room. So I bumped it up to 18 when I did my tech check. Other settings I changed: Enable lined numbers and show query results in a new tab.
  • ZoomIt is the preferred tool by many presenters, but it had things that bothered me. It did not feel smooth. Instead, I switched to using Windows 8.1’s built-in magnifier, which can do multi-level zoom while still demoing. 
  • I considered using bigger fonts in Visual Studio as well but settled for using the magnifier. I think it worked better.

Be Utterly Prepared (No excuses)

As Scott Hanselman writes in a blog post about being prepared for absolute chaos at technical presentations: “Be Utterly Prepared (No excuses)“. 

For every talk I give, I prepare. This time I took it to another level. I put in at least 80 hours into this one hour and 15 minutes talk. 

And again, it paid off. Everything ran smooth. Here are comments from the speaker evaluation:

Presenter was very knowledgeable on the subject. He kept the steady pace throughout the session and kept it interesting. The session was awesome. Very interactive. He kept the audience awake and engaged in the presentation. He was relaxed but at the same time very focused. He presented an enormous amount of information but in a very easy to follow and understand way. 
The audience was kept well engaged- excellent use of humor and great slides. Excellent! You really made a connection with your DBA audience! 
Great personality. 

He enjoys his work & it shows in his lectures. He had a really good connection with the crowd 
Speaker had great energy and was very passionate and knowledgeable about the material. Would have bought whatever had he been a sales guy. Great job. 
Very well done presentation; good match of material to the time allotted. The pace was right on. Nice job! 
Awesome sample data. We love the disarming, personal touch. The speaker was very knowledgeable of the content and did an excellent job repeating questions and answering them. 
David was extremely engaging and made the topic fun for everyone in the session. Loved it!

Here are a few things I did to be utterly prepared:

  • I wrote detailed speaker notes in Evernote. Not every sentence I wanted to say, but keywords and phrases in a bullet point format. And then I had the speaker notes on my iPad on stage. Nobody seemed to notice it, as I glanced on it from time to time, and it helped me not forget anything. 
  • Backup speaker notes. Yes, even with my speaker notes on my iPad, I had a few details I kept forgetting when doing my dry runs. I wrote them down on a few pieces of paper, in large writing, and placed them on the table on stage.
  • I practiced. And then I practiced more. Until I knew my talk inside out. Before the presentation, I always do a few dry runs. But this time I did it significant more times than usual. I am not great at winging it, because I believe structure is so fundamental to a great talk. I would feel like Speaker 47 if I did.
  • My demos worked. But oh my, did I have a lot of demo fails during my trial runs. Minor stuff, but still something that annoyed me. Trying them out repeatedly helped me get rid of those glitches and made everything run very smooth.
  • I ran my demos inside VMware Workstation. One benefit of using a VM is that it is easier to move the demos to a backup computer. 
  • I brought not one, but two backup laptops. Both laptops had an identical setup (slides and VM for demos) as the primary laptop. And am I happy I did! Because both failed. But my primary laptop worked fine.
  • Time management is very important. To ensure I would not run out of time, I created few additional demos. The extra demos were essential to the talk, in case I didn’t have time for them. Having a watch that shows the remaining time is important to control the pace the talk. One of the best things I’ve bought was a Logitech Professional Presenter R800, which has a build in LCD display that shows the remaining time.
  • At the end of my talk I didn’t take any questions. It is better to end the talk on a high. This was a tip I learned from Scott Hanselman. Instead, I took questions throughout the session. I was happy with this decision. When other presenters taking questions at the end of their session, the audience walked out during the Q and A. Not a few people, but the majority. Instead, my talk ended with a link to the slides and demos, a “thank you so much for attending and have a great conference,” and an applaud from the audience.

On the day

One of the most important things I do when I arrive at a conference is a tech check. While PASS Summit had a projector in the speaker ready room, I went during the lunch break to the room I had to present in and did a tech check. There was trouble with the projector. The result was skewed screen resolution or the screen cut off, which showed in many other presentations. I got a technician to look at it and got it fixed within 5 minutes. 

Don’t do the tech check right before the talk. Never. Things go wrong all the time. I have learned this the hard way. To avoid technical failure, I have over time collected a big bag with different cables, converters, USB sticks, batteries, an extra mouse and extra presenter. 

During my tech check, I went to the back of the room several times and made sure everything looked great.

  • Is the resolution correct? Is the image not stretched? I had trouble here and needed to get a technician. The problem was that my laptop couldn’t figure out the correct resolution.
  • Is part of the image cut off on the projector? This happens more often than not and also happened here. After changing the resolution and pressing the auto adjust button on the projector, everything worked.
  • Big enough font on the slides? Looked good.
  • Did the colors look good? Are they readable? Are the photos and other graphical elements sharp and easy to see? Check, check, check.
  • Big enough font in the demos? No, so I bumped it up.
  • Did the magnifier work? Yes.
  • Was the Logitech presenter and mouse working and charged? Yes.
  • Was the laptop charged and plugged in. Always check if it is charging – I have tried to run out of battery mid-talk.

During the talk

I always try to start right on time. I am Danish that way. And end on time. Sometimes I small talk with the audience before the talk, sometimes I don’t. If I do, I am being a little silly or telling a funny story. 

I followed the prepared script close, but still allowed time for questions throughout the talk and going off-script here and there. 

I had a lot of fun. The first few minutes are always the hardest. Will they like it? Is everything working fine, or will the laptop blow up (or just run out of battery)? After that, I am having a blast. One of the best comments I got was that I take my personality with me onto the stage. I try to do that as I want to show people I am having fun doing this. 

To sum up, I try to not waste everybody’s time for an hour and 15 minutes by being prepared.

Further reading:

A few blog posts and online courses I recommend anyone serious about public speaking at tech conferences. They helped me a lot.

First time speaking at a large tech conference: How I prepared