Is Your Research Software Correct?

Mike Croucher

MathWorks

Twitter: @walkingrandomly
LinkedIn: https://www.linkedin.com/in/mike-croucher-32336113/

Imagine...

Your results are amazing!

but wrong

Mike Konczal

“all I can hope is that future historians note that one of the core empirical points providing the intellectual foundation for the global move to austerity in the early 2010s was based on someone accidentally not updating a row formula in Excel”

What were the real errors?

They used Excel (subject to debate)
They didn't share their code and data (Vital!)

This 2003 trial, done in Kenya, found that deworming whole schools improved children’s health, school performance, and school attendance.

In 2013, the data was reanalysed independently using new computer programs

Many mistakes found.

Further examples

Software

Software is critical to our research but is treated as a third-class citizen

We have a problem!

Croucher's law

I can be an idiot and WILL make mistakes.

You are no different!

Strategy

Assume Croucher's law is true

Adapt our working practices

Your Analysis?

What you did

Open package foo. Click, Click, drag, Click, Click, Click, Right-Click, Save, 'results.csv'.

Load into Excel. Click, drag, generate graph, right click, save, 'pretty-graph.png'

Your Analysis?

What you said

I analysed my data in foo using the bar analysis. Here's a graph of the results.

How reproducible is a mouse click?

Automate

aka 'learn to program'

The Ideal

Results = TheAnalysis(MyData)
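A minimal sketch of what that one-liner might look like in Python. The function names `the_analysis` and `load_data`, the column name `value`, and the inline data are all hypothetical stand-ins, not taken from the talk; the point is only that every step from raw data to result is written down and can be re-run.

```python
import csv
import io
import statistics


def load_data(handle):
    """Read rows from any open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def the_analysis(rows):
    """Summarise the 'value' column (a hypothetical column name)."""
    values = [float(row["value"]) for row in rows]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


# Stand-in for open("my_data.csv"); inlined so the sketch is self-contained.
my_data = load_data(io.StringIO("sample,value\na,1.0\nb,2.0\nc,3.0\n"))
results = the_analysis(my_data)
print(results)
```

Nothing here depends on a mouse click: anyone with the script and the data file can regenerate the same numbers.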

Reality

Automation is not about time saving

It is knowledge transfer

It is the foundation of reproducible computational research

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)

Write code in a (very) high-level language

Some suggested languages

MATLAB
Python
R
Julia

Why high level languages?

"Programmers write roughly the same number of lines of code per unit time regardless of the language they use"

(Best Practices for Scientific Computing, PLOS Biology, Wilson et al.)

What about speed?

Computer time is cheap. Programmer time is expensive.
We all have supercomputers now!
Ensure it's correct, then worry about speed.
Call MathWorks / your RSE team to help with the slow bits

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language

Share your code and data openly
(As possible)

Openly as possible?

If you can't be fully open, be as open as possible within your organisation

Why share the code and data?

Nothing else contains the information required to fully reproduce your work.

You say

We use K-means in Python with 50 clusters and K-means++ initialisation

You say

No need to share code. It's 2 lines. Trivial!

My results

Also took me 2 lines of Python

My results

Several differences

My results

We used different libraries

Code comparison

Production workflow

Imagine how many gotchas there might be here

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data

Use version control

Is this familiar?

code_ver1.m
code_ver1b_BROKEN.m
code_ver1b_BROKEN_Working_march20.m
code_ver1b_BROKEN_Working_march20_Bobs_mods_ForMike.m

Why version control?

Single Point of Truth for your project
Easy to share and deploy code
Rewind to any point in time
Everything is backed up by default
Can add automated tests later (Continuous integration)
Release management
Project management
Documentation
Know which version gave your results

More reasons?

Demo

https://github.com/mikecroucher/Bobs_code

Use git!

They say: It's too much extra work

Main git workflow

Just 3 commands

Do work on file1, file2 and file6

git add file1 file2 file6
git commit -m "Description of why you modified those files"
git push origin master

Can't use GitHub?

Speak to IT about installing an in-house GitLab instance

https://about.gitlab.com/

The version control life cycle

git? No thanks, I'm scared!
well this is handy.
we're not using git? I'm scared!

source: https://twitter.com/bobearth/status/571154995506122755

True Story

Me: Can I see the code please?
Them: I'll just get the changes from Bob folded in and email it
Me: Shouldn't we be using version control?
Them: No need - it's overkill. We don't have a VC problem.
Me: The code you sent me doesn't work
Them: Sorry. I sent the wrong version.

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data
Use version control

Environment

Someone sends you this

Environment

Your experience

Production

Think of all those constantly shifting dependencies

Describe your environment

...and control it with Conda

Install Miniconda from https://repo.continuum.io/miniconda/

Running our PCA example

You are told it works using scikit-learn 0.17

conda create --name pca_project python=3.5 scikit-learn=0.17 jupyter
conda activate pca_project
jupyter notebook

Running our PCA example

Set up the exact environment I used


git clone https://github.com/mikecroucher/pca_demo
cd pca_demo
conda env create -f environment.yml
conda activate old_scikit
jupyter notebook
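Alongside an environment file, it can help to record which versions actually produced a result. A minimal sketch using only the Python standard library; the function name `describe_environment` and the choice of packages to report are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return interpreter, OS and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


env = describe_environment(["scikit-learn", "jupyter"])
print(json.dumps(env, indent=2))
```

Saving this output next to the results gives a reader something concrete to compare against when their numbers differ from yours.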

Need more?

Virtual machines
Containerisation (Docker, Singularity etc)
FedEx your laptop

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data
Use version control
Share your environment

Get a code buddy (Code Review Light)

Doesn't have to understand your research

Remit: Tell me where I could do better.

Problem 1: Get the code running on THEIR machine

Get a code buddy

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data
Use version control
Share your environment
Get a code buddy

Literate computing

Traditional reports are just advertisements

A Literate computing document IS the research

Literate computing technologies

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data
Use version control
Share your environment
Get a code buddy
Use literate computing technologies

Afraid to change your code?

Write tests

Every decent language has a testing framework
Learn how to use it
You write additional code that ensures your code gives the answers you expect
Tests give you confidence to make changes


		$ nosetests ./unittests.py
		..............................
		----------------------------------------------------------------------
		Ran 30 tests in 0.152s
		
		OK
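A minimal sketch of what such tests can look like, using Python's built-in `unittest` module rather than the nose runner shown above. The function `mean` and the test names are hypothetical examples, not code from the talk.

```python
import unittest


def mean(values):
    """Arithmetic mean; raises ValueError on empty input."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_known_answer(self):
        # A case small enough to check by hand.
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_single_value(self):
        self.assertEqual(mean([5.0]), 5.0)

    def test_empty_input_is_rejected(self):
        # Failing loudly beats silently returning nonsense.
        with self.assertRaises(ValueError):
            mean([])


# Run the tests programmatically; from a shell, `python -m unittest` also works.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test pins down one answer you already trust, so a later change that breaks it is caught immediately.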

Problem

I am an idiot and will make mistakes

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data
Use version control
Share your environment
Get a code buddy
Use literate computing technologies
Write tests

Numerical Computing is hard!

Hypotenuse of a triangle

Easy!

h = sqrt(x*x + y*y)

So why is the hypot function in math.h, Python and MATLAB?

A better hypot

max = maximum(|x|, |y|)
min = minimum(|x|, |y|)
r = min / max
return max*sqrt(1 + r*r)
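The difference between the two versions can be sketched in Python. The names `naive_hypot` and `scaled_hypot` are illustrative, and the zero check is an addition the pseudocode above does not include (without it, two zero inputs would divide by zero). This is still far from a production implementation: it does not treat infinities or NaNs specially.

```python
import math


def naive_hypot(x, y):
    """Textbook formula: x*x can overflow even when the answer fits."""
    return math.sqrt(x * x + y * y)


def scaled_hypot(x, y):
    """Scale by the larger magnitude so the squared ratio is at most 1."""
    big = max(abs(x), abs(y))
    small = min(abs(x), abs(y))
    if big == 0.0:
        return 0.0  # both inputs are zero; avoid dividing by zero
    r = small / big
    return big * math.sqrt(1.0 + r * r)


# With large inputs the naive form overflows to infinity,
# while the scaled form stays finite and agrees with math.hypot.
print(naive_hypot(1e200, 1e200))
print(scaled_hypot(1e200, 1e200))
print(math.hypot(1e200, 1e200))
```

The intermediate `x * x` exceeds the largest representable double even though the true hypotenuse is well within range, which is why library versions avoid squaring the raw inputs.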

Real-world hypot

openlibm - 132 lines of code

https://github.com/JuliaMath/openlibm/blob/master/src/e_hypot.c

Implementation details matter

10,000+ times speed difference between worst and best of the same algorithm

Which algorithms interest you the most?

Tell me at:

(Partial) Solutions

Automate (aka learn to program)
Write code in a (very) high-level language
Share code and data
Use version control
Share your environment
Get a code buddy
Use literate computing technologies
Write tests

Is this enough?

No!

You are not alone!

MathWorks

Products (We build stuff)
Consultancy (We help you do stuff)
Training (We teach you stuff)
Research (We develop new stuff)
Open source (We give some stuff away)
Community

Link to these slides: https://mikecroucher.github.io/reproducible_ML/

Where my ideas came from

Old blog post, from which this talk grew.
Best Practices for Scientific Computing - Wilson et al. Go here for more advanced tips.
Research In Progress - Most of the animated gifs
PhD Comics - Finding humour in the academic way of life
xkcd - A webcomic of romance, sarcasm, math, and language.
Scientists Are Hoarding Data And It’s Ruining Medical Research - Ben Goldacre
"Literate Computing" and computational reproducibility - Fernando Perez
What's so hard about finding a hypotenuse?

Where my ideas came from: Twitter

Resources

Software Carpentry - Learn the craft
Python Testing - Everything you need to know about writing tests in Python
Version Control and Unit Testing for Scientific Software, SciPy2013 Video Tutorial.
How do we know Research Software is Correct? From the Software Sustainability Institute
