background-image: url(photos/colton-sturgeon-6KkYYqTEDwQ-unsplash.jpg) background-size: cover count: false .column.shade_black[.content[ <br> # .monash-white.outline-text[Practical Steps Towards Reproducibility] <h2 class="monash-white outline-text" style="font-size: 30pt!important;"></h2> <br> <h2 style="font-weight:900!important;"></h2> .bottom_abs.width100[ *Dr Patricia Menéndez* Department of Econometrics and Business Statistics <br> Monash University, Melbourne <br> ] ]] --- # You are working on a project and it is going great! <br> <img src="gifs/Happy New Year Celebration GIF by Faith Holland - Find & Share on GIPHY.gif" width="80%" style="display: block; margin: auto;" /> --- # You return to the same project a few months later <br> <br> <img src="photos/martijn-baudoin-4h0HqC3K4-c-unsplash.jpg" width="60%" style="display: block; margin: auto;" /> <br> .small[Photo by <a href="https://unsplash.com/@maguay?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Matthew Guay</a> on <a href="https://unsplash.com/s/photos/messy-cable?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>] --- class: transition, center # Critical issue <br><br> ![](https://media.giphy.com/media/11fDMHAzihB8D6/giphy.gif) <br> .small[https://media.giphy.com/media/11fDMHAzihB8D6/giphy.gif] --- # What could possible go wrong??!! <br> .pull-left[ Which file was the last one? <img src="figs/scott-webb-YBwPrBiccX4-unsplash.jpg" width="50%" style="display: block; margin: auto;" /> .small[Photo by <a href="https://unsplash.com/@scottwebb?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Scott Webb</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] ] -- .pull-right[ Did I modify my raw data? <img src="photos/documents.png" width="40%" style="display: block; margin: auto;" /> ] -- <br> .pull-left[ Got a new computer? <img src="figs/helloquence-5fNmWej4tAA-unsplash.jpg" width="60%" style="display: block; margin: auto;" /> .small[Photo by <a href="https://unsplash.com/@homajob?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Scott Graham</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] ] -- <br> .pull-right[ Copy & paste in the wrong place? <img src="https://media.giphy.com/media/l0HlQXlQ3nHyLMvte/giphy.gif" width="60%" style="display: block; margin: auto;" /> .small[https://media.giphy.com/media/l0HlQXlQ3nHyLMvte/giphy.gif] ] --- # Live our lives peacefully <img src="photos/colton-sturgeon-6KkYYqTEDwQ-unsplash.jpg" width="80%" style="display: block; margin: auto;" /> .small[Photo by <a href="https://unsplash.com/@coltonsturgeon?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Colton Sturgeon</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] --- class: left, top # Reproducible research and replicability <br><br> **Definitions by the USA National Academies of Science, Engineering and Medicine**: <br> - .green[**Reproducibility**] ("computational reproducibility") means obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis. - .green[**Replicability**] means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data. <br> [Report on reproducibility and replicability](https://www.nap.edu/read/25303/chapter/1#xix) --- class: left, top # ~~Figure it out~~ <br><br> .pull-left[ <img src="figs/ryan-quintal-US9Tc9pKNBU-unsplash.jpg" width="50%" style="display: block; margin: auto;" /> <img src="figs/karla-hernandez-LrlyZzX6Sws-unsplash.jpg" width="50%" style="display: block; margin: auto;" /> ] <br><br> .pull-right[ - **Pieces** - **Instructions/manual** - **Tools** <br><br> .small[Photo by <a href="https://unsplash.com/@xavi_cabrera?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Xavi Cabrera</a> on <a href="https://unsplash.com/s/photos/lego?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>] .small[Photo by <a href="https://unsplash.com/@karlahrnndz?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Karla Hernandez</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] ] <!-- --- --> <!-- class: left, top --> <!-- # Reproducible research --> <!-- * Working to make your research reproducible does require extra upfront effort. --> <!-- * Making a project reproducible from the start encourages you to use better --> <!-- work habits. --> <!-- * It should push you to bring your data and source code up to a higher --> <!-- level of quality than you might if you “thought ‘no one was looking’ ” [Donoho, --> <!-- 2010, 386]. --> <!-- * Reproducible research needs to be stored so that other researchers can actually access the data and source code. --> <!-- * Changes are easier to implement `\(\rightarrow\)` specially when using dynamic reproducible documents. --> <!-- * Reproducible research has higher impact. --> --- class: left, top # Philosophy <br><br> Reproducibility is a way of thinking and approaching projects: <br> .pull-right[ <img src="photos/diego-ph-fIq0tET6llw-unsplash.jpg" width="50%" style="display: block; margin: auto;" /> ] .pull-left[ .content-box-neutral[ * Requires planning. * Needs extra upfront effort. * Demands us to be organized. * Challenges us to think more broadly. ] ] <br> .small[Photo by <a href="https://unsplash.com/@jdiegoph?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Diego PH</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>] --- class: left, top # Reproducibility complexity <br><br> .green[**Complexity varies:**] <br> .content-box-soft[ - Some projects require a single tool (R, Python, Matlab or Genstat for example) and involve only one person. ] .content-box-soft[ - Other projects might involve different teams and require many different tools. ] .pull-right[ <img src="photos/hiclipart.png" width="30%" style="display: block; margin: auto;" /> ] --- # Project example <br> <img src="figs/environmental-data-science-r4ds-general.png" width="85%" style="display: block; margin: auto;" /> .small[Artwork by @allison_horst] --- class: center, middle # Example: R, Rstudio and Rmarkdown files <br> .pull-left[ <img src="figs/first-rmarkdown-code.png" width="90%" /> ] <br> .pull-right[ <img src="figs/first-rmarkdown-output.png" width="90%" /> ] --- # Literate programming <br><br> .content-box-neutral[ Literate programming is an approach to writing reports using software that weaves together the source code and text at the time of creation. ] <br> .blue[Donald Knuth coined the term literate programming in the 1970s to refer to a source file that could be both run by a computer and “woven” with a formatted presentation document [Knuth, 1992].] --- # Complex workflow example <img src="photos/ereefs.png" width="100%" style="display: block; margin: auto;" /> .pull-left[ <img src="photos/ereefs1.png" width="150%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="photos/ereefs2.png" width="70%" style="display: block; margin: auto;" /> ] <br> .small[https://www.ereefs.org.au/about/] --- # Complex projects need more than literate programming <img src="photos/umberto-FewHpO4VC9Y-unsplash.jpg" width="80%" style="display: block; margin: auto;" /> .small[Photo by <a href="https://unsplash.com/@umby?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Umberto</a> on <a href="https://unsplash.com/@umby?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] <br><br> Literate programming --- class: transition <br><br> #Reproducibility set up will depend on the project <br><br> General practical tips for reproducible workflows <br> .orange[There is no one-size-fits-all approach!] --- class: transition # 1. Plan in adavance <br> - Plan the type and scale of the project for reproducibility (as much as possible!). - Brainstorm about how different components are to be connected in a reproducible way. Update when necessary. - .orange[.bold[Think about your "future self" and others.]] - Set up a version control system. - Plan for literate programming and scripting as much as possible. - Keep revising and updating the plans as you go. --- # Workflow organization Decide how you are going to organize your project workflow and prepare the scaffolding for that. <br> - For example, think about the project files structure - In particular, where is the data going to be stored and how? - Where will you place the scripts and codes? - How are other things going to be organized in the project? - How are the different elements going to be linked together? - Where will you place the project documentation? <br> .content-box-soft[ .blue[Think about someone else who have no clue about this project and will need to run it in 3 years time!] ] --- class: middle, top # What is a version control system? <img src="figs/marco-lermer-bHZ2_D4bTfg-unsplash.jpg" width="40%" style="display: block; margin: auto;" /> <br> .content-box-soft[.bold[.green[Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.]]] <br> .green[It also allows us to collaborate and share our projects with others!] .small[<span>Photo by <a href="https://unsplash.com/@malepics?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Marco Lermer</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></span>] --- class: left, middle # Version control <br> - .content-box-soft[Version control systems are a category of software tools that help store and manage changes to source code (projects) over time.] - .content-box-soft[Version control software keeps track of every modification to the source code in a special database.] - .content-box-soft[If a mistake is made, you can turn back to previous versions and compare the code to fix the problem while minimizing disruption.] - .content-box-soft[It is easy to manage multiple versions of a project ] - .content-box-soft[It is a very useful (actually essential!) tool for collaborating and for sharing open source resources.] --- class: left, top # Git <br> .pull-left[ <img src="figs/Git-Logo-Black.png" width="70%" /> ] .pull-right[ <img src="figs/git_tree2.jpg" width="70%" /> <br> ] <br><br> "Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows" .small[https://en.wikipedia.org/wiki/Git] --- class: left, top #GitHub, Bitbucket and others <br> .pull-left[ <img src="figs/bitbucket.png" width="100%" /> ] .pull-right[ <img src="figs/GitHub_Logo.png" width="70%" /> ] <br> * .green[GitHub/Bitbucket] are code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere. * Both are cloud-based hosting service that lets you manage Git repositories. --- class: center, top # Git and GitHub <br> <img src="figs/Gitvs.Github-1a.jpg" width="60%" /> .small[Source: https://blog.devmountain.com/git-vs-github-whats-the-difference/] --- # Sharing: team work / open source <br> <img src="figs/sharing.jpg" width="90%" style="display: block; margin: auto;" /> --- # GUI clients .pull-left[ <img src="figs/guis.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="figs/shell.png" width="100%" style="display: block; margin: auto;" /> <br> <br> - Have a look at more GUI clients [here](https://git-scm.com/downloads/guis). <br> - A great Git open source book [here](https://git-scm.com/book/en/v2). ] --- class: transition # 2. File system for the project <br><br> - Organize project files for reproducibility. - Think carefully about files accessibility. - Consider the different types of files in the project. - Prepare for storage requirements. - Think about the connectivity between the different components. .pull-right[ <img src="photos/hiclipart2.png" width="40%" style="display: block; margin: auto;" /> ] --- class: center, top # Computer paths <br> <img src="figs/nathan-anderson-uq5JjGK_4SE-unsplash.jpg" width="60%" /> Where are files and folders stored in our computer? <br> .small[Photo by <a href="https://unsplash.com/es/@nathananderson?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Nathan Anderson</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] --- class: left, top # Computer paths <br><br> * A .purple[_path_ is] the complete location or name of where a computer file, directory, device, or web page is located. ## Examples - .bolder[Windows] `\(\rightarrow\)` .purple[C:\documents\example] - .bolder[Mac/linux] `\(\rightarrow\)` .purple[/Users/documents/example] --- class: left, top # Absolute and relative paths <br> * .green[**An absolute or full path**] contains the computer root's directory and all other sub directories where a file is located. .purple["C:\documents\example\file.txt"] <br> * .green[**Relative path**] refers to a location that is .green[relative to a current directory] (or folder). .purple["example\file.txt"] <br> .content-box-neutral[ A reproducible project should not have absolute paths! ] --- # Neat file system <br> .pull-left[ <img src="figs/omid-kashmari-s34f0Wxbens-unsplash.jpg" width="67%" /> .small[Photo by <a href="https://unsplash.com/@omidkashmari?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Omid Kashmari</a> on <a href="https://unsplash.com/@omidkashmari?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] ] .pull-right[ <img src="figs/sharon-mccutcheon-tn57JI3CewI-unsplash.jpg" width="60%" /> .small[Photo by <a href="https://unsplash.com/@sharonmccutcheon?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Alexander Grey</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] ] --- class: transition # Project organization: Examples <img src="photos/project1.png" width="80%" style="display: block; margin: auto;" /> --- class: transition # Project organization: Examples <img src="photos/project.png" width="80%" style="display: block; margin: auto;" /> --- class: transition # 3. Accessible connected workflow <img src="photos/tamas-tuzes-katai-KJA1xo42dr4-unsplash.jpg" width="40%" style="display: block; margin: auto;" /> .small[Photo by <a href="https://unsplash.com/@tamas_tuzeskatai?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Tamas Tuzes-Katai</a> on <a href="https://unsplash.com/@tamas_tuzeskatai?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] --- # Data handling <br><br> - Record the **provenance** and licensing of the data. - Use raw data and keep data files as **read-only** whenever possible. - Integrate all the data preparation and data wrangling in the project to ensure reproducibility: <br> .content-box-soft[ * In some projects data wrangling might be done using different software. * When different software is used (SQL/SAS/Excel/Python) all the steps and how they connect must be documented. ] --- class: transition # 4. Documentation <br> - Document for reproducibility from day 0. - Describe all the elements of the project. - Keep a journal for manual steps. - Be specific about documenting how different components are linked together to ensure reproducibility. - Data provenance, licensing, scripts, any tools used in the project. - Clean up when necessary! .pull-right[ <img src="photos/hiclipart2.png" width="30%" style="display: block; margin: auto;" /> ] <br> .orange[Think about your future self and others that might need to use the project.] --- # README example <br> README example from [UNCTAD](https://github.com/dhopp1-UNCTAD/nowcast_data_update): **United Nations Conference on Trade and Development**. <img src="photos/unctad.png" width="60%" style="display: block; margin: auto;" /> .small[https://github.com/dhopp1-UNCTAD/nowcast_data_update] --- # README complex example <br> **AIMS eReefs Visualisations and aggregations platform documentation repository.** <img src="photos/aims.png" width="50%" style="display: block; margin: auto;" /> .small[https://github.com/open-AIMS/ereefs-documentation] <!-- --- --> <!-- class: transition --> <!-- # Practical tips --> <!-- - Plan the type and scale of the project --> <!-- - Think carefully about how the pieces can be connected --> <!-- - From day one: documentation data provenance and licensing/ scripts/ tools ... --> <!-- - Think about your future self and other users --> <!-- - Work maintenance (team) --> <!-- - External services / dependencies --> --- class: transition # 5. Code Environment Container <br> **Have you ever encountered the following issue?** <br> - My code run 6 months ago (.green[I am sure!]) and now it is not working. - My figures look funny ... - Mostly it is all about your program versions!! <br> .pull-right[ <img src="figs/markus-spiske-QtFAXP6z0Wk-unsplash.jpg" width="35%" style="display: block; margin: auto;" /> ] .small[Photo by <a href="https://unsplash.com/@markusspiske?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Markus Spiske</a> on <a href="https://unsplash.com/@markusspiske?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>] --- # Light weight dependency management <br> .content-box-neutral[ There are a couple of R dependency management solutions available: .green[packrat] and .green[renv] packages offer capabilities for dependency management.] <img src="figs/renvlogo.png" width="20%" style="display: block; margin: auto;" /> The idea is to create .green[a project-local-library] to ensure that projects get their own unique library of R packages! --- # renv might not be enough <br> <img src="photos/renvdocker.png" width="100%" style="display: block; margin: auto;" /> <br> .small[https://rstudio.github.io/renv/articles/docker.html] --- # Docker <br> "A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. Container images become containers at runtime and in the case of Docker containers – images become containers when they run on Docker Engine." <br> Instructions for container environment. .pull-right[ <img src="figs/Docker_logo_PNG2.png" width="40%" style="display: block; margin: auto;" /> ] .small[https://www.docker.com/resources/what-container/] --- # Record manually <br> <img src="figs/ilya-pavlov-hXrPSgGFpqQ-unsplash.jpg" width="70%" style="display: block; margin: auto;" /> <br> Written instructions .small[Photo by <a href="https://unsplash.com/@ilyapavlov?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Ilya Pavlov</a> on <a href="https://unsplash.com/@ilyapavlov?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a> ] --- class: transition # 6. Add a license <br> .content-box-neutral[ <img src="photos/Cc.logo.circle.svg" width="30%" style="display: block; margin: auto;" /> ] .small[By Creative Commons, fixed by Quibik - Official Creative Commons' base, Public Domain, https://commons.wikimedia.org/w/index.php?curid=1484324] --- # Licensing an open source repository <br> Public repos in GitHub make your work publicly available and therefore it is important to establish how your work should be acknowledged if someone else wants to use it. <br> .content-box-neutral[ "Public repositories on GitHub are often used to share open source software. For your repository to truly be open source, you'll need to license it so that others are free to use, change, and distribute the software." ] [More info here]( https://help.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository) --- # Available licenses in GitHub <br> <img src="figs/githubli.png" width="100%" style="display: block; margin: auto;" /> <br> [More info here](https://help.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository#detecting-a-license) --- # Choose an open source license <img src="figs/chooselicense.png" width="100%" style="display: block; margin: auto;" /> [Source here](https://choosealicense.com/) --- # No license <img src="figs/no-permission.png" width="100%" style="display: block; margin: auto;" /> [Source here](https://choosealicense.com/no-permission/) --- class: transition # Recommendations summary <br> - 1: Plan in advance - 2: Consider adequate file system for the project - 3: Create accessible connected workflows - 4: Document, document, document - 5: Consider using a code environment container - 6: Add a license .pull-right[ <img src="photos/hiclipart2.png" width="35%" style="display: block; margin: auto;" /> ] <br> .orange[Think about all this in advance. It is worth your while!] --- class: transition ## Many thanks! <img src="photos/umberto-FewHpO4VC9Y-unsplash.jpg" width="65%" style="display: block; margin: auto;" /> patricia.menendez@monash.edu <br> @PM_maths <br> https://okayama1.github.io/Australassian-applied-statistics-conference-talk/