Data Management, Reproducibility & Research Integrity

Updated: January 2025 · Reading time: 10–12 minutes

1. Why Data Management Matters for Your PhD

Many PhD projects start informally: files on a laptop, a few spreadsheets, some scripts in random folders. This may work for a few weeks, but over several years it becomes a serious risk to your thesis, your publications, and your reputation.

Good data management is not extra work; it is how you protect your results, reuse your own work efficiently, and show others that your conclusions can be trusted.

2. Structuring Your Project Folders

Create a consistent structure for each project so you and your collaborators can quickly understand where everything lives. For example:

data_raw/ — original, unmodified data.
data_processed/ — cleaned and derived datasets.
scripts/ — analysis and processing code.
results/ — figures, tables, model outputs.
docs/ — notes, protocols, ethics approvals.

Add a short README.md in the main folder explaining the structure and how to run your analysis.
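The layout above can be set up by hand, but a small script keeps it identical across projects. The following is a minimal sketch in Python, assuming the folder names listed above; the function name `create_project` and the README wording are illustrative choices, not a standard.

```python
from pathlib import Path

# Folder names follow the layout described above; the descriptions are
# only used to seed the README.
FOLDERS = {
    "data_raw": "original, unmodified data",
    "data_processed": "cleaned and derived datasets",
    "scripts": "analysis and processing code",
    "results": "figures, tables, model outputs",
    "docs": "notes, protocols, ethics approvals",
}


def create_project(root):
    """Create the standard folders and a starter README.md under root."""
    root = Path(root).resolve()
    for name in FOLDERS:
        # exist_ok=True makes the function safe to re-run on an existing project.
        (root / name).mkdir(parents=True, exist_ok=True)

    readme = root / "README.md"
    if not readme.exists():  # never overwrite a README you have already edited
        lines = [f"# {root.name}", "", "## Folder structure", ""]
        lines += [f"- `{name}/`: {desc}" for name, desc in FOLDERS.items()]
        lines += ["", "## How to run the analysis", "", "TODO: describe the steps here.", ""]
        readme.write_text("\n".join(lines), encoding="utf-8")
    return root
```

Calling `create_project("my_project")` creates the five folders and a starter README. Running it again on the same folder leaves existing files untouched.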

3. Version Control for Code and Text

Use a version control system such as Git to track changes to your code, analysis notebooks, and even parts of your thesis. This allows you to:

See what changed, when, and why.
Experiment safely using branches.
Collaborate without overwriting each other’s work.

Repositories on hosting platforms (e.g., GitHub, GitLab, institutional servers) can be kept private during the project and made public later, depending on your data and agreements.

4. Documentation and Reproducible Workflows

Reproducibility means that someone with access to your data and code could obtain the same results you report. To move in this direction:

Write scripts that go from raw data to final figures where possible.
Record software versions, packages, and important settings.
Comment your code with the why, not only the how.
Store analysis decisions (e.g., exclusion criteria) in a document.
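One way to record software versions is to save a small snapshot file every time you produce results. The sketch below uses only the Python standard library; the function names and the JSON layout are assumptions made for illustration, and you would pass in the packages your own analysis depends on.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def save_snapshot(path, packages):
    """Write the snapshot as JSON, for example next to the results it describes."""
    info = snapshot_environment(packages)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(info, fh, indent=2)
    return info
```

For example, `save_snapshot("results/environment.json", ["numpy", "pandas"])` would store the versions of those two packages (if installed) alongside your outputs. The package names here are placeholders.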

5. Backups and Security

Data loss can delay your PhD by months. As a minimum, follow the “3‑2‑1” rule:

3 copies of your data,
stored on 2 different types of media,
with 1 copy off‑site or in the cloud.
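A copy only counts as a backup if it matches the original. The sketch below compares a source folder with a backup folder using SHA-256 checksums. It is one possible check rather than a full backup tool, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_backup(source_dir, backup_dir):
    """Return (relative_path, problem) pairs for files missing or different in the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue  # skip directories
        rel = src.relative_to(source_dir)
        copy = backup_dir / rel
        if not copy.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif sha256_of(src) != sha256_of(copy):
            problems.append((rel.as_posix(), "differs"))
    return problems
```

An empty list from `compare_backup("data_raw", "/path/to/backup/data_raw")` means every source file has an identical copy in the backup. The backup path shown is a placeholder. Extra files that exist only in the backup are not reported.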

For sensitive or personal data, follow your institution’s security requirements and anonymisation procedures. Never store such data on unencrypted devices or in personal cloud accounts without approval.

6. Ethical and Legal Considerations

Research integrity is about more than avoiding fabrication or plagiarism. It includes respecting participants, collaborators, and funders. Make sure you:

Have ethics approval where required and follow the protocol exactly.
Obtain informed consent and respect withdrawal requests.
Handle anonymisation and de‑identification carefully: removing names alone may not prevent re‑identification when other details are combined.

When in doubt, ask your supervisor or ethics office before collecting or sharing data.
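As an illustration of one de-identification step, the sketch below replaces a direct identifier (such as an email address) with a stable code derived from a secret key. This is pseudonymisation, not full anonymisation: anyone holding the key can regenerate the codes, and other columns may still identify a person. The function name, the `P-` prefix, and the code length are assumptions. Your institution's procedures take precedence over this example.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key, length=12):
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    secret_key must be bytes and should be stored separately from the data.
    """
    # Normalise so that "Alice@example.org " and "alice@example.org" match.
    normalised = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
    # Truncating keeps codes readable; collisions become more likely as the
    # number of participants grows, so check for duplicates in your own data.
    return "P-" + digest[:length]
```

The same identifier and key always produce the same code, so records can still be linked across files without exposing the original identifier.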

7. Sharing Data and Code Responsibly

Many journals and funders now encourage or require data and code sharing. When sharing is allowed, depositing your work in a repository (e.g., an institutional repository or a discipline‑specific archive) can increase its visibility and impact.

If full sharing is not possible (e.g., due to privacy), consider sharing simulated data, partial datasets, or detailed protocols so others can still understand your methods.
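Simulated data can be as simple as a file with the same columns as the real dataset but randomly generated values, so that others can run your scripts end to end. The sketch below assumes a hypothetical study with the columns `participant_id`, `age`, `group`, and `score`. The value ranges are invented for illustration and are not derived from any real data.

```python
import csv
import random


def simulate_participants(n, seed=0):
    """Generate n fake participant rows; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "participant_id": f"SIM-{i:04d}",
            "age": rng.randint(18, 65),
            "group": rng.choice(["control", "treatment"]),
            "score": round(rng.gauss(50.0, 10.0), 1),
        })
    return rows


def write_csv(rows, path):
    """Write the simulated rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Label such files clearly as simulated, for instance in the file name and the README, so nobody mistakes them for real results.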

8. Building Integrity into Everyday Practice

Integrity is not just a final check before publication; it is a daily habit. Small actions add up:

Record decisions immediately after you make them.
Be honest about limitations and unexpected results.
Credit others’ work clearly in your writing and presentations.

By investing in data management and reproducible practices, you protect your thesis, strengthen your publications, and contribute to a more trustworthy academic ecosystem.
