Libro Library Management System
Bibliographic record

FAIRJupyter4AI: A Corpus of Computational Notebooks for AI

Authors
Daniel Mietchen, Sheeba Samuel
Publication year
2025
OA status
gold

Abstract

Computational notebooks like Jupyter have transformed scientific and educational workflows in computational fields by combining code, text, and visualizations. They have also become a popular mechanism to share computational workflows. However, ensuring their reproducibility remains a persistent challenge due to often insufficiently documented direct and indirect dependencies, missing data, and inconsistencies in execution environments. Existing datasets lack the multimodal, fine-grained structure needed for AI applications. FAIRJupyter4AI aims to address this gap by creating a large-scale, AI-ready corpus of Jupyter notebooks enriched with executable code, markdown, outputs, and structured annotations. The project integrates these into a hybrid knowledge graph (KG) that incorporates symbolic, statistical, and execution-based representations. Key objectives include: curating diverse notebooks (initially Python, later R, with provisions for additional languages); automating reproducibility testing; building a KG for cross-notebook queries; training AI models for tasks like error repair and notebook generation; and fostering community use via APIs and integration with community platforms like NFDI or Hugging Face.
The project will be implemented using the infrastructure established by the NFDI Basic Service Jupyter4NFDI, in whose upcoming Integration Phase (October 2025 to September 2027) the applicants are actively involved. Its central JupyterHub provides cross-consortial and cross-institutional access to scalable computing and data resources and associated software stacks for both research and training purposes.
The FAIRJupyter4AI work programme is structured around five interlinked work packages: (1) Data Collection & Curation, (2) Reproducibility Assessment, (3) Knowledge Graph Development, (4) AI Model Training, and (5) Communication, Community & Sustainability.
Key innovations include continuous updates and enrichment pipelines (avoiding static snapshots), unifying multimodal content for AI, and bridging reproducibility with AI. Building on prior work involving 27,000+ notebooks and the FAIR Jupyter Knowledge Graph, FAIRJupyter4AI will curate, annotate and release over 20,000 notebooks that are research-related and openly licensed. In addition, we will share a metadata corpus for 50,000 research-related notebooks, along with open-source tools, models, and associated documentation. By making Jupyter notebook metadata FAIR, reusable, and machine-understandable, this project will set a new standard for reproducible and AI-enhanced computational science, and it will open up new opportunities for learning and teaching about computational reproducibility across multiple domains of research.
