When you're starting out in a new field, it can be difficult to keep experimental and production code separate. Using Jupyter notebooks in my development stack has helped me experiment and organise my thoughts without cluttering up my codebase.
Like the rest of the Python community, I've recently been noodling around with a few ideas involving Large Language Models, semantic search and the like. Having first played with vector spaces back in the early noughties, I'm initially drawn to the possibilities around sentence embeddings and using them for Retrieval Augmented Generation (RAG), semantic search, and so on.
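As a rough illustration of what sentence embeddings give you - a sketch only, assuming sentence-transformers is in requirements.txt and using all-MiniLM-L6-v2 purely as a stand-in model - comparing a few sentences takes only a handful of lines:

from sentence_transformers import SentenceTransformer, util

# Small, fast model - a placeholder for whichever model the experiments settle on
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The invoice must be paid within 30 days.",
    "Payment is due one month after the invoice date.",
    "The cat sat on the mat.",
]
embeddings = model.encode(sentences)

# Cosine similarity: the first two sentences should score much higher than the third
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))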
The thing is, this is going to take a lot of trial and error. It's pretty obvious what production stack I need - a Python API framework (FastAPI) and a vector store (Weaviate). Those can go straight into my docker-compose.yml file. But I'm going to need to experiment with a lot of different models and approaches. I have clients with large corpora of PDF files, so it makes sense to start with those. While getting text out of these is relatively straightforward, you can't preserve the formatting of the document beyond sentences and page numbers. Titles, image captions, bullet points and tables are all included in the raw text, but without any kind of structure.
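For reference, getting the raw text out really is just a few lines - a minimal sketch using poppler's pdftotext (which is why poppler-utils appears in the Dockerfile below); the file path is only an example:

import subprocess
from pathlib import Path

def pdf_to_text(pdf_path: Path) -> str:
    """Extract raw text from a PDF using poppler's pdftotext ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

text = pdf_to_text(Path("reports/annual-report.pdf"))  # example path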
So, when faced with a long stream of text, how best to chunk it up into meaningful semantic parcels? The LangChain approach seems to be to choose a context window (e.g. 250 words) and then split the text into chunks of that size with an overlap either side. Playing with different context window sizes and overlaps is going to be a big part of the experimentation process. An alternative is to calculate each sentence's similarity with its neighbours, and use that to split the text into semantically-similar chunks of varying length, which may or may not match up with the paragraphs in the original document.
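To make the two approaches concrete, here's a plain-Python sketch of both - no LangChain involved, and the window size, overlap, similarity threshold and model name are all just starting points for experimentation:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def chunk_fixed(words: list[str], window: int = 250, overlap: int = 50) -> list[str]:
    """Fixed-size chunks of `window` words, each overlapping the previous chunk by `overlap` words."""
    step = window - overlap
    return [" ".join(words[i:i + window]) for i in range(0, len(words), step)]

def chunk_semantic(sentences: list[str], threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever a sentence is dissimilar to the one before it."""
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(util.cos_sim(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks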
All this means a lot of trial and error, with loads of code being written and discarded. Ideally, I don't want to clutter the production codebase with all this experimental stuff, or have to worry at this point about where each bit fits in with a particular framework or API structure.
In the past, this has led to a lot of seemingly-random Python files, or redundant request stubs if the experimental code relies on anything from the framework. There ends up being loads of commented-out code, because if an approach doesn't work the first time you don't want to delete all that effort in case it's useful later. I still need this code to be in version control, but I don't want it cluttering up the production code.
And this is where Jupyter notebooks come in. It's really easy to add an extra service to my development docker-compose.yml - there's a choice of Jupyter Docker images, depending on how much data science you want to get into. Once this is added and built, it uses the same requirements.txt file as the rest of the stack, so dependencies stay consistent across the project.
docker-compose.yml
…
  jupyter:
    platform: linux/amd64
    build:
      context: .
      dockerfile: jupyter.Dockerfile
    restart: on-failure
    command: start.sh jupyter notebook --NotebookApp.token="${JUPYTER_TOKEN}" --no-browser --allow-root
    volumes:
      - ./_notebooks:/home/jovyan/work
    ports:
      - "8888:8888"
…
jupyter.Dockerfile
FROM jupyter/datascience-notebook:latest
USER root
RUN apt-get update && \
    apt-get install -y poppler-utils libmagic1 && \
    apt-get clean
WORKDIR /home/jovyan/work
COPY ./requirements.txt /home/jovyan/work/requirements.txt
RUN --mount=type=cache,mode=0777,target=/root/.cache/pip pip install --upgrade -r requirements.txt
COPY ./_notebooks /home/jovyan/work
Then you can start creating notebooks to experiment with code, and it's easy to keep track of what you've done: markdown cells to explain what you're doing, code cells to try out different approaches, and even images and videos so you can see how the results look. I have found that it changes the way my process works. Initial development is more of a stream of consciousness, with lots of trial and error and no deleting of bad ideas; then, once I have a working approach, I can refactor it into a more structured form.
I tend to split different ideas up into different notebooks. And it can all be done within Visual Studio Code (which supports Jupyter notebooks after a bit of setup), so I still have access to the witchcraft of GitHub Copilot.
I'm still at the early stages of this, but I'm finding it a really useful addition to my development stack. It's not quite the utopia of Donald Knuth's vision of literate programming, but I'm sure there are loads of other ways to use Jupyter notebooks, and I'm looking forward to finding out more.
QUICK TIP: Remove output cells from notebooks before checking in to Git
docker-compose --env-file dev.env exec jupyter jupyter nbconvert --ClearOutputPreprocessor.enabled=True --clear-output *.ipynb
If you would like to find out more about Python development, please do not hesitate to get in touch.