
\documentclass[a4paper,11pt]{article}
\usepackage[margin=1.3in]{geometry}
\usepackage[most]{tcolorbox}
\usepackage{xcolor}
\usepackage{tikz}
\usepackage{fancyhdr} % for headers
% \usepackage[citestyle=verbose-ibid, backend=biber, autocite=footnote]{biblatex} % Footnote references. Use autocite{}.
\usepackage{biblatex}
\usepackage{float}
\usepackage{fontspec}

% --- Configuration ---
\addbibresource{src/references.bib}
\setmonofont[Scale=0.85, ItalicFont=Hermit Light]{Hermit Light}
% \pagestyle{fancy}
% \setlength{\parskip}{6pt}
% \setlength{\parindent}{0pt}

% \fancyfoot{}
% \lhead{\rightmark}
% \rhead{\thepage}
% \fancyheadoffset{0.005\textwidth}

\setlength{\parskip}{5pt}

\begin{document}
    \section{Abstract}
    \begin{tcolorbox}[colback=lightgray!30!white]
    Explain briefly the paper and what it does.
    \end{tcolorbox}

    \section{Introduction}

    \textit{Scientific Workflow Management Systems} (SWMSs) are an essential tool for automating, managing, and executing complex scientific processes involving large volumes of data and computational tasks\footnote{citation?}. Traditional SWMSs employ a linear sequential approach, in which tasks are performed in a pre-defined order, as defined by the workflow. While this linear method is suitable for certain applications, it might not always be the best choice: processing sequentially can prove inefficient in cases where the next step of the process should adapt to the previous one. For these use-cases a dynamic scheduler is required, of which \textit{Managing Event Oriented Workflows}\autocite{DavidMEOW} (MEOW) is one.

    \begin{tcolorbox}[colback=lightgray!30!white]
        Expand on DAGs' inability to adapt
    \end{tcolorbox}

    MEOW employs an event-based scheduler, in which jobs are performed non-linearly, triggered based on events\footnote{citation?}. By dynamically adapting the execution order based on the outcomes of previous tasks or external factors, MEOW provides a more efficient and flexible solution for processing large volumes of experimental data\footnote{citation?}.


    \begin{tcolorbox}[colback=lightgray!30!white]
        \begin{itemize}
            \item What work am I doing on MEOW?
            \item How did it go?
            \item Introduce the concept of network events.
            \item \textbf{Write this last}
        \end{itemize}
    \end{tcolorbox}

    \subsection{Problem}

    In its current implementation, MEOW is able to trigger jobs based on changes to monitored local files. This covers a range of scenarios where the data processing workflow involves the creation, modification, or removal of files. By monitoring file events, MEOW's event-based scheduler can dynamically execute tasks as soon as the required conditions are met, ensuring efficient and timely processing of the data. Since the file monitor is triggered by changes to local files, MEOW is limited to local workflows.

    While file events work well as a trigger on their own, there are several scenarios where a different trigger would be preferred or even required, especially when dealing with distributed systems or remote operations. To address these shortcomings and further enhance MEOW's capabilities, the integration of network event triggers would provide significant benefits in several key use-cases.

    Firstly, network event triggers would allow for manual triggering of jobs remotely, without the need for direct access to the monitored files. This is particularly useful in scenarios where human intervention or decision-making is required before proceeding with the subsequent steps in a workflow. While it is possible to manually trigger a job using file events by making changes to the monitored directories, doing so risks an already running job accessing the same files at the same time, which could compromise data integrity.

    Secondly, incorporating network event triggers would facilitate seamless communication between parallel runners, ensuring that tasks can efficiently exchange information and synchronize their progress.

    Finally, extending MEOW's event-based scheduler to support network event triggers would enable the simple and efficient exchange of data between workflows running on different machines. This feature is particularly valuable in distributed computing environments, where data processing tasks are often split across multiple systems to maximize resource utilization and minimize latency.

    Integrating network event triggers into MEOW would provide an advantage specifically in the context of heterogeneous workflows, which incorporate a mix of different tasks running on diverse computing environments. By their nature, these workflows can involve tasks running on different systems, potentially even in different physical locations, which need to exchange data or coordinate their progress. Currently, MEOW's reliance on local file events as triggers can be a limiting factor in these scenarios. Network event triggers offer a powerful solution to this challenge. They can not only handle tasks running across different machines, but also dynamically adapt to the changing requirements of a heterogeneous workflow, such as triggering new tasks based on the results of remote computations. Thus, the addition of network event triggers is a significant step in enhancing MEOW's already robust handling of heterogeneous workflows, bolstering its utility in today's diverse and distributed computing landscape.

    \begin{figure}[H]
        \begin{center}
            \includegraphics[width=\textwidth]{src/heterogeneous.png}
        \end{center}
        \caption{An example of a heterogeneous workflow}
    \end{figure}

    \subsection{Background}
    \subsubsection{The structure of MEOW}

    The MEOW event-based scheduler consists of four main components: \textit{monitors}, \textit{handlers}, \textit{the conductor}, and \textit{the runner}.

    Monitors listen for triggering events. They are initialized with a number of \textit{patterns}, each of which describes a triggering event. When a pattern's triggering event occurs, the monitor signals to the conductor that the pattern has been triggered, so that the job associated with the pattern can be scheduled.

    \begin{figure}[H]
        \begin{center}
            \includegraphics[width=0.6\textwidth]{src/monitor.png}
        \end{center}
        \caption{The monitor's role in MEOW's event-based system.}
    \end{figure}

    \begin{tcolorbox}[colback=blue!30!white]
        I haven't used "Resources" to describe the job queue. Should I do that or should I rephrase the diagram to be more in line with the rest of the project?
    \end{tcolorbox}

    Handlers perform actions and jobs on behalf of the scheduler. They are initialized with a number of \textit{recipes}, which describe the action to be taken. The handler starts a job when signaled to do so by the conductor.

    The conductor handles the job queue. It is initialized with a number of rules, each of which is a pattern paired with a recipe. When a monitor reports a triggered pattern, the conductor checks its rules for that pattern. If one or more rules contain the pattern, the corresponding recipes are triggered in their handlers.

    Finally, the runner is the main program that orchestrates all these components. Each instance of the runner incorporates at least one instance of a monitor, handler, and conductor.
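
    The relationship between patterns, recipes, and rules can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the names \texttt{Rule}, \texttt{Conductor}, and \texttt{on\_pattern\_triggered} are hypothetical and are not taken from any MEOW implementation, patterns are reduced to plain strings, recipes to plain functions, and the handler is folded into the conductor for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    """A pattern paired with a recipe (both simplified for illustration)."""
    pattern: str                    # name of the triggering pattern
    recipe: Callable[[str], str]    # action to perform when the pattern triggers


@dataclass
class Conductor:
    """Toy conductor: checks its rules when a monitor reports a pattern."""
    rules: List[Rule]
    job_queue: List[str] = field(default_factory=list)

    def on_pattern_triggered(self, pattern: str, event: str) -> None:
        # Every rule that contains the triggered pattern has its recipe run.
        for rule in self.rules:
            if rule.pattern == pattern:
                self.job_queue.append(rule.recipe(event))


def run_demo() -> List[str]:
    conductor = Conductor(rules=[
        Rule("new_data", lambda event: f"analyse {event}"),
        Rule("new_data", lambda event: f"archive {event}"),
        Rule("cleanup", lambda event: f"delete {event}"),
    ])
    # A monitor would make this call when it observes a triggering event.
    conductor.on_pattern_triggered("new_data", "sample.csv")
    return conductor.job_queue
```

    In this sketch, a single triggered pattern queues two jobs because two rules contain it, while the unrelated rule is left untouched.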

    \begin{figure}[H]
        \begin{center}
            \begin{tikzpicture}
                \node[draw,rectangle,rounded corners,text width=8cm,align=center] at (0,2) (run) {Runner};
                \node[draw,rectangle,rounded corners] at (0,0) (con) {Conductor};
                \node[draw,rectangle,rounded corners] at (3,-2) (mon) {Monitor};
                \node[draw,rectangle,rounded corners] at (-3,-2) (han) {Handler};
            \end{tikzpicture}
        \end{center}
        \caption{\textbf{WIP.} How the elements of MEOW interact.}
    \end{figure}

    \subsubsection{The \texttt{meow\_base} codebase}

    \texttt{meow\_base}\autocite{MeowBase} is an implementation of MEOW written in Python. It is designed to be modular, using base classes for each element in order to ease the implementation of additional handlers, monitors, etc.

    \begin{tcolorbox}[colback=blue!30!white]
        How much should I include here?
    \end{tcolorbox}

    \begin{tcolorbox}[colback=lightgray!30!white]
        \begin{itemize}
            \item The runner (brief)
            \item Conductors (brief)
            \item Recipes and handlers (brief)
            \item File event monitor (Watchdog)
            \item Events (important to clarify how file events work since I refer to it in the method section)
            \item Testing (brief)
        \end{itemize}
    \end{tcolorbox}

    \subsubsection{The \texttt{socket} library}

    The \texttt{socket} library\autocite{SocketDoc}, included in the Python Standard Library, serves as an interface for the Berkeley sockets API. The Berkeley sockets API, originally developed for the Unix operating system, has become the standard for network communication across multiple platforms. It allows programs to create `sockets', which are endpoints in a network communication path, for the purpose of sending and receiving data.

    Many other Python libraries and modules for transferring data exist, some of which may be better suited to certain MEOW use-cases. The \texttt{ssl} library, in particular, allows for SSL/TLS-encrypted communication, which may be a requirement in workflows with sensitive data. However, implementing network triggers using the \texttt{socket} library provides MEOW with a fundamental implementation of network events, which can later be expanded or improved with other features.

    In my project, all sockets use the Transmission Control Protocol (TCP), which provides reliable, ordered data transfer by establishing a connection between the sender and receiver before any data is exchanged.

    I make use of the following socket methods, which have the same names and functions in the \texttt{socket} library and the Berkeley sockets API:

    \begin{itemize}
        \setlength{\itemsep}{-5pt}
        \item \texttt{bind()}: Associates the socket with a given local IP address and port. It also reserves the port locally.
        \item \texttt{listen()}: Puts the socket in a listening state, where it waits for a sender to request a TCP connection to the socket.
        \item \texttt{accept()}: Accepts an incoming TCP connection request, returning a new socket that represents the connection.
        \item \texttt{recv()}: Receives data from the given socket.
        \item \texttt{close()}: Closes a connection to a given socket.
    \end{itemize}
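
    The receiving-side methods listed above can be combined as in the following minimal sketch. It is not the monitor's actual code: the function names \texttt{open\_listener}, \texttt{receive\_one\_message}, and \texttt{demo\_receive} are hypothetical, the socket is bound to the loopback address only, and the buffer size of 4096 bytes is an arbitrary choice.

```python
import socket
import threading


def open_listener(port: int = 0) -> socket.socket:
    """Create a TCP socket that is bound and listening on localhost."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", port))   # port 0 lets the OS pick a free port
    listener.listen()
    return listener


def receive_one_message(listener: socket.socket) -> bytes:
    """Accept a single connection and read until the sender closes it."""
    conn, _address = listener.accept()   # blocks until a sender connects
    chunks = []
    while True:
        chunk = conn.recv(4096)          # returns b"" once the sender closes
        if not chunk:
            break
        chunks.append(chunk)
    conn.close()
    return b"".join(chunks)


def demo_receive() -> bytes:
    """Send a short message to a local listener and return what it received."""
    listener = open_listener()
    port = listener.getsockname()[1]
    received = []
    worker = threading.Thread(
        target=lambda: received.append(receive_one_message(listener))
    )
    worker.start()
    # socket.create_connection is used here only to drive the demo.
    with socket.create_connection(("127.0.0.1", port)) as sender:
        sender.sendall(b"hello monitor")
    worker.join()
    listener.close()
    return received[0]
```

    Note that \texttt{recv()} may return fewer bytes than were sent, which is why the sketch loops until it receives an empty result.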

    During testing of the monitor, the following methods are used to send data to the running monitor:

    \begin{itemize}
        \setlength{\itemsep}{-5pt}
        \item \texttt{connect()}: Sends a TCP connection request to a listening socket.
        \item \texttt{sendall()}: Sends data over the connected socket, continuing until all of the data has been sent or an error occurs.
    \end{itemize}
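
    As an illustration of the sending side, the sketch below connects to a listening socket and sends a payload. The helper names \texttt{send\_to\_port} and \texttt{demo\_send} are hypothetical and do not come from the project's tests; the throwaway listener inside \texttt{demo\_send} exists only so the example can run on its own.

```python
import socket
import threading


def send_to_port(host: str, port: int, payload: bytes) -> None:
    """Open a TCP connection, send the whole payload, then close."""
    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sender.connect((host, port))
        sender.sendall(payload)   # keeps sending until all bytes are sent
    finally:
        sender.close()            # closing signals that no more data follows


def demo_send() -> bytes:
    """Drive send_to_port against a throwaway local listener."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: the OS picks a free port
    listener.listen()
    port = listener.getsockname()[1]

    sender_thread = threading.Thread(
        target=send_to_port, args=("127.0.0.1", port, b"test data")
    )
    sender_thread.start()

    conn, _address = listener.accept()
    received = b""
    while True:
        chunk = conn.recv(1024)
        if not chunk:
            break
        received += chunk
    conn.close()
    sender_thread.join()
    listener.close()
    return received
```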

    \section{Method}

    To address the identified limitations of MEOW and to expand its capabilities, I will be incorporating network event triggers into the existing event-based scheduler, to supplement the current file-based event triggers. My method focuses on leveraging Python's \texttt{socket} library to enable the processing of network events. The following subsections detail the specific methodologies employed in expanding the codebase, the design of the network event trigger mechanism, and the integration of this mechanism into the existing MEOW system.

    \subsection{Design of the network event pattern}
    In the implementation of a pattern for network events, a key consideration was to integrate it seamlessly with the existing MEOW codebase. This required designing the pattern to behave similarly to the file event pattern when interacting with other elements of the scheduler. A central principle in this design was maintaining the loose coupling between patterns and recipes, minimizing direct dependencies between separate components. While this might not be possible for every theoretical recipe and pattern, designing for it could greatly improve future compatibility.

    Network event patterns are initialized with a triggering port, analogous to the triggering path used in file event patterns. This approach inherently limits the number of unique patterns to the number of ports that can be opened on the machine. Since TCP port numbers are 16-bit values, allowing up to 65,535 ports per address, this constraint is unlikely to present a practical issue. An alternative approach could have involved triggering patterns using a part of the sent message, essentially acting as a ``header''. This would complicate the process, however, since the monitor is otherwise designed to receive raw data. To keep the implementation as straightforward as possible and to leave room for future enhancements, I opted for the simpler port-based design.

    Once the network monitor is started, it opens sockets that start listening on each of the ports specified in the patterns it was initialized with. This is consistent with the behavior of the file event monitor, which monitors the triggering paths of the patterns it was initialized with.
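
    One way to listen on several ports at once is sketched below. This is an assumption-laden illustration, not the monitor's implementation: the text above does not specify how the monitor waits on multiple sockets, so the use of the standard \texttt{selectors} module is my choice for the example, and the names \texttt{open\_listeners}, \texttt{wait\_for\_triggered\_port}, and \texttt{demo\_ports} are hypothetical.

```python
import selectors
import socket
from typing import Dict, List, Optional


def open_listeners(ports: List[int]) -> Dict[int, socket.socket]:
    """Open one listening TCP socket per requested port."""
    listeners = {}
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", port))
        sock.listen()
        listeners[sock.getsockname()[1]] = sock   # key by the actual port
    return listeners


def wait_for_triggered_port(
    listeners: Dict[int, socket.socket], timeout: float = 5.0
) -> Optional[int]:
    """Return the port of the first listener with a pending connection."""
    selector = selectors.DefaultSelector()
    for port, sock in listeners.items():
        selector.register(sock, selectors.EVENT_READ, data=port)
    try:
        ready = selector.select(timeout)
        return ready[0][0].data if ready else None
    finally:
        selector.close()


def demo_ports() -> bool:
    """Connect to one of two listeners and check that its port is reported."""
    listeners = open_listeners([0, 0])        # two OS-assigned ports
    target_port = list(listeners)[1]
    with socket.create_connection(("127.0.0.1", target_port)):
        triggered = wait_for_triggered_port(listeners)
    for sock in listeners.values():
        sock.close()
    return triggered == target_port
```

    In this sketch the port on which a connection arrives identifies the triggered pattern, mirroring how a triggering path identifies a file event pattern.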

    \subsection{Integrating network events into the existing codebase}
    The data received by the network monitor is written to a temporary file, a design choice that serves two purposes.

    Firstly, this method is a practical solution for managing memory usage during data transfer, particularly for large data sets. By writing received data directly to a file, we bypass the need to store the entire file in memory at once, effectively addressing potential memory limitations.

    Secondly, this approach reuses the existing infrastructure built for file events. The newly written temporary file is passed as the ``triggering path'' of the event, mirroring the behavior of file events. Network events can therefore use recipes originally designed for file events without modification, preserving the principle of loose coupling. This integration maintains the overall flexibility and efficiency of MEOW while extending its capabilities to handle network events.
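
    The idea of streaming received data to a temporary file can be sketched as follows. This is a minimal illustration rather than the monitor's code: the function names \texttt{receive\_to\_tempfile} and \texttt{demo\_tempfile} and the chunk size are assumptions made for the example, and \texttt{socket.socketpair()} is used only to obtain two connected sockets without opening a port.

```python
import os
import socket
import tempfile


def receive_to_tempfile(conn: socket.socket, chunk_size: int = 1024) -> str:
    """Stream everything received on conn into a temporary file.

    Returns the file's path, which could then serve as the event's
    triggering path.
    """
    # delete=False keeps the file on disk after it is closed.
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        while True:
            chunk = conn.recv(chunk_size)   # only one chunk is held in memory
            if not chunk:                   # b"" means the sender has closed
                break
            tmp.write(chunk)
        return tmp.name


def demo_tempfile() -> bytes:
    """Send data through a socket pair and read it back from the temp file."""
    sender, receiver = socket.socketpair()
    sender.sendall(b"x" * 5000)
    sender.close()
    path = receive_to_tempfile(receiver)
    receiver.close()
    with open(path, "rb") as handle:
        data = handle.read()
    os.remove(path)                         # clean up the temporary file
    return data
```

    Because the data is written chunk by chunk, memory usage stays bounded by the chunk size regardless of how much data is sent.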

    \subsection{Testing}

    \section{Results}
    \begin{tcolorbox}[colback=lightgray!30!white]
    Does it work? How well?
    \end{tcolorbox}

    % \subsection{Testing}

    \subsection{Discussion}
    \begin{tcolorbox}[colback=lightgray!30!white]
    With the hindsight of the results, what could I have done better?
    \end{tcolorbox}

    \subsection{Future Work}
    \begin{tcolorbox}[colback=lightgray!30!white]
    What should someone do if they want to fix my mistakes, or expand on them further.
    \begin{itemize}
        \item Implementation of the other options mentioned when discussing the socket library.
        \item Triggering on a header item in addition to port
    \end{itemize}
    \end{tcolorbox}

    \begin{tcolorbox}[colback=lightgray!30!white]
        Give context to following paragraph.
    \end{tcolorbox}
    One specific example of a use-case where network event triggers could prove useful is the workflow for The Brain Imaging Data Structure (BIDS). The BIDS workflow requires data to be sent between multiple machines and validated by a user. Network event triggers could streamline this process by automatically initiating data transfer tasks when specific conditions are met, thereby reducing the need for manual management. Additionally, network triggers could facilitate user validation by allowing users to manually prompt the continuation of the workflow through specific network requests, simplifying the user's role in the validation process.

    \begin{figure}[H]
        \begin{center}
            \includegraphics[width=0.6\textwidth]{src/BIDS.png}
        \end{center}
        \caption{\textbf{Temp.} The structure of the BIDS workflow. Data is transferred to the user and to the cloud.}
    \end{figure}

    \section{Conclusion}
    \begin{tcolorbox}[colback=lightgray!30!white]
    Did I succeed in what I wanted to do?
    \end{tcolorbox}

    \newpage
    \appendix
    \printbibliography{}
\end{document}