\section{Introduction}

\textit{Scientific Workflow Management Systems} (SWMSs) are an essential tool for automating, managing, and executing complex scientific processes involving large volumes of data and computational tasks\footnote{citation?}. Jobs in SWMS workflows are typically defined as the nodes in a Directed Acyclic Graph (DAG), where the edges define the dependencies of each job.

\begin{tcolorbox}[colback=lightgray!30!white]
Expand on DAGs' inability to adapt. Plagiarize David's thesis.
\end{tcolorbox}

\begin{figure}[H]
\begin{center}
\begin{tikzpicture}[
arrow/.style={-Triangle, thick,shorten >=4pt}
]
\node[draw,circle] at (0,0) (j1) {Job 1};
\node[draw,circle] at (3,2) (j2) {Job 2};
\node[draw,circle] at (3,0) (j3) {Job 3};
\node[draw,circle] at (3,-2) (j4) {Job 4};
\node[draw,circle] at (6,1) (j5) {Job 5};
\node[draw,circle] at (9,-0.5) (j6) {Job 6};

\draw[arrow] (j1) -- (j2);
\draw[arrow] (j1) -- (j3);
\draw[arrow] (j1) -- (j4);
\draw[arrow] (j2) -- (j5);
\draw[arrow] (j3) -- (j5);
\draw[arrow] (j4) -- (j6);
\draw[arrow] (j5) -- (j6);
\end{tikzpicture}
\caption{A workflow defined as a DAG. Jobs 2, 3, and 4 are dependent on the completion of Job 1, etc.}
\end{center}
\end{figure}

While this method is suitable for many applications, it may not always be the best solution. Processing jobs in a set order can lead to inefficiencies in cases where the processing needs to adapt based on the results of earlier jobs, human interaction, or changing circumstances. In these contexts, the DAG method can fall short due to its inherently static nature.

\begin{tcolorbox}[colback=lightgray!30!white]
\begin{itemize}
\item Expand on what "efficient" is
\item What work am I doing on MEOW?
\item How did it go?
\item Introduce the concept of network events.
\item \textbf{Write this last}
\end{itemize}
\end{tcolorbox}

In such scenarios, a \textit{dynamic scheduler} can offer a more effective approach. Unlike traditional DAG-based systems, dynamic schedulers are designed to adapt to changing conditions, providing a more flexible method for managing complex workflows. One such dynamic scheduler is \textit{Managing Event Oriented Workflows}\autocite{DavidMEOW} (MEOW).

MEOW employs an event-based scheduler, in which jobs are executed independently, based on certain \textit{triggers}. Triggers can in theory be anything, but are currently limited to file events on local storage. By dynamically adapting the execution order based on the outcomes of previous tasks or external factors, MEOW provides a more flexible solution for processing large volumes of experimental data, with minimal human validation and interaction\footnote{citation?}.

\begin{figure}[H]
\begin{center}
\begin{tikzpicture}[
arrow/.style={-Triangle, thick,shorten >=4pt}
]
\node[draw,rectangle] at (0,0) (t1) {Trigger 1};
\node[draw,rectangle] at (0,-1.5) (t2) {Trigger 2};
\node[draw,rectangle] at (0,-3) (t3) {Trigger 3};
\node[draw,rectangle] at (0,-4.5) (t4) {Trigger 4};

\node[draw,circle] at (6,0) (j1) {Job 1};
\node[draw,circle] at (6,-1.5) (j2) {Job 2};
\node[draw,circle] at (6,-3) (j3) {Job 3};
\node[draw,circle] at (6,-4.5) (j4) {Job 4};

\draw[arrow] (t1) -- (j1);
\draw[arrow] (t2) -- (j2);
\draw[arrow] (t3) -- (j3);
\draw[arrow] (t4) -- (j4);
\end{tikzpicture}
\caption{A workflow using an event-based system. Job 1 is dependent on Trigger 1, etc.}
\end{center}
\end{figure}

In this project, I introduce triggers for network events into MEOW. This enables a running scheduler to react to and act on data transferred over a network connection. By incorporating this feature, the capability of MEOW is significantly extended, facilitating the management of not just local file-based workflows, but also complex, distributed workflows involving communication between multiple systems over a network.

In this report, I will walk through the design and implementation of this feature, detailing the challenges encountered and how they were overcome.

\subsection{Problem}

In its current implementation, MEOW is able to trigger jobs based on changes to monitored local files. This covers a range of scenarios where the data processing workflow involves the creation, modification, or removal of files. By monitoring file events, MEOW's event-based scheduler can dynamically execute tasks as soon as the required conditions are met, ensuring efficient and timely processing of the data. Since the file monitor is triggered by changes to local files, MEOW is limited to local workflows.

While file events work well as a trigger on their own, there are several scenarios where a different trigger would be preferred or even required, especially when dealing with distributed systems or remote operations. To address these shortcomings and further enhance MEOW's capabilities, the integration of network event triggers would provide significant benefits in several key use-cases.

Firstly, network event triggers would enable the initiation of jobs remotely through the transmission of a triggering message to the monitor, thereby eliminating the necessity for direct access to the monitored files. This is particularly useful in human-in-the-loop scenarios, where human intervention or decision-making is required before proceeding with the subsequent steps in a workflow. While it is possible to manually trigger jobs using file events by making changes to the monitored directories, this might lead to an already running job accessing the files at the same time, which could cause problems with data integrity.

Secondly, incorporating network event triggers would facilitate seamless communication between parallel workflows, ensuring that tasks can efficiently exchange information and updates on their progress, allowing for a better perspective on the combined workflow, greatly improving visibility and control.

Finally, extending MEOW's event-based scheduler to support network event triggers would enable the simple and efficient exchange of data between workflows running on different machines. This feature is particularly valuable in distributed computing environments, where data processing tasks are often split across multiple systems to maximize resource utilization and minimize latency.

Monitors listen for triggering events. They are initialized with a number of \textit{rules}, which each include a \textit{pattern} and \textit{recipe}. \textit{Patterns} describe the triggering event. For file events, the patterns describe a path that should trigger the event when changed. \textit{Recipes} describe the specific action that should be taken when the rule is triggered. When a pattern's triggering event occurs, the monitor sends an event, which contains the rule and the specifics of the event, to the event queue.
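The pattern/recipe/rule relationship described above can be sketched as follows. The class and field names here are illustrative stand-ins for the structure, not MEOW's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch of the rule structure described above.
# Class and field names are hypothetical, not MEOW's real classes.
@dataclass
class Pattern:
    triggering_path: str      # path whose changes fire the rule

@dataclass
class Recipe:
    script: str               # action to take when triggered

@dataclass
class Rule:
    name: str
    pattern: Pattern
    recipe: Recipe

rule = Rule("rule_a", Pattern("data/*.csv"), Recipe("analysis.py"))
print(rule.pattern.triggering_path)  # data/*.csv
```

When the pattern's triggering event occurs, it is the whole rule (and thus the recipe) that travels with the event to the event queue.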
Handlers manage the event queue. They unpack and analyze the events in the queue. If an event is valid, they create a directory containing the script defined by the recipe. The location of the directory is then sent to the runner, to be added to the job queue.

Conductors manage the job queue. They execute the jobs in the locations specified by the handlers.

Finally, the runner is the main program that orchestrates all these components. Each instance of the runner incorporates at least one instance of a monitor, handler, and conductor, and it holds the event and job queues.

\draw[arrow] (mon) -- (eq) node[pos=0.5,above left=-10pt,text width=2cm, align=center] {Schedules events};
\draw[arrow] (eq) -- (han) node[pos=0.8,below left=-20pt,text width=2cm, align=center] {Pulls events};
\draw[arrow] (han) -- (jq) node[pos=0.2,right,text width=2cm, align=center] {Schedules jobs};
\draw[arrow] (jq) -- (con) node[pos=0.5,above right=-10pt,text width=2cm, align=center] {Pulls jobs};
\end{tikzpicture}
\end{center}
The relevant parts of the implementation are:
\begin{itemize}
\setlength{\itemsep}{0pt}
\item \textbf{Events} are Python dictionaries, containing the following items:\begin{itemize}[topsep=-10pt]
\setlength{\itemsep}{-5pt}
\item \texttt{EVENT\_PATH}: The path of the triggering file.
\item \texttt{EVENT\_TYPE}: The type of event. File events have the type \texttt{"watchdog"}, since the files are monitored using the \texttt{watchdog} Python module.
\item \texttt{EVENT\_RULE}: The rule that triggered the event, which contains the recipe that the handler will turn into a job.
\item \texttt{EVENT\_TIME}: The timestamp of the triggering event.
\item Any extra data supplied by the monitor. File events are by default initialized with the base directory of the event and a hash of the event's triggering path.
\end{itemize}
\item \textbf{Event patterns} inherit from the \texttt{BasePattern} class. An instance of an event pattern class describes a specific trigger a monitor should be looking for.
\item \textbf{Monitors} inherit from the \texttt{BaseMonitor} class. They listen for set triggers (defined by the given event patterns), and create events when those triggers occur. The file event monitor uses the \texttt{Watchdog} module to monitor given directories for changes. The Watchdog monitor is initialized with an instance of the \texttt{WatchdogEventHandler} class to handle the watchdog events. When the Watchdog monitor is triggered by a file event, the \texttt{handle\_event} method is called on the event handler, which in turn creates an \texttt{event} based on the specifics of the triggering event. The event is then sent to the runner to be put in the event queue.
\item \textbf{The runner} is implemented as the class \texttt{MeowRunner}. When initialized with at least one instance each of a monitor, handler, and conductor, it validates them. When started, it starts all the monitors, handlers, and conductors it was initialized with. It also creates \texttt{pipes} for the communication between each element and the runner.
\item \textbf{Recipes} inherit from the \texttt{BaseRecipe} class. They serve primarily as a repository for the specific details of a given recipe. This typically includes identifying the particular script to be executed, but they also contain validation checks of these instructions. The data and procedures contained in a recipe collectively describe the distinct actions to be taken when a corresponding job is executed.
\item \textbf{Handlers} inherit from the \texttt{BaseHandler} class. Each handler class targets a specific type of job, like the execution of bash scripts. When started, a handler enters an infinite loop, in which it repeatedly asks the runner for a valid event in the event queue, creates a job from the event's recipe, and sends it to the runner to be put in the job queue.
\item \textbf{Conductors} inherit from the \texttt{BaseConductor} class. Each conductor class targets a specific type of job, like the execution of bash scripts. When started, a conductor enters an infinite loop, in which it repeatedly asks the runner for a valid job in the job queue and then attempts to execute it.
\end{itemize}
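As a rough sketch, an event dictionary of the kind listed above might be assembled like this. The constant values and the helper name are assumptions for illustration, not MEOW's actual definitions:

```python
import time

# Hypothetical key constants mirroring the items listed above;
# the actual string values used by MEOW may differ.
EVENT_PATH = "event_path"
EVENT_TYPE = "event_type"
EVENT_RULE = "event_rule"
EVENT_TIME = "event_time"

def make_watchdog_event(path, rule, base_dir, file_hash):
    """Illustrative helper: build a file-event dictionary."""
    return {
        EVENT_PATH: path,         # the triggering file
        EVENT_TYPE: "watchdog",   # file events use the watchdog module
        EVENT_RULE: rule,         # carries the recipe for the handler
        EVENT_TIME: time.time(),  # timestamp of the trigger
        # extra data supplied by the file monitor:
        "monitor_base": base_dir,
        "file_hash": file_hash,
    }

event = make_watchdog_event("data/input.txt", "rule_a", "data", "ab12cd")
print(event[EVENT_TYPE])  # watchdog
```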
\subsubsection{The \texttt{socket} library}

The \texttt{socket} library\autocite{SocketDoc}, included in the Python Standard Library, serves as an interface for the Berkeley sockets API. The Berkeley sockets API, originally developed for the Unix operating system, has become the standard for network communication across multiple platforms. It allows programs to create 'sockets', which are endpoints in a network communication path, for the purpose of sending and receiving data.

Many other libraries and modules focusing on transferring data exist for Python, some of which may be better suited to certain MEOW use-cases. The \texttt{ssl} library, for example, allows for SSL-encrypted communication, which may be a requirement in workflows with sensitive data. However, implementing network triggers using exclusively the \texttt{socket} library will provide MEOW with a fundamental implementation of network events, which can later be expanded or improved with other features (see section \textit{4.2.2}).

In my project, all sockets use the Transmission Control Protocol (TCP), which ensures reliable data transfer by establishing a stable connection between the sender and receiver.

I make use of the following socket methods, which have the same names and functions in the \texttt{socket} library and the Berkeley sockets API:

\begin{itemize}
\setlength{\itemsep}{0pt}
\item \texttt{bind()}: Associates the socket with a given local IP address and port. It also reserves the port locally.
\item \texttt{listen()}: Puts the socket in a listening state, where it waits for a sender to request a TCP connection to the socket.
\item \texttt{accept()}: Accepts the incoming TCP connection request, creating a connection.
\end{itemize}

During testing of the monitor, the following methods are used to send data to the running monitor:

\begin{itemize}
\setlength{\itemsep}{0pt}
\item \texttt{connect()}: Sends a TCP connection request to a listening socket.
\item \texttt{sendall()}: Sends data to a socket.
\end{itemize}
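The listed methods combine into the following minimal send/receive round trip. This is a generic sketch of the TCP pattern using only the Python standard library, not code from the MEOW monitor; port 0 asks the OS to assign any free port:

```python
import socket
import threading

# Receiver: bind a socket, listen, and accept one connection.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # reserve a local address/port
server.listen()                      # wait for connection requests
port = server.getsockname()[1]       # the port the OS assigned

received = []

def listen_once():
    conn, _ = server.accept()        # accept the incoming TCP connection
    with conn:
        while True:
            chunk = conn.recv(2048)  # read in chunks until sender closes
            if not chunk:
                break
            received.append(chunk)

t = threading.Thread(target=listen_once)
t.start()

# Sender: connect to the listening socket and transmit data.
sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sender.connect(("127.0.0.1", port))  # request a TCP connection
sender.sendall(b"trigger payload")   # send data to the listener
sender.close()

t.join()
server.close()
print(b"".join(received))
```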
The \texttt{NetworkEventPattern} class is initialized with a triggering port, analogous to the triggering path used in file event patterns. This approach inherently limits the number of unique patterns to the number of ports that can be opened on the machine. However, given the large number of potential ports, this constraint is unlikely to present a practical issue. An alternative approach could have involved triggering patterns using a part of the sent message, essentially acting as a "header". However, this would complicate the process since the monitor is otherwise designed to receive raw data. To keep the implementation as straightforward as possible and to allow for future enhancements, I opted for simplicity and broad utility over complexity in this initial design.

When the \texttt{NetworkMonitor} instance is started, it starts a number of \texttt{Listener} instances, equal to the number of ports specified in its patterns. The list of patterns is pulled when starting the monitor, so patterns added at runtime are included. Patterns not associated with a rule are not considered, since they will not result in an event. Only one listener is started per port, so patterns with the same port use the same listener. When matching an event with a rule, all rules are considered, so if multiple rules use the same triggering port, they will all be triggered.

The listeners each open a socket bound to their respective port. This is consistent with the behavior of the file event monitor, which monitors the triggering paths of the patterns it was initialized with.
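The listener bookkeeping described above amounts to deduplicating ports over rule-backed patterns. A minimal sketch, with hypothetical data structures standing in for MEOW's internals:

```python
# Sketch: decide which ports need a Listener. Only patterns that
# are part of a rule count, and each port gets exactly one listener
# even when several patterns (and rules) share it. The dict shapes
# are illustrative, not MEOW's actual data structures.
def ports_to_listen(patterns, rules):
    # patterns: {pattern_name: port}; rules: {rule_name: pattern_name}
    rule_backed = set(rules.values())                  # patterns with a rule
    return sorted({patterns[p] for p in rule_backed})  # one listener per port

patterns = {"p1": 8080, "p2": 8080, "p3": 9000, "orphan": 9999}
rules = {"r1": "p1", "r2": "p2", "r3": "p3"}
print(ports_to_listen(patterns, rules))  # [8080, 9000]
```

Note that the pattern without an associated rule ("orphan") produces no listener, matching the behavior described above.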
\subsection{Integrating network events into the existing codebase}

The data received by the network monitor is written as a stream to a temporary file, in chunks of 2048 bytes. The temporary files are created using the built-in \texttt{tempfile} library, and are placed in the OS's default directory for temporary files. The library is used to accommodate different operating systems, as well as to ensure the files have unique names. When the monitor is stopped, all generated temporary files are removed.

This design choice serves three purposes:

Firstly, this method is a practical solution for managing memory usage during data transfer, particularly for large data sets. By writing received data directly to a file 2048 bytes at a time, we bypass the need to store the entire file in memory at once, effectively addressing potential memory limitations.

Secondly, the method allows the monitor to receive multiple files simultaneously, since receiving each file is done by a separate thread. This means that a single large file will not ``block up'' the network port for too long.

The method will be slower, since writing to storage takes longer than keeping the data in memory, but I have decided that the positives outweigh the negatives.
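The chunked streaming described above can be sketched with the \texttt{tempfile} library as follows. An in-memory buffer stands in for the network connection, and the function name is illustrative rather than MEOW's actual code:

```python
import io
import os
import tempfile

CHUNK_SIZE = 2048  # same chunk size as described above

def stream_to_tempfile(source, chunk_size=CHUNK_SIZE):
    """Write incoming data to a uniquely named temporary file,
    one chunk at a time, so the full payload never sits in memory."""
    fd, path = tempfile.mkstemp()  # unique name in the OS temp directory
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    return path

# An in-memory buffer stands in for a network connection here.
payload = b"x" * 5000  # larger than one chunk
path = stream_to_tempfile(io.BytesIO(payload))
size = os.path.getsize(path)
os.remove(path)        # the monitor removes its temp files on stop
print(size)  # 5000
```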
\subsection{Testing}

The unit tests for the network event monitor were inspired by the already existing tests for the file event monitor. Since the aim of the monitor was to emulate the behavior of the file event monitor as closely as possible, reusing the existing tests with minimal changes proved an effective way of staying close to that goal. The tests verify the following behavior:

\begin{itemize}
\setlength{\itemsep}{0pt}
\item Network event patterns can be initialized, and raise exceptions when given invalid parameters.
\item Network events can be created, and they contain the expected information.
\item Network monitors can be created.
\item A network monitor is able to receive data sent to a listener, write it to a file, and create a valid event.
\item The patterns and recipes associated with the monitor can be accessed, added, updated, and removed at runtime.
\item When adding, updating, or removing patterns or recipes at runtime, rules associated with those patterns or recipes are updated accordingly.
\item The monitor only initializes listeners for patterns with associated rules, and rules updated at runtime are applied.
\end{itemize}
\section{Results}

The testing suite designed for the monitor comprises 26 distinct tests, all of which pass. These tests were designed to assess the robustness, reliability, and functionality of the monitor. They evaluated the monitor's ability to successfully manage network event patterns, detect network events, and communicate with the runner to send events to the event queue.