@@ -9,6 +9,7 @@
\usepackage{float}
\usepackage{fontspec}
\usepackage{enumitem}
\usepackage{array}

\usetikzlibrary{arrows.meta, positioning, calc, quotes}

@@ -26,6 +27,8 @@

\setlength{\parskip}{5pt}

\newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}

\begin{document}
\section{Abstract}
\begin{tcolorbox}[colback=lightgray!30!white]
@@ -34,7 +37,7 @@

\section{Introduction}

\textit{Scientific Workflow Management Systems} (SWMSs) are an essential tool for automating, managing, and executing complex scientific processes involving large volumes of data and computational tasks\footnote{citation?}. Jobs in a SWMS workflows are typically defined as the nodes in a Directed Acyclic Graph (DAG), where the edges define the dependencies of each job.
\textit{Scientific Workflow Management Systems} (SWMSs) are an essential tool for automating, managing, and executing complex scientific processes involving large volumes of data and computational tasks. Jobs in SWMS workflows are typically defined as the nodes of a Directed Acyclic Graph (DAG), where the edges define the dependencies of each job.

\begin{figure}[H]
\begin{center}
@@ -64,7 +67,7 @@

In such scenarios, using a \textit{dynamic scheduler} can offer a more effective approach. Unlike traditional DAG-based systems, dynamic schedulers adapt to changing conditions at runtime, providing a more flexible method for managing complex workflows. One such dynamic scheduler is \textit{Managing Event Oriented Workflows} (MEOW)\autocite{DavidMEOW}.

MEOW employs an event-based scheduler, in which jobs are executed independently, based on certain \textit{triggers}. Triggers can in theory be anything, but are currently limited to file events on local storage. By dynamically adapting the execution order based on the outcomes of previous tasks or external factors, MEOW provides a more flexible solution for processing large volumes of experimental data, with minimal human validation and interaction.\footnote{citation?}.
MEOW employs an event-based scheduler, in which jobs are executed independently, based on certain \textit{triggers}. Triggers can in theory be anything, but are currently limited to file events on local storage. By dynamically adapting the execution order based on the outcomes of previous tasks or external factors, MEOW provides a more flexible solution for processing large volumes of experimental data, with minimal human validation and interaction\autocite{DavidMEOWpaper}.

\begin{figure}[H]
\begin{center}
@@ -207,7 +210,7 @@

The \texttt{socket} library\autocite{SocketDoc}, included in the Python Standard Library, serves as an interface to the Berkeley sockets API. The Berkeley sockets API, originally developed for the Unix operating system, has become the standard for network communication across multiple platforms. It allows programs to create `sockets', which are endpoints in a network communication path, used to send and receive data.

Many other libraries and modules focusing on transferring data exist for Python, some of which may be better in certain MEOW use-cases. The \texttt{ssl} library, for example, allows for ssl-encrypted communication, which may be a requirement in workflows with sensitive data. However, implementing network triggers using exclusively the \texttt{socket} library will provide MEOW with a fundamental implementation of network events, which can later be expanded or improved with other features (see section \textit{4.2.2}).
Many other libraries and modules focusing on transferring data exist for Python, some of which may be better suited to certain MEOW use-cases. The \texttt{ssl} library, for example, allows for SSL-encrypted communication, which may be a requirement in workflows with sensitive data. However, implementing network triggers using exclusively the \texttt{socket} library will provide MEOW with a fundamental implementation of network events, which can later be expanded or improved with other features (see section \textit{\ref{Additional Monitors}}).

In my project, all sockets use the Transmission Control Protocol (TCP), which ensures reliable data transfer by establishing a stable connection between the sender and receiver.

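To illustrate the level of abstraction the \texttt{socket} library provides, the sketch below shows a minimal TCP listener together with a matching sender. The host, port, and buffer size are illustrative placeholders rather than values taken from the monitor implementation.

\begin{verbatim}
import socket

HOST, PORT, BUFFER_SIZE = "127.0.0.1", 8080, 2048

# Listener: accept a single connection and read everything the peer sends.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind((HOST, PORT))
    listener.listen()
    connection, address = listener.accept()
    with connection:
        received = b""
        while True:
            chunk = connection.recv(BUFFER_SIZE)
            if not chunk:      # an empty read means the sender has closed
                break
            received += chunk

# Sender side (run from another process or machine):
#     with socket.create_connection((HOST, PORT)) as sender:
#         sender.sendall(b"some experiment data")
\end{verbatim}

Conceptually, each listener in the network monitor runs a receive loop of this kind before the received data is handed on as an event.
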
@@ -256,6 +259,9 @@

This method will be slower, since writing to storage takes longer than keeping the data in memory, but I have decided that the benefits outweigh the drawbacks.
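
As a rough illustration of this decision (a sketch only; the directory handling and file naming are hypothetical, not the monitor's actual scheme), the received bytes can be written to a uniquely named file as soon as a transfer completes:

\begin{verbatim}
import os
import tempfile

def write_received_data(data: bytes, target_dir: str) -> str:
    # Persist the received payload to storage instead of holding it in
    # memory; the returned path can then be referenced by the queued event.
    os.makedirs(target_dir, exist_ok=True)
    descriptor, path = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(descriptor, "wb") as handle:
        handle.write(data)
    return path
\end{verbatim}
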

\subsubsection{Data Type Agnosticism}
An important aspect of the network monitor's operation is its data type agnosticism: the monitor does not impose restrictions or perform checks on the type of incoming data. While this approach enhances the speed and simplicity of the implementation, it also places a certain level of responsibility on the recipes that work with the incoming data. The recipes, which define the actions taken when a job is executed, must be designed with a full understanding of this versatility. They should incorporate the necessary checks and handle potential inconsistencies or anomalies that might arise from diverse types of incoming data.

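A recipe consuming network-triggered data might therefore begin with an explicit validation step. The helper below is a hypothetical example, not part of MEOW: it expects the payload to be a JSON object and rejects anything that does not parse as one.

\begin{verbatim}
import json

def load_expected_json(path: str) -> dict:
    # The monitor performs no type checks, so the recipe must verify that
    # the incoming data is what it expects before processing it further.
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        parsed = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError) as error:
        raise ValueError(f"Unexpected payload: {error}") from error
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object")
    return parsed
\end{verbatim}
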
\subsection{Testing}
The unit tests for the network event monitor were inspired by the existing tests for the file event monitor. Since the aim was to emulate the behavior of the file event monitor as closely as possible, reusing those tests with minimal changes proved an effective way of staying close to that goal. The tests verify the following behavior:

@@ -273,10 +279,81 @@
\section{Results}
The testing suite for the monitor consists of 26 distinct tests, all of which passed. The tests assess the robustness, reliability, and functionality of the monitor, evaluating its ability to manage network event patterns, detect network events, and communicate with the runner to send events to the event queue.

\subsection{Discussion}
\begin{tcolorbox}[colback=lightgray!30!white]
Looking back at the results, what could I have done better?
\end{tcolorbox}

\subsection{Performance Tests}
To assess the performance of the Network Monitor, I implemented a number of performance tests. The tests were run on the following machines:

\begin{table}[H]
\centering
\begin{tabular}{|c||c|c|c|c|}\hline
\textbf{Identifier} & \textbf{CPU} & \textbf{Cores} & \textbf{Clock speed} & \textbf{Memory} \\ \hline
Laptop & Intel i5-8250U & 4 & 1.6GHz & 8GB \\ \hline
Desktop & & & & \\ \hline
\end{tabular}
\end{table}

\subsubsection{Single Listener}
To assess how a single listener handles many events at once, I subjected a single listener in the monitor to a varying number of events, ranging from 1 to 1,000. For each event count $n$, I sent $n$ network events to the monitor and recorded the response time. To ensure the reliability of the results and to mitigate the effect of outliers, each test was repeated 50 times.

Given the inherent variability in network communication and event handling, I noted considerable differences between the highest and lowest recorded times for each test. To provide a comprehensive view of the monitor's performance, I have therefore included not only the average response times, but also the minimum and maximum times observed for each set of 50 tests.

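The harness below sketches this procedure under simplifying assumptions: \texttt{send\_event} is a placeholder helper that opens one TCP connection per event, the listener address is illustrative, and timing stops once all events have been sent rather than once the monitor has finished registering them.

\begin{verbatim}
import socket
import time

def send_event(address, payload=b"x"):
    # Placeholder for the test helper: one TCP connection per event.
    with socket.create_connection(address) as sender:
        sender.sendall(payload)

def time_events(address, event_count, repetitions=50):
    # Send event_count events to a single listener, record the elapsed
    # time, and repeat to obtain minimum, maximum, and average times.
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        for _ in range(event_count):
            send_event(address)
        times.append(time.perf_counter() - start)
    return min(times), max(times), sum(times) / len(times)
\end{verbatim}
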
\begin{table}[H]
\centering
\begin{tabular}{|p{1.1cm}||P{1.5cm}|P{1.8cm}||P{1.5cm}|P{1.8cm}||P{1.5cm}|P{1.8cm}|}
\hline
\textbf{Event} & \multicolumn{2}{c||}{\textbf{Minimum time}} & \multicolumn{2}{c||}{\textbf{Maximum time}} & \multicolumn{2}{c|}{\textbf{Average time}} \\
\textbf{count} & Total & Per event & Total & Per event & Total & Per event \\ \hline\hline
\multicolumn{7}{|c|}{\textbf{Laptop}} \\ \hline
1 & 0.68ms & 0.68ms & 5.3ms & 5.3ms & 2.1ms & 2.1ms \\\hline
10 & 4.7ms & 0.47ms & 2.1s & 0.21s & 0.18s & 18ms \\\hline
100 & 45ms & 0.45ms & 7.2s & 72ms & 0.86s & 8.6ms \\\hline
1,000 & 0.63s & 0.63ms & 17s & 17ms & 5.6s & 5.6ms \\\hline\hline
\multicolumn{7}{|c|}{\textbf{Desktop}} \\ \hline
1 & & & & & & \\\hline
10 & & & & & & \\\hline
100 & & & & & & \\\hline
1,000 & & & & & & \\\hline
\end{tabular}
\caption{Results of the Single Listener performance tests, given to 2 significant digits.}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{src/performance_results/laptop_single_listener.png}
\caption{The Single Listener results for the laptop, plotted on logarithmic axes.}
\end{figure}

Upon examination of the results, a pattern emerges. The minimum recorded response times consistently averaged around 0.5ms per event, regardless of the number of events sent. This time likely reflects an ideal scenario where events are registered seamlessly, without any delays or issues in the pipeline, showcasing the efficiency potential of network event triggers in the MEOW system.

In contrast, the maximum and average response times exhibited more variability. This fluctuation may be attributed to factors such as network latency, the internal processing load of the system, and the inherent unpredictability of concurrent event handling.

\subsubsection{Multiple Listeners}
The next performance test investigates how the introduction of multiple listeners affects the overall processing time, specifically how distributing events across several concurrently operating listeners impacts the speed at which they are processed.

In this test, the total of 1,000 events is kept constant, but the events are distributed evenly across varying numbers of listeners: 1, 10, 100, and 1,000. By keeping the total number of events constant while altering the number of listeners, I aim to isolate the effect of multiple listeners on system performance.

A key aim of this test is to observe whether, and by how much, the overall processing time increases as the number of listeners grows. This gives insight into whether operating more listeners concurrently introduces additional overhead, thereby slowing down the process. The results can then inform decisions about the optimal number of listeners in different usage scenarios, potentially leading to performance improvements in MEOW's handling of network events.

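The distribution of events across listeners can be sketched as follows; the listener addresses are placeholders, and each listener is served by its own sender thread so that the events arrive concurrently.

\begin{verbatim}
import socket
import threading
import time

def send_share(address, events_per_listener):
    # Send this listener's share of the events, one connection per event.
    for _ in range(events_per_listener):
        with socket.create_connection(address) as sender:
            sender.sendall(b"x")

def time_distributed_events(addresses, total_events=1000):
    # Split the events evenly across the listeners, send to all of them
    # concurrently, and time the whole batch.
    share = total_events // len(addresses)
    workers = [threading.Thread(target=send_share, args=(address, share))
               for address in addresses]
    start = time.perf_counter()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return time.perf_counter() - start
\end{verbatim}
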
\begin{table}[H]
\centering
\begin{tabular}{|p{1.5cm}||P{2.5cm}|P{2.5cm}|P{2.5cm}|}
\hline
\textbf{Listeners} & \textbf{Minimum time} & \textbf{Maximum time} & \textbf{Average time} \\ \hline
\multicolumn{4}{|c|}{\textbf{Laptop}} \\ \hline
1 & 0.63s & 17s & 5.6s \\\hline
10 & 0.46s & 25s & 7.6s \\\hline
100 & 0.42s & 20s & 7.1s \\\hline
1,000 & & & \\\hline
\multicolumn{4}{|c|}{\textbf{Desktop}} \\ \hline
1 & & & \\\hline
10 & & & \\\hline
100 & & & \\\hline
1,000 & & & \\\hline
\end{tabular}
\caption{Results of the Multiple Listeners performance tests, given to 2 significant digits.}
\end{table}

% \subsection{Discussion}

\subsection{Future Work}
\subsubsection{Use-cases for Network Events}
@@ -291,7 +368,7 @@
\caption{The structure of the BIDS workflow. Data is transferred to the user and to the cloud.}
\end{figure}

\subsubsection{Additional Monitors}
\subsubsection{Additional Monitors}\label{Additional Monitors}
The successful development and implementation of the network event monitor for MEOW sets a precedent for the creation of additional monitors in the future. Its framework can be used as a blueprint for developing new monitors tailored to specific demands, protocols, or security requirements.

For instance, security might play a crucial role in the processing and transfer of sensitive data across various workflows. The network event monitor developed in this project, which uses the Python \texttt{socket} library, might not satisfy the security requirements of all workflows, especially those handling sensitive data. In such cases, developing a monitor that leverages the \texttt{ssl} library could provide a solution, enabling encrypted communication and thus improving the security of data transfer. The architecture of the network event monitor can guide the development of an \texttt{ssl} monitor, taking advantage of the similarities between the \texttt{socket} and \texttt{ssl} libraries.
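
As a rough sketch of the direction such an \texttt{ssl} monitor could take (the certificate paths are placeholders, and this is not an implementation of the monitor itself), a plain listening socket can be wrapped in a TLS context from the standard library:

\begin{verbatim}
import socket
import ssl

# Server-side TLS context; certificate and key paths are placeholders.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="monitor_cert.pem", keyfile="monitor_key.pem")

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 8443))
    listener.listen()
    with context.wrap_socket(listener, server_side=True) as tls_listener:
        connection, address = tls_listener.accept()
        with connection:
            data = connection.recv(2048)  # encrypted on the wire
\end{verbatim}
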
src/make_graphs.py (new file, 49 lines)
@@ -0,0 +1,49 @@
import matplotlib.pyplot as plt


def laptop_single_listener():
    # Single Listener results on the laptop; all times are in seconds.
    plt.figure()
    x = [1, 10, 100, 1000]                # event count
    y1 = [0.00068, 0.0047, 0.045, 0.63]   # minimum total time
    y2 = [0.0053, 2.1, 7.2, 17.0]         # maximum total time
    y3 = [0.0021, 0.18, 0.86, 5.6]        # average total time

    plt.plot(x, y2, label="Maximum")
    plt.plot(x, y3, label="Average")
    plt.plot(x, y1, label="Minimum")

    plt.legend()
    plt.grid()

    plt.xlabel("Event count")
    plt.ylabel("Time (Laptop)")

    # Both axes span several orders of magnitude, so plot logarithmically.
    plt.xscale("log")
    plt.yscale("log")

    plt.savefig("performance_results/laptop_single_listener.png")


def laptop_multiple_listeners():
    # Multiple Listeners results on the laptop; all times are in seconds.
    plt.figure()
    x = [1, 10, 100, 1000]                # listener count
    y1 = [0.63, 0.46, 0.42, 0.92]         # minimum total time
    y2 = [17.0, 25.0, 20.0, 3.24]         # maximum total time
    y3 = [5.6, 7.6, 7.1, 1.49]            # average total time

    plt.plot(x, y2, label="Maximum")
    plt.plot(x, y3, label="Average")
    plt.plot(x, y1, label="Minimum")

    plt.legend()
    plt.grid()

    plt.xlabel("Listener count")
    plt.ylabel("Time (Laptop)")

    plt.xscale("log")
    plt.yscale("log")

    plt.savefig("performance_results/laptop_multiple_listeners.png")


if __name__ == "__main__":
    laptop_single_listener()
    laptop_multiple_listeners()
BIN src/performance_results/laptop_multiple_listeners.png (new file, 26 KiB)
BIN src/performance_results/laptop_single_listener.png (new file, 34 KiB)
@@ -6,6 +6,13 @@
  month = may
}

@misc{DavidMEOWpaper,
  title = "Events as a Basis for Workflow Scheduling",
  author = "David Marchant",
  howpublished = "\url{https://sid.erda.dk/share_redirect/CA1fbrNHoD}",
  school = "University of Copenhagen"
}

@misc{SocketDoc,
  title = "socket - Low-level networking interface",
  author = "Python documentation",