Title: WLCG Networks: Update on Monitoring and Analytics
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, is focused on being the primary source of networking information for its partners and constituents. It was established to ensure sites and experiments can better understand and fix networking issues, while providing an analytics platform that aggregates network monitoring data with higher level workload and data transfer services. This has been facilitated by the global network of perfSONAR instances that have been commissioned and are operated in collaboration with the WLCG Network Throughput Working Group. An additional important update is the inclusion of the newly funded NSF project SAND (Service Analytics and Network Diagnosis), which focuses on network analytics. This paper describes the current state of the network measurement and analytics platform and summarises the activities undertaken by the working group and our collaborators. This includes the progress being made in providing higher level analytics, alerting and alarming from the rich set of network metrics we are gathering.
Editors:
Doglioni, C.; Kim, D.; Stewart, G.A.; Silvestris, L.; Jackson, P.; Kamleh, W.
Award ID(s):
1836650 1827116
NSF-PAR ID:
10257016
Journal Name:
EPJ Web of Conferences
Volume:
245
Page Range or eLocation-ID:
07053
ISSN:
2100-014X
Sponsoring Org:
National Science Foundation
More Like this
  1. Doglioni, C. ; Kim, D. ; Stewart, G.A. ; Silvestris, L. ; Jackson, P. ; Kamleh, W. (Ed.)
    High Energy Physics (HEP) experiments rely on the networks as one of the critical parts of their infrastructure both within the participating laboratories and sites as well as globally to interconnect the sites, data centres and experiment instrumentation. Network virtualisation and programmable networks are two key enablers that facilitate agile, fast and more economical network infrastructures as well as service development, deployment and provisioning. Adoption of these technologies by HEP sites and experiments will allow them to design more scalable and robust networks while decreasing the overall cost and improving the effectiveness of the resource utilization. The primary challenge we currently face is ensuring that WLCG and its constituent collaborations will have the networking capabilities required to most effectively exploit LHC data for the lifetime of the LHC. In this paper we provide a high level summary of the HEPiX NFV Working Group report that explored some of the novel network capabilities that could potentially be deployed in time for HL-LHC.
  2. The experiments at the Large Hadron Collider (LHC) rely upon a complex distributed computing infrastructure (WLCG) consisting of hundreds of individual sites worldwide at universities and national laboratories, providing about half a billion computing job slots and an exabyte of storage interconnected through high speed networks. Wide Area Networking (WAN) is one of the three pillars (together with computational resources and storage) of LHC computing. More than 5 PB/day are transferred between WLCG sites. Monitoring is one of the crucial components of WAN and experiment operations. In recent years all experiments have invested significant effort to improve monitoring and integrate networking information with data management and workload management systems. All WLCG sites are equipped with perfSONAR servers to collect a wide range of network metrics. We will present the latest developments in providing a 3D force-directed graph visualization for data collected by perfSONAR. The visualization package allows site admins, network engineers, scientists and network researchers to better understand the topology of our Research and Education networks, and it provides the ability to identify unreliable and/or non-optimal network paths, such as those with routing loops or rapidly changing routes.
  3. Kubernetes, an open-source container orchestration platform, has been widely adopted by cloud service providers (CSPs) for its advantages in simplifying container deployment, scalability and scheduling. Networking is one of the central components of Kubernetes, providing connectivity between different pods (groups of containers) both within the same host and across hosts. To bootstrap Kubernetes networking, the Container Network Interface (CNI) provides a unified interface for the interaction between container runtimes. There are several CNI implementations, available as open-source ‘CNI plugins’. While they differ in functionality and performance, it is a challenge for a cloud provider to differentiate and choose the appropriate plugin for their environment. In this paper, we compare the various open source CNI plugins available from the community, qualitatively and through detailed quantitative measurements. With our experimental evaluation, we analyze the overheads and bottlenecks for each CNI plugin, as a result of the network model it implements, interaction with the host network protocol stack and the network policies implemented in iptables rules. The choice of the CNI plugin may also be based on whether intra-host or inter-host communication dominates.
  4. In this paper, we propose a responsive autonomic and data-driven adaptive virtual networking framework (RAvN) to detect and mitigate anomalous network behavior. The proposed detection scheme detects both low rate and high rate denial of service (DoS) attacks using (1) a new Centroid-based clustering technique, (2) a proposed Intragroup variance technique for data features within network traffic (C.Intra) and (3) a multivariate Gaussian distribution model fitted to the constant changes in the IP addresses of the network. RAvN integrates the adaptive reconfigurable features of a popular SDN platform (open networking operating system (ONOS)); the network performance statistics provided by traffic monitoring tools (such as T-shark or sflow-RT); and the analytics and decision-making tools provided by new and current machine learning techniques. The decision making and execution components generate adaptive policy updates (i.e. anomalous mitigation solutions) on-the-fly to the ONOS SDN controller for updating network configurations and flows. In addition, we compare our anomaly detection schemes for detecting low rate and high rate DoS attacks versus a commonly used unsupervised machine learning technique, Kmeans. Kmeans recorded 72.38% accuracy, while the multivariate clustering and the Intra-group variance methods recorded 80.54% and 96.13% accuracy respectively, a significant performance improvement.
  5. SLATE (Services Layer at the Edge) is a new project that, when complete, will implement “cyberinfrastructure as code” by augmenting the canonical Science DMZ pattern with a generic, programmable, secure and trusted underlayment platform. This platform will host advanced container-centric services needed for higher-level capabilities such as data transfer nodes, software and data caches, workflow services and science gateway components. SLATE will use best-of-breed data center virtualization components, and where available, software defined networking, to enable distributed automation of deployment and service lifecycle management tasks by domain experts. As such it will simplify creation of scalable platforms that connect research teams, institutions and resources to accelerate science while reducing operational costs and development cycle times. Since SLATE will be designed to require only commodity components for its functional layers, its potential for building distributed systems should extend across all data center types and scales, thus enabling creation of ubiquitous, science-driven cyberinfrastructure. By providing automation and programmatic interfaces to distributed HPC backends and other cyberinfrastructure resources, SLATE will amplify the reach of science gateways and therefore the domain communities they support.