
Title: HaaS in Environmental Computing: Hadoop-as-a-Service for Big Data Mining in Environmental Computing Applications
This article addresses the importance of HaaS (Hadoop-as-a-Service) in cloud technologies, with specific reference to its usefulness in big data mining for environmental computing applications. The term environmental computing refers to computational analysis within environmental science and management, encompassing a myriad of techniques, especially in data mining and machine learning. As is well known, classical MapReduce has been adapted within many applications for big data storage and information retrieval. Hadoop-based tools such as Hive and Mahout are broadly accessible over the cloud and can be helpful in data warehousing and data mining over big data in various domains. In this article, we explore HaaS technologies, mainly based on Apache's Hive and Mahout, for applications in environmental computing, considering publicly available data on the Web. We dwell upon interesting applications such as automated text classification for energy management, recommender systems for ecofriendly products, and decision support in urban planning. We briefly explain the classical paradigms of MapReduce, Hadoop, and Hive; further delve into data mining and machine learning over the MapReduce framework; and explore techniques such as Naïve Bayes and Random Forests using Apache Mahout with respect to the targeted applications. Hence, the paradigm of Hadoop-as-a-Service, popularly referred to as HaaS, is emphasized here for its benefits in a domain-specific context. The studies in environmental computing presented in this article can be useful in other domains as well, considering similar applications. This article can thus be interesting to professionals in web technologies, cloud computing, environmental management, as well as AI and data science in general.
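The classical MapReduce paradigm mentioned above (map, shuffle/sort, reduce) can be illustrated with a minimal word-count sketch. This is a plain-Python simulation of the dataflow, not actual Hadoop; the sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["solar energy saves energy", "wind energy"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["energy"])  # → 3
```

On a real cluster, Hadoop distributes the map and reduce tasks across nodes and handles the shuffle over the network; the per-phase logic remains the same shape.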
Award ID(s): 2018575
PAR ID: 10570591
Publisher / Repository: ACM
Journal Name: ACM SIGWEB Newsletter
Volume: 2024
Issue: Autumn
ISSN: 1931-1745
Page Range / eLocation ID: 1 to 18
Sponsoring Org: National Science Foundation
More Like This
  1. There is an increasing demand for processing large volumes of unstructured data for a wide variety of applications. However, protection measures for these big data sets are still in their infancy, which could lead to significant security and privacy issues. Attribute-based access control (ABAC) provides a dynamic and flexible solution that is effective for mediating access. We analyzed and implemented a prototype application of ABAC to large dataset processing in Amazon Web Services, using open-source versions of Apache Hadoop, Ranger, and Atlas. The Hadoop ecosystem is one of the most popular frameworks for large dataset processing and storage and is adopted by major cloud service providers. We conducted a rigorous analysis of cybersecurity in implementing ABAC policies in Hadoop, including developing a synthetic dataset of information at multiple sensitivity levels that realistically represents healthcare and connected social media data. We then developed Apache Spark programs that extract, connect, and transform data in a manner representative of a realistic use case. Our result is a framework for securing big data. Applying this framework ensures that serious cybersecurity concerns are addressed. We provide details of our analysis and experimentation code in a GitHub repository for further research by the community. 
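The ABAC idea described above, mediating access by matching user and resource attributes against policies, can be sketched minimally. The policy structure, attribute names, and default-deny rule here are illustrative assumptions, not Apache Ranger's actual policy model or API.

```python
# Minimal attribute-based access control (ABAC) decision sketch.
POLICIES = [
    # Each policy: required user attributes, required resource attributes, decision.
    {"user": {"role": "analyst", "clearance": "phi"},
     "resource": {"sensitivity": "phi"},
     "decision": "allow"},
    {"user": {"role": "analyst"},
     "resource": {"sensitivity": "public"},
     "decision": "allow"},
]

def matches(required, actual):
    # Every required attribute must be present with the same value.
    return all(actual.get(k) == v for k, v in required.items())

def decide(user_attrs, resource_attrs):
    # Default-deny: allow only if some policy matches both attribute sets.
    for p in POLICIES:
        if matches(p["user"], user_attrs) and matches(p["resource"], resource_attrs):
            return p["decision"]
    return "deny"

print(decide({"role": "analyst", "clearance": "phi"}, {"sensitivity": "phi"}))  # allow
print(decide({"role": "intern"}, {"sensitivity": "phi"}))                       # deny
```

A production deployment would evaluate such policies centrally (e.g., in Ranger) for every Hadoop/Spark data access rather than in application code.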
  2. Popular big data frameworks, ranging from Hadoop MapReduce to Spark, rely on garbage-collected languages, such as Java and Scala. Big data applications are especially sensitive to the effectiveness of garbage collection (i.e., GC), because they usually process a large volume of data objects that lead to heavy GC overhead. Lacking in-depth understanding of GC performance has impeded performance improvement in big data applications. In this paper, we conduct the first comprehensive evaluation on three popular garbage collectors, i.e., Parallel, CMS, and G1, using four representative Spark applications. By thoroughly investigating the correlation between these big data applications’ memory usage patterns and the collectors’ GC patterns, we obtain many findings about GC inefficiencies. We further propose empirical guidelines for application developers, and insightful optimization strategies for designing big-data-friendly garbage collectors. 
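The core idea above, correlating an application's allocation of many short-lived data objects with the collector's activity, can be sketched in Python. Python's cyclic collector is of course not the JVM's Parallel, CMS, or G1, so this is only a rough analogue of GC-pressure profiling.

```python
import gc

# Count how many cyclic-GC passes fire while allocating many short-lived
# container objects -- a rough analogue of profiling GC pressure in a
# big data job that creates per-record temporaries.
gc_passes = {"n": 0}

def on_gc(phase, info):
    # gc.callbacks invokes this at the start and stop of every collection.
    if phase == "stop":
        gc_passes["n"] += 1

gc.callbacks.append(on_gc)
try:
    for _ in range(100_000):
        _ = {"record": list(range(5))}  # short-lived per-record objects
    gc.collect()  # force at least one full collection before reading the count
finally:
    gc.callbacks.remove(on_gc)

print("collections observed:", gc_passes["n"])
```

On the JVM, the analogous data would come from GC logs (e.g., `-Xlog:gc`) correlated against the application's phase-by-phase memory usage.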
  3. Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges for exhaustive, automatic testing because, unlike SQL queries, they consist of dataflow operators in which complex user-defined functions (UDFs) are prevalent. We design a new white-box testing approach, called BigTest, to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator. Our evaluation shows that, despite ultra-large input data sizes, the test coverage of real-world DISC applications is often significantly skewed and inadequate, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to minimize data size for local testing by five to eight orders of magnitude (10^5 to 10^8) while revealing 2X more manually injected faults than the previous approach. Our experiment shows that only a few data records (on the order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides a CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data. 
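The underlying idea of the approach above, keeping one representative record per execution path of a UDF instead of the full production dataset, can be shown with a toy sketch. The UDF, its path labels, and the sample data are hypothetical; BigTest's actual symbolic reasoning over dataflow operators is far more involved.

```python
# Toy illustration of path-based test-data minimization.
def udf_path(record):
    # Label the branch a (hypothetical) UDF would take for this record.
    if record < 0:
        return "negative"
    elif record == 0:
        return "zero"
    return "positive"

def minimize(dataset):
    # Keep one representative per equivalence class (per UDF path).
    reps = {}
    for rec in dataset:
        reps.setdefault(udf_path(rec), rec)
    return list(reps.values())

production = [5, 9, -3, 0, 12, -8, 7]
tiny = minimize(production)
print(len(tiny))  # → 3 records cover the same paths as all 7
```

Generalizing this to joint dataflow-and-UDF (JDU) paths means intersecting the equivalence classes induced by each operator along the pipeline.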
  4. The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. While research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework. 
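The session-reconstruction step described above, grouping consecutive requests by the same user into a session whenever the gap stays under a timeout, can be sketched as follows. The 30-minute timeout, field layout, and sample log are illustrative assumptions; in the framework above, the raw entries would flow through Spark and Elasticsearch.

```python
from datetime import datetime, timedelta

def sessionize(entries, timeout=timedelta(minutes=30)):
    # entries: (user, timestamp) pairs; a new session starts when the gap
    # since the user's previous request exceeds the timeout.
    current = {}    # user -> list of timestamps in the open session
    sessions = []   # all (user, timestamps) sessions reconstructed so far
    for user, ts in sorted(entries, key=lambda e: (e[0], e[1])):
        sess = current.get(user)
        if sess is None or ts - sess[-1] > timeout:
            sess = []
            sessions.append((user, sess))
            current[user] = sess
        sess.append(ts)
    return sessions

t = datetime(2024, 1, 1, 9, 0)
log = [("u1", t), ("u1", t + timedelta(minutes=5)),
       ("u1", t + timedelta(hours=2)), ("u2", t)]
print(len(sessionize(log)))  # → 3 sessions
```

In a parallel setting, partitioning the log by user (as a logPartitioner-style scheme would) keeps each user's entries on one worker so sessions never straddle partitions.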
  5. The overconsumption of energy in recent times has motivated many studies. Some of these explore the application of web technologies and machine learning models, aiming to increase energy efficiency and reduce the carbon footprint. This paper reviews three areas that overlap between the web and energy usage in the commercial sector: IoT (Internet of Things), cloud computing, and opinion mining. The paper elaborates on problems in terms of their causes, influences, and potential solutions, as found in multiple studies across these areas, and identifies potential gaps with scope for further research. In a rapidly digitizing and automated world, these three areas can contribute much towards reducing energy consumption and making the commercial sector more energy efficient. IoT and smart manufacturing can greatly assist effective production and more energy-efficient technologies. Cloud computing, with reference to its impact on green IT (information technology), is a major contributor to mitigating the carbon footprint and reducing energy costs. Opinion mining is significant for the part it plays in understanding the feelings, requirements, and demands of energy consumers and related stakeholders, helping create more suitable policies and navigate towards more energy-efficient strategies. This paper offers a comprehensive analysis of the literature in these areas to assess the current status and explore future research possibilities across them and the related multidisciplinary avenues. 