Title: Compute Choice: Learning Distributions over Permutations
We discuss the question of learning a distribution over permutations of a given set of choices, options or items based on partial observations. This is central to capturing the so-called ``choice'' in a variety of contexts: understanding preferences of consumers over a collection of products based on purchasing and browsing data in the setting of retail and e-commerce, learning public opinion on a collection of socio-economic issues based on sparse polling data, deciding a ranking of teams or players based on outcomes of games, electing leaders based on votes, and more generally collaborative decision making based on collective judgement, such as accepting papers at a competitive academic conference. The question of learning a distribution over permutations arises beyond capturing ``choice'' as well, for example in tracking a collection of objects using noisy cameras, or aggregating rankings of web pages using outcomes of multiple search engines. It is only natural that such a topic has been extensively studied in Economics, Political Science and Psychology for more than a century, and more recently in Computer Science, Electrical Engineering, Statistics and Operations Research. Here we shall focus on the task of learning a distribution over permutations from its marginal distributions of two types: first-order marginals and pair-wise comparisons. There has been a lot of progress made on this topic in the last decade. The ideal goal is to provide a comprehensive overview of the state of the art on this topic. We shall provide a detailed overview of selected aspects, biased by the author's perspective of the topic, and provide sufficient pointers to aspects not covered here. We shall emphasize the ability to identify the entire distribution over permutations as well as the ``best ranking''.
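To make the two kinds of marginals named in the abstract concrete, here is a minimal sketch that computes them from a small, fully known distribution over rankings. The toy distribution, the function names, and the convention that a permutation lists items from most to least preferred are all assumptions made for illustration; none of them come from the paper itself.

```python
def first_order_marginals(dist, n):
    """M[i][j] = probability that item i is placed at position j."""
    M = [[0.0] * n for _ in range(n)]
    for perm, prob in dist.items():
        for pos, item in enumerate(perm):
            M[item][pos] += prob
    return M


def pairwise_marginals(dist, n):
    """P[i][j] = probability that item i is ranked ahead of item j."""
    P = [[0.0] * n for _ in range(n)]
    for perm, prob in dist.items():
        for a in range(n):
            for b in range(a + 1, n):
                P[perm[a]][perm[b]] += prob
    return P


# Hypothetical distribution over rankings of 3 items (probabilities sum to 1).
# Each key lists the items from most preferred to least preferred.
toy_dist = {(0, 1, 2): 0.5, (1, 0, 2): 0.3, (2, 1, 0): 0.2}

M = first_order_marginals(toy_dist, 3)
P = pairwise_marginals(toy_dist, 3)
```

The learning problem described in the abstract runs in the opposite direction: given only `M` or `P` (or noisy estimates of them), recover the distribution or a best ranking. In general many distributions share the same marginals, which is why identifiability is a central concern.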
Award ID(s):
1740751 1462158 1634259
NSF-PAR ID:
10112491
Author(s) / Creator(s):
Date Published:
Journal Name:
Cambridge University Press bulletin
ISSN:
0951-2454
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract  
  2. Parallel and distributed computing (PDC) has become pervasive in all aspects of computing, and thus it is essential that students include parallelism and distribution in the computational thinking that they apply to problem solving, from the very beginning. Computer science education is still teaching to a 20th century model of algorithmic problem solving. Sequence, branch, and loop are taught in our early courses as the only organizing principles needed for algorithms, and we invest considerable time in showing how best to sequentially process large volumes of data. All computing devices that students use currently have multiple cores as well as a GPU in many cases. Most of their favorite applications use multiple cores and numbers of distributed processors. Often concurrency offers simpler solutions than sequential approaches. Industry is desperate for software engineers who think naturally in terms of exploiting these capabilities, rather than seeing them as an exotic upper-level topic that gets layered over a sequential solution. However, we are still teaching students to solve problems using sequential thinking. In this workshop we overview key PDC concepts and provide examples of how they may naturally be incorporated in early computing classes. We will introduce plugged and unplugged curriculum modules that have been successfully integrated in existing computing classes at multiple institutions. We will highlight the upcoming summer training workshop, for which we have funding to support attendance, as well as other CDER (Center for Parallel and Distributed Computing Curriculum Development and Educational Resources) activities. 
  3. Despite increased calls for the need for more diverse engineers and significant efforts to “move the needle,” the composition of students, especially women, earning bachelor’s degrees in engineering has not significantly changed over the past three decades. Prior research by Klotz and colleagues (2014) showed that sustainability as a topic in engineering education is a potentially positive way to increase women’s interest in STEM at the transition from high school to college. Additionally, sustainability has increasingly become a more prevalent topic in engineering as the need for global solutions that address the environmental, social, and economic aspects of sustainability have become more pressing. However, few studies have examined students’ sustainability related career for upper-level engineering students. This time point is a critical one as students are transitioning from college to industry or other careers where they may be positioned to solve some of these pressing problems. In this work, we answer the question, “What differences exist between men and women’s attitudes about sustainability in upper-level engineering courses?” in order to better understand how sustainability topics may promote women’s interest in and desire to address these needs in their future careers. We used pilot data from the CLIMATE survey given to 228 junior and senior civil, environmental, and mechanical engineering students at a large East Coast research institution. This survey included questions about students’ career goals, college experiences, beliefs about engineering, and demographic information. The students surveyed included 62 third-year students, 96 fourth-year students, 29 fifth-year students, and one sixth-year student. 
In order to compare our results of upper-level students’ attitudes about sustainability, we asked the same questions as the previous study focused on first-year engineering students, “Which of these topics, if any, do you hope to directly address in your career?” The list of topics included energy (supply or demand), climate change, environmental degradation, water supply, terrorism and war, opportunities for future generations, food availability, disease, poverty and distribution of resources, and opportunities for women and/or minorities. As the answer to this question was binary, either “Yes,” or “No,” Pearson’s Chi-squared test with Yates’ continuity correction was performed on each topic for this question, comparing men’s and women’s answers. We found that women are significantly more likely to want to address water supply, food availability, and opportunities for women and/or minorities in their careers than their male peers. Conversely, men were significantly more likely to want to address energy and terrorism and war in their careers than their female peers. Our results begin to help us understand the particular differences that men and women, even far along in their undergraduate engineering careers, may have in their desire to address certain sustainability outcomes in their careers. This work begins to let us understand certain topics and pathways that may support women in engineering as well as provides comparisons to prior work on early career undergraduate students. Our future work will include looking at particular student experiences in and out of the classroom to understand how these sustainability outcome expectations develop. 
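As an illustration of the test named above, the following sketch applies Pearson's chi-squared test with Yates' continuity correction to a single 2x2 table (group by Yes/No answer). The counts in the usage lines are invented purely for illustration and are not data from the study; the function name is also hypothetical. The p-value uses the identity that, for one degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`.

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups, columns are the Yes/No answers.
    Returns (chi2 statistic, p-value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    # Continuity correction: shrink |ad - bc| by n/2, but never below zero.
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, NOT from the survey: (yes, no) for each group.
chi2, p_value = yates_chi2_2x2(30, 20, 25, 45)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

Because the abstract describes one such test per topic, a full analysis would repeat this over every topic's table; whether and how the study adjusted for multiple comparisons is not stated in the abstract.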
  4. Abstract

    The widespread digitization of natural history collections, combined with novel tools and approaches is revolutionizing biodiversity science. The ‘extended specimen’ concept advocates a more holistic approach in which a specimen is framed as a diverse stream of interconnected data. Herbarium specimens that by their very nature capture multispecies relationships, such as certain parasites, fungi and lichens, hold great potential to provide a broader and more integrative view of the ecology and evolution of symbiotic interactions. This particularly applies to parasite–host associations, which owing to their interconnectedness are especially vulnerable to global environmental change.

    Here, we present an overview of how parasitic flowering plants are represented in herbarium collections. We then discuss the variety of data that can be gathered from parasitic plant specimens, and how they can be used to understand global change impacts at multiple scales. Finally, we review best practices for sampling parasitic plants in the field, and subsequently preparing and digitizing these specimens.

    Plant parasitism has evolved 12 times within angiosperms, and similar to other plant taxa, herbarium collections represent the foundation for analysing key aspects of their ecology and evolution. Yet these collections hold far greater potential. Data and metadata obtained from parasitic plant specimens can inform analyses of co‐distribution patterns, changes in eco‐physiology and species plasticity spanning temporal and spatial scales, chemical ecology of tripartite interactions (e.g. host–parasite–herbivore), and molecular data critical for species conservation. Moreover, owing to the historic nature and sheer size of global herbarium collections, these data provide the spatiotemporal breadth essential for investigating organismal response to global change.

    Parasitic plant specimens are primed to serve as ideal examples of the extended specimen concept and help motivate the next generation of creative and impactful collection‐based science. Continued digitization efforts and improved curatorial practices will contribute to opening these specimens to a broader audience, allowing integrative research spanning multiple domains and offering novel opportunities for education.

     
  5. Obeid, Iyad ; Picone, Joseph ; Selesnick, Ivan (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000 image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how best the community can leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software [6]. We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case with one report per case. 
The data is organized by tissue type as shown below:

    Filenames:
        tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs
        tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx

    Explanation:
        tudp: root directory of the corpus
        v1.0.0: version number of the release
        svs: the image data type
        gastro: the type of tissue
        000001: six-digit sequence number used to control directory complexity
        00123456: 8-digit patient MRN
        2015_03_05: the date the specimen was captured
        0s15_12345: the clinical case name
        0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001) and a token number (e.g., s000)
        0s15_12345_00123456.docx: the filename for the corresponding case report

We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio’s “.svs” format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference. Another goal of this poster presentation is to share our experiences with the larger community since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning there are a lot that need to be sifted through and discarded for peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours, resulting in a significant loss of productivity. 
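The directory and filename convention described above lends itself to a small parser. The sketch below is not part of the TUDP tooling: the `SlidePath` type and `parse_slide_path` function are hypothetical names, and the field layout is inferred only from the two example paths in the abstract (including the assumption that the 8-digit field inside the filename repeats the patient MRN).

```python
from datetime import date
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class SlidePath(NamedTuple):
    root: str
    version: str
    data_type: str
    tissue: str
    sequence: str
    mrn: str
    captured: date
    case: str
    suffix: str           # ".svs" for images, ".docx" for case reports
    site: Optional[str]   # image files only, e.g. "0a001"
    cut: Optional[str]    # image files only, e.g. "lvl0001"
    token: Optional[str]  # image files only, e.g. "s000"


def parse_slide_path(path: str) -> SlidePath:
    """Split a corpus path into its documented components."""
    parts = PurePosixPath(path).parts
    if len(parts) != 9:
        raise ValueError(f"expected 9 path components, got {len(parts)}")
    root, version, data_type, tissue, seq, mrn, day, case, fname = parts
    year, month, dom = (int(x) for x in day.split("_"))
    name = PurePosixPath(fname)
    if not name.stem.startswith(case + "_"):
        raise ValueError("filename does not repeat the case name")
    # Fields that follow the repeated case name in the filename.
    rest = name.stem[len(case) + 1:].split("_")
    if len(rest) == 4:    # image: site, MRN, cut, token
        site, file_mrn, cut, token = rest
    elif len(rest) == 1:  # case report: MRN only
        site, cut, token = None, None, None
        file_mrn = rest[0]
    else:
        raise ValueError(f"unrecognized filename layout: {fname}")
    if file_mrn != mrn:
        raise ValueError("MRN in filename does not match directory MRN")
    return SlidePath(root, version, data_type, tissue, seq, mrn,
                     date(year, month, dom), case, name.suffix,
                     site, cut, token)


image = parse_slide_path(
    "tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/"
    "0s15_12345_0a001_00123456_lvl0001_s000.svs"
)
```

Raising on any unexpected layout, rather than guessing, is a deliberate choice here: the convention is known only from two examples, so a released corpus may contain variants this sketch does not cover.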
Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes. This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first since there was a learning curve with how the scanner worked and how to obtain samples from the hospital. From the start date until May of 2019, ~20,000 slides were scanned. In the past six months, from May to November, we have tripled that number and now hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking the software such as the tissue finder parameter settings, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night x 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process: it takes a couple of hours to sort, clean, and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks. 
To bridge the gap between hospital operations and research, we are using Aperio’s eSM software. Our goal is to provide pathologists access to high quality digital images of their patients’ slides. eSM is a secure website that holds the images with their metadata labels, patient report, and path to where the image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information. Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma in situ (DCIS) or invasive breast cancer tissue. There will be approximately 1,000 images per class in this corpus. 
Based on the number of features annotated, we can train on a two-class problem of DCIS or benign, or increase the difficulty by expanding the classes to include DCIS, benign, stroma, pink tissue, non-neoplastic, etc. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath. 