The SDR Database v.2.0 (SDR2) is a multi-country, multi-year database for research on political participation, social capital, and well-being. It comprises harmonized information from 23 international survey projects, covering over 4.4 million respondents from 156 countries in the period 1966–2017. SDR2 provides both target variables and methodological indicators that store source-survey and ex-post harmonization metadata. SDR2 consists of three datasets: the MASTER file, which stores harmonized information for a total of 4,402,489 respondents; the auxiliary PLUG-SURVEY file, which contains controls for source data quality and a set of technical variables needed for merging it with the MASTER file; and the PLUG-COUNTRY file, a dictionary of countries and territories used in the MASTER file. An overall description of the SDR2 Database and detailed information about its datasets are available in the SDR2 documentation. SDR2 is a product of the project Survey Data Recycling: New Analytic Framework, Integrated Database, and Tools for Cross-national Social, Behavioral and Economic Research, financed by the US National Science Foundation (PTE Federal award 1738502). We thank the Ohio State University and the Institute of Philosophy and Sociology, Polish Academy of Sciences, for organizational support.
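As a rough illustration of how these three files fit together, the sketch below attaches survey-level quality controls from the PLUG-SURVEY file to respondent records in the MASTER file using pandas. The file names and the merge key are placeholders; the actual technical variables used for merging are specified in the SDR2 documentation.

```python
# Rough sketch only: file names and the merge key ("t_survey") are placeholders;
# the SDR2 documentation specifies the actual technical variables used for merging.
import pandas as pd

master = pd.read_csv("sdr2_master.csv")            # respondent-level harmonized data (hypothetical path)
plug_survey = pd.read_csv("sdr2_plug_survey.csv")  # survey-level quality controls (hypothetical path)

# Attach source-survey quality indicators to each respondent record.
merged = master.merge(plug_survey, on="t_survey", how="left")
print(merged.shape)
```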
SDR Querier: A Visual Querying Framework for Cross-National Survey Data Recycling
Public opinion surveys are a widespread, powerful tool for studying people's attitudes and behaviors from comparative perspectives. However, even global surveys can have limited geographic and temporal coverage, which can hinder the production of comprehensive knowledge. To expand the scope of comparison, social scientists turn to ex-post harmonization of variables from datasets that cover similar topics but different populations and/or time periods. These harmonized datasets can be analyzed as a single source and accessed through various data portals. However, the Survey Data Recycling (SDR) research project has identified three challenges social scientists face when using data portals: the inability to explore data in depth or query data based on customized needs, the difficulty of efficiently identifying data related to a study, and the inability to evaluate theoretical models using sliced data. To address these issues, the SDR research project has developed the SDR Querier, which is applied to the harmonized SDR database. The SDR Querier includes a BERT-based model that supports customized data queries through research questions or keywords (Query-by-Question), a visual design that helps users determine the availability of harmonized data for a given research question (Query-by-Condition), and the ability to reveal the underlying relational patterns among substantive and methodological variables in the database (Query-by-Relation), aiding in the rigorous evaluation or improvement of regression models. Case studies with multiple social scientists demonstrate the usefulness and effectiveness of the SDR Querier in addressing their daily challenges.
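The abstract does not include the model itself, but the general idea behind a BERT-based Query-by-Question step can be sketched with an off-the-shelf sentence encoder: embed the research question and the harmonized variable labels, then rank the labels by similarity. The model name, labels, and scoring below are illustrative assumptions, not the SDR Querier implementation.

```python
# Illustrative sketch only: a generic sentence-embedding retrieval step standing in
# for a BERT-based Query-by-Question component. Model name, variable labels, and
# the ranking are assumptions, not the SDR Querier implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-family sentence encoder

question = "Does trust in parliament affect participation in demonstrations?"
variable_labels = [
    "trust in parliament",
    "participation in demonstrations",
    "interest in politics",
    "membership in voluntary organizations",
]

# Embed the research question and the harmonized variable labels,
# then rank the labels by cosine similarity to the question.
question_emb = model.encode(question, convert_to_tensor=True)
label_emb = model.encode(variable_labels, convert_to_tensor=True)
scores = util.cos_sim(question_emb, label_emb)[0]

for label, score in sorted(zip(variable_labels, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {label}")
```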
- Award ID(s): 1738502
- PAR ID: 10478644
- Publisher / Repository: IEEE Computer Society
- Date Published:
- Journal Name: IEEE Transactions on Visualization and Computer Graphics
- Volume: 29
- Issue: 6
- ISSN: 1077-2626
- Page Range / eLocation ID: 2862 to 2874
- Subject(s) / Keyword(s): Data visualization, Data models, Biological system modeling, Rivers, Portals, Bit error rate, Sociology
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- SDR 2.0 Cotton File: Cumulative List of Variables in the Surveys of the SDR Database is a comprehensive data dictionary in Microsoft Excel format. Its main purpose is to facilitate an overview of the 88,118 variables (i.e., variable names, values, and labels) available in the original (source) data files that we retrieved automatically for harmonization purposes in the SDR Project. Information in the Cotton File comes from 215 source data files comprising ca. 3,500 national surveys administered between 1966 and 2017 in 169 countries or territories as part of 23 international survey projects. (A brief pandas sketch of browsing such a data dictionary appears after this list.)
- Large-scale organic data generated from newspapers, social media, television, and radio require expertise in infrastructure management, data collection, and data processing in order to gain research value from them. We have developed text-analytic research portals to help social science researchers who do not have the resources necessary to collect, store, and process these large-scale datasets. Our portals allow researchers to use an intuitive point-and-click interface to generate variables from large, dynamic datasets using state-of-the-art text mining and learning methods. These timely variables constructed from noisy text can then be used to advance social science research in areas such as political science, economics, public health, and psychology.
- This demonstration showcases Chestnut, a data layout generator for in-memory object-oriented database applications. Given an application and a memory budget, Chestnut generates a customized in-memory data layout and the corresponding query plans, specialized for the application's queries. Our demo lets users design and improve simple web applications using Chestnut. Users can view the Chestnut-generated data layouts in a custom visualization system, which shows how the application parameters affect Chestnut's design. Finally, users can run queries generated by the application via the customized query plans produced by Chestnut or via traditional relational query engines, compare the results, and observe the speedup achieved by the Chestnut-generated query plans.
- Electronic medical records (EMR) contain comprehensive patient information and are typically stored in a relational database with multiple tables. Effective and efficient retrieval of patient information from EMR data is a challenging task for medical experts. Question-to-SQL generation methods tackle this problem by first predicting the SQL query for a given question about a database and then executing the query on the database. However, most existing approaches have not been adapted to the healthcare domain due to a lack of healthcare Question-to-SQL datasets for learning models specific to this domain. In addition, the wide use of abbreviated terminology and possible typos in questions introduces additional challenges for accurately generating the corresponding SQL queries. In this paper, we tackle these challenges by developing a deep-learning-based TRanslate-Edit Model for Question-to-SQL (TREQS) generation, which adapts the widely used sequence-to-sequence model to directly generate the SQL query for a given question and further performs the required edits using an attentive-copying mechanism and task-specific look-up tables (a hypothetical sketch of such a look-up edit step appears after this list). Based on a widely used, publicly available electronic medical database, we create a new large-scale Question-SQL pair dataset, named MIMICSQL, in order to perform the Question-to-SQL generation task in the healthcare domain. An extensive set of experiments is conducted to evaluate the performance of our proposed model on MIMICSQL. Both quantitative and qualitative results indicate the flexibility and efficiency of our proposed method in predicting condition values and its robustness to random questions with abbreviations and typos.
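As noted in the Cotton File entry above, a data dictionary like this can be browsed programmatically. The sketch below is a minimal, assumed example: the file name and the column names ("variable_name", "variable_label") are placeholders, since the actual layout is documented with the Cotton File itself.

```python
# Minimal sketch: file and column names are assumptions; the actual layout is
# documented with the Cotton File itself.
import pandas as pd

cotton = pd.read_excel("SDR2_cotton_file.xlsx")  # hypothetical file name

# List every source variable whose label mentions "trust", across all projects.
hits = cotton[cotton["variable_label"].str.contains("trust", case=False, na=False)]
print(hits[["variable_name", "variable_label"]].head())
```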
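As noted in the TREQS entry above, the model combines an attentive-copying mechanism with task-specific look-up tables to repair abbreviations and typos. The sketch below illustrates what such a look-up edit step can look like in isolation; the tables and the fuzzy matcher are illustrative assumptions, not the TREQS implementation.

```python
# Hypothetical sketch of a task-specific look-up edit step: expand abbreviations in
# the question and snap a predicted condition value to a known vocabulary entry.
# Both tables and the fuzzy matcher are assumptions, not the TREQS implementation.
import difflib

ABBREVIATIONS = {"bp": "blood pressure", "hr": "heart rate"}        # assumed table
KNOWN_VALUES = ["hypertension", "hyperlipidemia", "heart failure"]  # assumed vocabulary

def expand_abbreviations(question: str) -> str:
    # Replace whole-word abbreviations with their long forms.
    return " ".join(ABBREVIATIONS.get(token.lower(), token) for token in question.split())

def patch_condition_value(predicted: str) -> str:
    # Snap a possibly misspelled predicted value to the closest known value.
    match = difflib.get_close_matches(predicted, KNOWN_VALUES, n=1, cutoff=0.6)
    return match[0] if match else predicted

print(expand_abbreviations("average bp of patients with hypertnsion"))
print(patch_condition_value("hypertnsion"))
```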

