March 1; 1:00-2:30 PM
Julia Stoyanovich (Drexel)
Data-driven algorithmic decision making promises to improve people's lives, accelerate scientific discovery and innovation, and bring about positive societal change. Yet, if not used responsibly, this same technology can reinforce inequity, limit accountability and infringe on the privacy of individuals: irreproducible results can influence global economic policy; algorithmic changes in search engines can sway elections and incite violence; models based on biased data can legitimize and amplify discrimination in the criminal justice system; algorithmic hiring practices can silently reinforce diversity issues and potentially violate the law; privacy and security violations can erode the trust of users and expose companies to legal and financial consequences.
In this talk I will discuss our recent work on establishing a foundational new role for database technology, in which managing data in accordance with ethical and moral norms, and with legal and policy considerations, becomes a core system requirement. I will define properties of responsible data management, which include fairness, transparency, and data protection. I will highlight some of our recent technical advances and discuss the overall framework in which these responsibility properties are managed and enforced through all stages of the data lifecycle. The broader goal of our project is to help usher in a new phase of data science, in which the technology considers not only the accuracy of the model but also ensures that the data on which it depends respects the relevant laws, societal norms, and impacts on humans. Additional information about our project is available at DataResponsibly.com.
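Of the responsibility properties named above, fairness is perhaps the easiest to make concrete. A minimal sketch, with hypothetical data and a toy metric (not taken from the talk), of checking demographic parity in a set of selection decisions:

```python
from collections import Counter

def selection_rates(decisions):
    """decisions: list of (group, selected) pairs, selected in {True, False}."""
    totals, selected = Counter(), Counter()
    for group, picked in decisions:
        totals[group] += 1
        if picked:
            selected[group] += 1
    return {g: selected[g] / totals[g] for g in totals}

def disparate_impact_ratio(decisions):
    """Ratio of the lowest to the highest group selection rate (1.0 = parity)."""
    rates = selection_rates(decisions)
    return min(rates.values()) / max(rates.values())

# Illustrative decisions: group A is selected 2/3 of the time, group B 1/3.
decisions = [("A", True), ("A", True), ("A", False),
             ("B", True), ("B", False), ("B", False)]
print(disparate_impact_ratio(decisions))  # 0.5
```

A check like this is only a starting point; operationalizing fairness in a data management system, as the talk describes, means enforcing such properties throughout the data lifecycle rather than auditing a single output.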
Julia Stoyanovich is Assistant Professor of Computer Science at Drexel University, and an affiliated faculty at the Center for Information Technology Policy at Princeton. She is a recipient of an NSF CAREER award and of an NSF/CRA Computing Innovations Fellowship. Julia's research focuses on responsible data management and analysis practices: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data acquisition and processing lifecycle. She established the Data, Responsibly consortium, serves on the ACM task force to revise the Code of Ethics and Professional Conduct, and is active in the New York City algorithmic transparency effort. In addition to data ethics, Julia works on management and analysis of preference data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst.
Canceled / To Be Rescheduled
Deep Curation: Putting Open Science Data to Work
Bill Howe (University of Washington)
Data in public repositories and in the scientific literature remains remarkably underused despite significant investments in open data and open science. Making data available online turns out to be the easy part; making the data usable for data science requires new services to support longitudinal, multi-dataset analysis rather than just settling for keyword search.
In the Deep Curation project, we use distant supervision and co-learning to automatically label datasets with zero training data. We have applied this approach to curate gene expression data and to identify figures in the scientific literature, outperforming state-of-the-art methods that rely on manual supervision. We then use claims extracted from the text of papers to guide probabilistic data integration and schema matching, enabling experiments that automatically verify claims against open data and providing a repository-wide "report card" on the utility of the data and the reliability of the claims made against them.
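The phrase "label datasets with zero training data" refers to distant supervision: instead of hand-labeled examples, an external resource supplies noisy labels. A toy sketch of the idea, where the vocabulary and threshold are illustrative assumptions and not the project's actual pipeline:

```python
# Distant supervision in miniature: label a column "gene" if enough of its
# values appear in an external vocabulary of known gene symbols -- no
# hand-labeled training examples are used anywhere.

KNOWN_GENES = {"BRCA1", "TP53", "EGFR", "MYC"}  # illustrative external resource

def label_column(values, vocabulary, threshold=0.5):
    """Return a noisy label based on overlap with the vocabulary."""
    if not values:
        return "unknown"
    hits = sum(1 for v in values if v.upper() in vocabulary)
    return "gene" if hits / len(values) >= threshold else "other"

print(label_column(["BRCA1", "TP53", "abc99"], KNOWN_GENES))  # gene
print(label_column(["Seattle", "Boston"], KNOWN_GENES))       # other
```

In a real system, labels produced this way are noisy; the co-learning step mentioned in the abstract is one way such noise can be reduced by combining multiple weak labelers.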
Bill Howe is an Associate Professor in the Information School, Adjunct Associate Professor in Computer Science & Engineering, and Associate Director and Senior Data Science Fellow at the UW eScience Institute. He is a co-founder of Urban@UW and, with support from the MacArthur Foundation and Microsoft, leads UW's participation in the MetroLab Network. He created an early MOOC on data science through Coursera and led the creation of the UW Data Science Master's Degree, serving as its first Program Director and Faculty Chair. He also serves on the Steering Committee of the Center for Statistics in the Social Sciences.
May 15; 3:30-5:00
Rethinking Query Execution on Big Data
Dan Suciu (University of Washington)
Database engines today use the same approach to evaluate a query as they did forty years ago: convert the query into a query plan, then execute each operator individually, e.g. a join, followed by another join, followed by duplicate elimination. It turns out that converting a query into binary joins is theoretically suboptimal, and this can lead to poor performance over very large datasets. A new query evaluation paradigm has emerged recently (some of it coming out of the University at Buffalo), which, in some cases, leads to provably optimal algorithms. In this talk I will give a brief survey of this paradigm along with some new results: I will review the AGM bound on the query size (Atserias, Grohe, and Marx), the worst-case optimal "generic join" algorithm for full conjunctive queries (Ngo, Re, and Rudra), and our new algorithm for aggregate queries, called PANDA, which matches the best known running times for certain graph problems.
(Joint work with Mahmoud Abo Khamis and Hung Ngo)
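To make the gap concrete: for the triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c) with each relation of size N, a plan of binary joins can materialize an intermediate result of size on the order of N^2, while the AGM bound caps the output at N^(3/2), a bound the generic join algorithm matches. Below is a minimal sketch of the generic-join idea specialized to this one query; it is a simplification for illustration, not the general algorithm from the paper:

```python
from collections import defaultdict

def triangles(R, S, T):
    """Generic-join-style enumeration of Q(a,b,c) :- R(a,b), S(b,c), T(a,c).

    Variables are eliminated one at a time (a, then b, then c), always
    intersecting the candidate sets of every relation that mentions the
    variable -- the key idea behind worst-case optimal join algorithms.
    """
    R_idx = defaultdict(set)   # a -> {b}
    S_idx = defaultdict(set)   # b -> {c}
    T_idx = defaultdict(set)   # a -> {c}
    for a, b in R:
        R_idx[a].add(b)
    for b, c in S:
        S_idx[b].add(c)
    for a, c in T:
        T_idx[a].add(c)

    out = []
    for a in R_idx.keys() & T_idx.keys():      # candidates for a
        for b in R_idx[a] & S_idx.keys():      # candidates for b, given a
            for c in S_idx[b] & T_idx[a]:      # candidates for c, given a and b
                out.append((a, b, c))
    return sorted(out)

R = [(1, 2), (1, 3)]
S = [(2, 3), (3, 4)]
T = [(1, 3), (1, 4)]
print(triangles(R, S, T))  # [(1, 2, 3), (1, 3, 4)]
```

The point is that no pairwise join result is ever materialized: every intersection is bounded by the smallest participating set, which is what keeps the running time within the worst-case output size.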
Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books: Data on the Web: from Relations to Semistructured Data and XML (1999) and Probabilistic Databases (2011). He is a Fellow of the ACM, holds twelve US patents, and received the best paper award at SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, the 10 Year Most Influential Paper Award at ICDE 2013, and the VLDB Ten Year Best Paper Award in 2014; he is a recipient of the NSF CAREER Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, is an associate editor for the Journal of the ACM, the VLDB Journal, ACM TWEB, and Information Systems, and is a past associate editor for ACM TODS and ACM TOIS. Suciu's Ph.D. students Gerome Miklau, Christopher Re, and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016, respectively, and Nilesh Dalvi was a runner-up in 2008.
May 17; 3:30-5:00
Diagnoses and Explanations: Creating a Higher-Quality Data World
Alexandra Meliou (UMass Amherst)
The correctness and proper function of data-driven systems and applications rely heavily on the correctness of their data. Low-quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to select good data sources for generating the data, and to address the root causes of problems.
Our work is grounded in an important insight: while existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that many errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also discuss our work on MIDAS, a recommendation system that improves the quality of datasets by identifying and filling information gaps. Finally, I will discuss a vision for explanation frameworks to assist the exploration of information in a varied, diverse, highly non-integrated data world.
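The shift from cleaning to diagnosis can be illustrated with a toy version of this setting: given a log of update queries and a complaint about an incorrect value, replay the log to find the update that first introduced the bad value. This is a deliberately simplified sketch with made-up data; the actual diagnostic frameworks solve a much harder inference problem:

```python
def diagnose(initial, log, key, wrong_value):
    """Replay a log of updates on a copy of the initial table and return the
    index of the first update after which table[key] holds the wrong value."""
    table = dict(initial)
    for i, update in enumerate(log):
        update(table)                      # apply the next logged query
        if table.get(key) == wrong_value:
            return i
    return None

# Toy log: each entry is a function standing in for an update query.
log = [
    lambda t: t.update(alice=t["alice"] + 10),   # legitimate adjustment
    lambda t: t.update(alice=t["alice"] * 100),  # buggy update: wrong factor
    lambda t: t.update(bob=t["bob"] + 5),
]
initial = {"alice": 50, "bob": 40}
print(diagnose(initial, log, "alice", 6000))  # 1 -> the second update is suspect
```

The contrast with cleaning is the output: instead of a corrected value, the result points at the step in the generative process that produced the error, so the systemic cause can be fixed.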
Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington, working with Dan Suciu. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.