Scientific Data Engineer II

Scientific Data Engineer II – Data Integration Team

The mission of the Allen Institute is to unlock the complexities of bioscience and advance our knowledge to improve human health. Using an open science, multi-scale, team-oriented approach, the Allen Institute focuses on accelerating foundational research, developing standards and models, and cultivating new ideas to make a broad, transformational impact on science.

The mission of the Allen Institute for Brain Science is to accelerate the understanding of how the human brain works in health and disease. Using a big science approach, we generate useful public resources, drive technological and analytical advances, and discover fundamental brain properties through integration of experiments, modeling and theory.

We are looking for a Scientific Data Engineer with a strong track record of supporting data management or analytics. This position works closely with research and data production teams to support data modeling, ingest, curation, and integration of large-scale, diverse data on an ongoing basis.

The ideal candidate will have a strong background in data modeling and data integration and will bring a passion for linking data and knowledge to help create a useful compendium of information for neuroscientists. You will interact regularly with neuroscientists and a wide variety of engineers both internally and as part of a large consortium, collaborating on developing a vast resource on the brain.

To gain more insight into the projects you will impact, please visit the Allen Brain Map data portal (portal.brain-map.org) and the BICCN portal (biccn.org).

The Allen Institute believes that team science significantly benefits from the participation of diverse voices, experiences, and backgrounds. High-quality science can only be produced when it includes different perspectives. We are committed to increasing diversity across every team and encourage people from all backgrounds to apply for this role.

Essential Functions

  • Perform data wrangling, data curation, and data validation; serve as a liaison to, and provide technical support for, consortium scientists
  • Lead efforts in collaboration with domain experts to prototype data structures
  • Lead development of data management standards, best practices, and policies; contribute to development of data governance strategy and the data life cycle
  • Lead and contribute to efforts with community partners and scientists to develop and document data standards, including data quality and formats
  • Assess data and product needs to define data requirements for data integration; facilitate requirements review
  • Document data ingest processes and curation SOPs
  • Build consensus for data integration efforts across platforms and organizations and advocate for FAIR principles
  • Produce data reports and write data release notes
  • Participate in outreach efforts to curate, publish, and publicize high-dimensional biomedical data sets, research tools, and publications
  • Co-develop requirements and support creation of infrastructure and tools for data ingest, ETL, and dashboarding
  • Identify, investigate, and solve data quality issues on an ongoing basis

Required Education and Experience

  • Bachelor's degree in a relevant technical discipline (e.g., neuroscience, genomics, physics, informatics, applied math)
  • Proficiency with one of the following or similar languages and technologies: Python, R, Scala, RDF, JSON-LD, or a graph query language (e.g., SPARQL, Cypher)

Preferred Education and Experience

  • At least 2 years’ experience in a data-heavy role such as data science, machine learning, data analytics, or data engineering
  • Advanced degree (M.S., Ph.D.) in a relevant technical discipline (e.g., neuroscience, genomics, physics, informatics, applied math)
  • Demonstrable knowledge of and ability to apply (meta)data standards
  • Experience designing or developing non-relational/unstructured, graph, or big data databases
  • In-depth understanding of one or more data domains relevant to neuroscience, imaging, and genomics
  • Familiarity with scientific data modeling, metadata management, data governance and data quality technologies
  • History of contributing to open source and/or community-based projects
  • Hands-on knowledge of cloud-based infrastructure (AWS, Azure, or GCP) and experience using data-related services
  • Experience building anomaly detection tools and alerting systems
  • Proficiency with OWL, RDF, JSON-LD, and graph query languages
  • Familiarity with the DAMA Data Management Body of Knowledge (DMBOK)
  • Strong interpersonal, communication and presentation skills
  • Strong project management and organizational skills
  • Excellent analytical and problem-solving skills combined with capacity for complex, detail-oriented work

Work Environment

  • Occasional exposure to a laboratory environment, with possible exposure to chemical, biological, or other hazardous substances

Physical Demands

  • Sitting, standing, bending, and squatting as found in a typical office environment

Position Type/Expected Hours of Work

  • This role is currently able to work remotely due to COVID-19 and our focus on employee safety. We are a Washington State employer, and remote work must be performed in Washington State. We continue to evaluate the safest options for our employees. As restrictions are lifted in relation to COVID-19, this role will return to work onsite.

It is the policy of the Allen Institute to provide equal employment opportunity (EEO) to all persons regardless of age, color, national origin, citizenship status, physical or mental disability, race, religion, creed, gender, sex, sexual orientation, gender identity and/or expression, genetic information, marital status, status with regard to public assistance, veteran status, or any other characteristic protected by federal, state or local law. In addition, the Allen Institute will provide reasonable accommodations for qualified individuals with disabilities.