Applications that provide a search service for things we need are the way of the future. One of the most popular application categories today is ride-sharing, with apps like Lyft and Uber.

Founded in 2012, Lyft has quickly become one of the largest transportation networks in the United States and Canada as the world shifts away from car ownership towards transportation-as-a-service.

Their mission? To improve people’s lives with the world’s best transportation. Lyft is making good on that mission with a transportation network that includes ridesharing, bikes, scooters, car rentals, and transit, all available from a single phone app.

Challenges in Scaling the Lyft App

With so much growth, making wise use of the data flowing into the application requires technology that can support it. Lyft relies on a cloud-based development infrastructure based on Amazon Web Services (AWS), including Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

Initially, Lyft’s front-end service depended on Amazon’s Redshift data warehouse and Kinesis message bus as data stores. As the volume of frequent users grew, the tight coupling of compute and storage in that setup made the application hard to scale. To resolve this, Lyft elected to migrate from Redshift to Apache Hive on AWS.

With a constant influx of new datasets from various sources, including SQL tables, Presto, Hive, and Postgres, as well as dashboards in BI tools like Mode, Superset, and Tableau, Lyft had little insight into their data lineage or into how changes affected their data flow and access.

To sustain their growth, they knew they needed fast, flexible access to data in order to power their application and services, visualize information flow, identify and monitor errors, and conduct impact analyses of changes to their data.

Lyft Creates Amundsen Tool to Improve Data Access

To provide faster access to the targeted data users need, Lyft, along with co-creator Mark Grover, developed a backend data discovery tool named Amundsen (after the Norwegian explorer Roald Amundsen).

Amundsen is an open-source data discovery and metadata engine that enables data science engineers and software engineers to gather necessary data from numerous pipelines into a central place and improve their productivity by up to 20%.

Amundsen enables users to:

  • Search for data assets with a simple search engine whose PageRank-inspired algorithm ranks and recommends results.
  • Use a metadata service to view curated data, including user information such as statistics and when a table was last updated.
  • Learn from others by seeing what data co-workers use most, common queries, and table-based dashboards.
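To make the idea of a PageRank-inspired ranking concrete, here is a minimal sketch in Python. It is not Amundsen's actual implementation: the `TableDoc` class, the scoring formula, and the table names are all hypothetical, and the only assumption taken from the text is that results are ordered by how much a table is used, not just by how well its text matches.

```python
import math
from dataclasses import dataclass


@dataclass
class TableDoc:
    name: str
    description: str
    query_count: int  # hypothetical usage signal: how often the table is queried


def score(doc: TableDoc, term: str) -> float:
    """Text match weighted by popularity; 0.0 when the term does not match."""
    term = term.lower()
    if term in doc.name.lower():
        match = 2.0  # a match in the table name counts more
    elif term in doc.description.lower():
        match = 1.0
    else:
        return 0.0
    # log1p dampens very large counts; the +1 keeps unused tables findable.
    return match * (1.0 + math.log1p(doc.query_count))


def search(docs, term):
    """Return names of matching tables, highest score first."""
    scored = [(score(d, term), d) for d in docs]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d.name for _, d in hits]


# Hypothetical catalog entries, for illustration only.
catalog = [
    TableDoc("rides_raw", "Unprocessed ride events", 12),
    TableDoc("rides_daily", "Daily ride aggregates", 4500),
    TableDoc("driver_payouts", "Payouts per driver, joined to rides", 900),
    TableDoc("scooter_trips", "Scooter trip log", 300),
]
results = search(catalog, "rides")
print(results)
```

In this sketch a heavily used table can outrank a rarely used one even when the rarely used table matches the search term more directly, which is the behavior a popularity-weighted search is meant to produce.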

Data sources can include:

  • Data stores like Hive, Presto, and MySQL.
  • BI/reporting tools like Tableau, Looker, and Apache Superset.
  • Events and schemas stored in schema registries.
  • Streams like Apache Kafka and AWS Kinesis.
  • Processing information from ETL jobs and machine learning workflows.

Unfortunately, one drawback to Amundsen is that the data it pulls is represented in a static table format with little insight into where it came from and how it’s being used, much like a glossary with no definitions.

Users must then try to fill in the gaps themselves by manually mapping data lineage, which can prove time-consuming and error-prone.

Improving the Data Model with CompilerWorks Lineage

To give users greater ability to trace the lineage of data from its various sources in Amundsen, Lyft employed CompilerWorks Lineage to better understand what data is being used, by whom, for what, and how it was processed.
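Tracing lineage amounts to walking a graph of which datasets feed which others. The sketch below is a generic illustration of that idea, not the CompilerWorks Lineage API: the `FEEDS` mapping, the dataset names, and the `downstream` and `upstream` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
FEEDS = {
    "kinesis.ride_events": ["hive.rides_raw"],
    "hive.rides_raw": ["hive.rides_daily", "hive.driver_payouts"],
    "hive.rides_daily": ["tableau.city_dashboard"],
    "hive.driver_payouts": [],
    "tableau.city_dashboard": [],
}


def downstream(source, feeds):
    """Breadth-first walk: everything affected by a change to `source`."""
    visited = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in feeds.get(node, []):
            if child not in visited:
                visited.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(target, feeds):
    """Reverse the edges, then reuse the same walk to find where data came from."""
    reverse = {}
    for parent, children in feeds.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(target, reverse)


impacted = downstream("hive.rides_raw", FEEDS)        # impact analysis
origins = upstream("tableau.city_dashboard", FEEDS)   # provenance
print(impacted)
print(origins)
```

The downstream walk answers the impact-analysis question (what breaks if this table changes), while the upstream walk answers the provenance question (where a dashboard's numbers came from). A real lineage tool would derive these edges from the SQL and pipeline code instead of a hand-written mapping.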

Since it was deployed in 2018, CompilerWorks Lineage has become an integral part of the success of Lyft’s data scientists, engineers, and business users.

CompilerWorks Lineage use cases include:

  • Data Exploration
  • Data Quality
  • Pipeline Migration
  • Cost Control
  • Usage Tracking and Reporting
  • Onboarding New Data Analysts, Data Engineers, and Scientists

Combined, CompilerWorks Lineage and Lyft’s Amundsen enable users to:

  • Deliver data lineage transparency and literacy
  • Enable cost-effective, confident data migrations
  • Reduce risk posed by corrupt or inaccurate data resources
  • Optimize compute resource utilization, saving millions
  • Improve workflow productivity at every level
  • Ensure data accuracy

To learn more about how Lyft is using CompilerWorks Lineage to increase data transparency, accuracy, cost efficiency, and productivity, read the full customer success story here.