Johanna previously owned all Branch content including whitepapers, blog posts, and social media, and coordinated North America webinars. She also moonlighted in the product marketing realm, having owned the product update email, written product blog posts, and helped out with landing pages.
Jul 30, 2020
This post covers what Branch does, how we think about data science, and how our data infrastructure enables our data scientists.
Since 2014, Branch has focused on building the future of mobile discovery with deep links. Deep linking allows companies to link to pages inside their apps as if the app were a website, regardless of channel or platform. You’ve probably engaged with one of the 100B+ Branch deep links out in the wild.
Today, over 60,000 apps, including Airbnb, Pinterest, Reddit, Nextdoor, Buzzfeed, Twitch, Poshmark, and many more integrate with and trust Branch to power their linking infrastructure, attribution engine, and mobile analytics — across all platforms & channels.
As we continue to unify and bridge the web, mobile web and app ecosystems, we have bold and ambitious goals to power the mobile growth infrastructure for every app in the world and build a revolutionary mobile app and content discovery platform.
At Branch, we are strong proponents of a values-driven culture.
Values represent traits that every employee can expect of one another and hold each other accountable for. This starts at the very top with our founders and carries through to our on-site interview process, where every panel includes an interviewer from outside the hiring team who screens for compatibility with Branch’s shared values.
As data scientists, we turn to our values to define the principles that guide our day-to-day actions. Below is a list of our values, contextualized to a Data Scientist’s role at Branch.
In the section above, we highlighted how our values influence our day-to-day actions as Data Scientists at Branch. Just as our values shape our day-to-day role & thought process, so do our data infrastructure & tooling.
The decisions we’ve taken around tooling & system design have always centered around speed, scalability, reliability, privacy protection, compliance, and self-service — below is an overview of the technologies we leverage for ingesting, processing, storing & accessing data at Branch.
Our journey into the data layer starts with various Branch APIs publishing their events to Kafka. All of our data is wrapped in Protocol Buffers, giving a common data language to all consumers. Kafka serves as the backbone of our data infrastructure, powering rapid event processing, application logging, and feeding our real-time analytics pipelines.
Once data is in Kafka, it’s processed either in stream or batch. In general, stream processing allows you to get real-time, up-to-the-second data, while batch processing runs computations at a predefined interval.
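To make the distinction concrete, here is a minimal sketch in plain Python. It is not Branch's pipeline code: the event tuples, the function names, and the 60-second interval are all assumptions chosen for illustration. The stream version updates its answer after every event, while the batch version computes counts once per fixed window.

```python
from collections import Counter, defaultdict

# Hypothetical events: (unix_timestamp_seconds, event_name).
EVENTS = [
    (0, "click"),
    (30, "install"),
    (65, "click"),
    (70, "click"),
    (130, "open"),
]


def stream_counts(events):
    """Update a running count per event and yield a snapshot after each one."""
    running = Counter()
    for _, name in events:
        running[name] += 1
        yield dict(running)


def batch_counts(events, interval_seconds=60):
    """Group events into fixed windows and count each window in one pass."""
    windows = defaultdict(Counter)
    for ts, name in events:
        window_start = (ts // interval_seconds) * interval_seconds
        windows[window_start][name] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


stream_snapshots = list(stream_counts(EVENTS))
batch_result = batch_counts(EVENTS)
```

The stream path produces one snapshot per event (five here), so a reader always sees current totals. The batch path produces one result per window, which is typically simpler to schedule, retry, and monitor.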
Stream or batch, our Data Platform team makes our data sources easily accessible and offers an array of utilities to support either use case. On our Data Science team, though, we recommend using batch whenever possible, as our data platform has more sophisticated automation and tooling for batch than for streaming.
Examples of such tooling include custom Airflow operators & sensors to streamline workflows, built-in job monitoring & alerting, and personalized environments for testing.
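The idea behind a sensor can be sketched without Airflow itself. The plain-Python example below is only an illustration of the pattern (poll a condition at an interval, give up after a timeout, then run the downstream task). It does not use Airflow's API or any of Branch's custom operators, and `wait_for`, `run_when_ready`, and `SensorTimeout` are hypothetical names.

```python
import time


class SensorTimeout(Exception):
    """Raised when the condition is not met before the timeout."""


def wait_for(poke, poke_interval=1.0, timeout=10.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call poke() until it returns True, sleeping between attempts.

    Returns the number of attempts made. Raises SensorTimeout if the
    condition is still False once `timeout` seconds have elapsed.
    """
    started = clock()
    attempts = 0
    while True:
        attempts += 1
        if poke():
            return attempts
        if clock() - started >= timeout:
            raise SensorTimeout(f"condition not met after {attempts} attempts")
        sleep(poke_interval)


def run_when_ready(poke, task, **wait_kwargs):
    """Block on the sensor, then run the downstream task."""
    wait_for(poke, **wait_kwargs)
    return task()


# Hypothetical upstream check: the input "lands" on the third poll.
polls = iter([False, False, True])
result = run_when_ready(
    poke=lambda: next(polls),
    task=lambda: "aggregated",
    poke_interval=0,  # no real waiting in this demo
)
```

In a real workflow the `poke` callable would check something like the presence of an upstream partition, and the scheduler rather than a blocking loop would own the retries.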
An example of a typical data science workflow at Branch is:
S3 serves as our main storage layer, as it fulfills our needs for cost, reliability, scalability, and data integrity.
However, S3 isn’t the only storage system used at Branch. Many of our engineering teams engage with data in the processing layer, each with their own needs surrounding storage — be it schema structure, access pattern-based optimizations, latency requirements, or supported analytical functionality.
For example, application engineers on our Links & Attribution teams may need NoSQL datastores that are highly scalable & optimized for fast reads/writes (Aerospike, DynamoDB or FoundationDB), our Dashboard engineers may need a relational DB that can store & efficiently query JSON blobs (Postgres), and our Fraud Data Scientists may need to access massive datasets via warm storage (HDFS) to quickly load data into their fraud detection models.
Our Infrastructure team maintains & continually improves all of our data stores while our Data Platform team abstracts the complexity of accessing data across these disparate data stores by offering a secure and unified query layer.
Our Data Platform team supports two primary systems that enable Data Scientists, as well as any other business users, to query our data.
Druid
Druid is a database designed to power use cases where real-time ingest, high concurrency, fast query performance, and high uptime are critical.
It also powers one of our internal operational analytics tools, based on Turnilo, which allows our entire company (QA, Biz Dev, Customer Success, Sales, etc.) to slice and dice by more than 50 dimensions while getting answers with a median latency of 200ms.
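For readers unfamiliar with the term, "slice and dice" means filtering on some dimensions and grouping by others. The sketch below shows that operation over a few in-memory rows in plain Python. It says nothing about how Druid or Turnilo execute queries, and the rows, dimension names, and `slice_and_dice` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event rows; the dimension and metric names are illustrative.
ROWS = [
    {"platform": "ios", "channel": "email", "country": "US", "clicks": 3},
    {"platform": "ios", "channel": "social", "country": "US", "clicks": 5},
    {"platform": "android", "channel": "email", "country": "DE", "clicks": 2},
    {"platform": "android", "channel": "email", "country": "US", "clicks": 4},
]


def slice_and_dice(rows, group_by, metric, filters=None):
    """Filter rows on exact dimension matches, then sum a metric per group."""
    filters = filters or {}
    totals = defaultdict(int)
    for row in rows:
        if all(row[dim] == value for dim, value in filters.items()):
            key = tuple(row[dim] for dim in group_by)
            totals[key] += row[metric]
    return dict(totals)


# Dice: total clicks per platform.
by_platform = slice_and_dice(ROWS, ["platform"], "clicks")

# Slice to US traffic first, then dice by channel.
us_by_channel = slice_and_dice(ROWS, ["channel"], "clicks", filters={"country": "US"})
```

An analytics database answers the same kind of question over far more rows and dimensions, which is where fast query performance matters.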
Presto
For access to log-level insights & exploratory analysis, those proficient in SQL turn to Presto. Presto offers us 2 major benefits:
Our team uses multiple tools to engage with our query layer.
Looker is our primary business intelligence and data visualization tool. For highly custom dashboards where query performance is critical, we use Tableau, as it allows us to create extracts that can stand in for materialized views while we work on enabling materialized views for Presto.
For quick, slice & dice investigations into trends and time-series data, we use our internal operational analytics tool based on Turnilo.
To perform custom analysis or prototype models, we use personalized Jupyter notebook research environments, where we can spin up our own Spark cluster (on Kubernetes) according to our memory & compute demands.
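As a rough illustration of sizing a cluster to a workload, the sketch below turns a memory and core budget into standard Spark executor settings. The helper name `spark_conf_for` and the per-executor defaults (8 GB, 4 cores) are assumptions made for this example; they are not Branch's actual values or tooling.

```python
import math


def spark_conf_for(total_memory_gb, total_cores,
                   executor_memory_gb=8, executor_cores=4):
    """Return Spark settings sized to cover the requested memory and cores."""
    by_memory = math.ceil(total_memory_gb / executor_memory_gb)
    by_cores = math.ceil(total_cores / executor_cores)
    # Take the larger count so both the memory and the core budget are met.
    executors = max(1, by_memory, by_cores)
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.executor.cores": str(executor_cores),
    }


conf = spark_conf_for(total_memory_gb=64, total_cores=16)
```

In a notebook, each key and value could then be passed to `SparkSession.builder.config(key, value)` before the session is created.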
For the past few years, Branch has climbed its way up the data science hierarchy of needs, first building out a rock-solid data infrastructure, then a battle-hardened ELT/ETL system, and finally an advanced self-service analytics & reporting toolkit, all while supporting very high-throughput, high-reliability services for our customers.
With the foundation in place, we’re excited about the next evolution in our data capabilities — strengthening our product offerings with ML & Experimentation.
If you found what you read interesting, Branch’s Data Science organization is hiring! Thanks for reading, and please leave any questions or comments below.