Natalie Data Engineer

@nataindata

sharing weird things to stay ahead of AI
Senior Data Engineer šŸ“ London/Lisbon
ex PepsiCo, TripAdvisor šŸ‡ŗšŸ‡¦
Followers
101k
Following
653
Hey, I’m a Senior Data Engineer at TripAdvisor (ex PepsiCo) with more than 7 years dealing with data. Completely self-taught: I started with a 6-month bootcamp (thought I was gonna be a Python developer, lol) and then picked up everything else on the go.

I have a community of over 80,000 data and AI enthusiasts ā¤ļø My content is mostly memes and educational videos (in a funny and relatable format) covering many trendy topics. You can share your thoughts and doubts and ask for help, ’cause my people are amazing and always ready to help ā¤ļø

My road was bumpy… I remember 2018: I’m staring at my laptop šŸ’»
ā€˜What even is Apache Kafka?’
ā€˜Do I need to learn Java?’
ā€˜Hadoop is essential!? No wait, learn dbt!’
I felt like I was drowning in tutorials...

But you CAN become a Data Engineer:
- FASTER
- EASIER
- MENTORED

If you want to start your career as a Data Engineer: drop the šŸ”„ in the comments and I’ll send you the details
281 341
10 months ago
āš”ļø Why Spotify Wrapped is a smart*$$ in processing petabytes of data? (My previous reel went completely viral (3 Million of Views!) and you voted for a detailed explanation) šŸ¬Ā SPOILER: they’ve decreased 50% of their cloud costs with this! Spotify Wrapped is a giant distributed ETL pipeline, that uses a technique called Sort Merge Bucket (SMB) join. Spotify uses 3 main data sources for Wrapped: - Streaming activity šŸŽ§ - User metadata šŸ‘¶ - Streaming context ā° Tech stack: GCP platform with Scala based Dataflow, Avro 🧃 Here is the juice: These sources are converted to SMB format, which is bucketing and sorting data by user_id SMB is a technique where: 1. Bucketing: Data (usually the join column) is divided into smaller parts called buckets 2. Sorting: The data within each bucket is then sorted 3. Merging: When combining two datasets, like matching users with their listening history, SMB speeds things up because both datasets are **already bucketed and sorted**. It’s joined using a merge-sort algorithm, which is faster than traditional join methods Small tweak here: - The sortMergeTransform function is used to combine the 3 data sources, reading each one keyed by user_id. - This allows Spotify to join roughly 1PB of data without using conventional shuffle or Bigtable. šŸ˜®ā€šŸ’ØĀ The rest is simple: Smaller jobs aggregate a week’s or day’s worth of data for each user. And then weekly partitions are aggregated into 1 year’s worth of data. āš”ļø This ended up being a huge cost savings , we managed to join roughly a total of 1PB data without using conventional shuffle or Bigtable! šŸ·ļø sql, data, spotify, big data, database, #dataengineering, gcp, google cloud, python programming
13.5k 71
5 months ago
tools and tech I use Data & AI girlie. Also MCP servers I looove ā¤ļø
1,610 24
6 months ago
Anyone else feeling like this? not burnt out. not thriving. just… somewhere in the middle, running on caffeine and vibes and the constant feeling that if i don’t lock in RIGHT NOW i’m going to fall behind forever. the AI stuff moves so fast that ā€œcatching upā€ is basically a full-time job on top of your actual full-time job. I keep telling myself I’ll touch grass after I ship this one thing. that one thing keeps multiplying. if you’re in the same spiral: the gym guilt, the protein fixation, the claude-maxxing, the ā€œI should really sleepā€ - just know it’s not just you. we’re all just trying to optimize our token usage and our lives at the same time. Thoughts?
428 16
2 days ago
this is what I lived for.
15.3k 118
6 days ago
I finally read the actual paper. And here are my thoughts:

I’ve read ā€œAttention Is All You Needā€. Not a summary, not a YouTube explainer.

8 researchers at Google were trying to fix machine translation and got annoyed by a very specific problem:

-> Older models, called RNNs, had to read text the way a very slow person reads a book. One word at a time. Left to right. They could not skip ahead, could not look back efficiently, and the longer the sentence, the worse they got at remembering the beginning of it. Imagine trying to understand a 200-word sentence but your brain erases what you read three seconds ago. That was the architecture powering state-of-the-art AI in 2016.

So the researchers removed it entirely. The Transformer they built lets every single word look at every other word at the same time. Think of it less like reading a book and more like spreading all the pages on a table and seeing the whole story at once. That mechanism is called self-attention, and it is the core of the paper.

Then they ran that process not once but 8 times in parallel, with each run learning different kinds of relationships. One head might learn grammar. Another might learn who ā€œitā€ refers to in a sentence (it is usually NOT obvious). They called this multi-head attention.

And since the model no longer processes words in order, they had to tell it where each word sits in the sequence. They did that with positional encodings, basically injecting a signal built from sine and cosine waves into the data so the model knows word 1 from word 47.

The result? Trained in 12 hours on 8 GPUs. Beat every previous model on translation benchmarks. At a fraction of the cost.

It reads like eight very annoyed engineers optimizing a bottleneck on a Tuesday. And yet: GPT, Claude, Gemini, every LLM you used this week, all running on the exact same core idea from that 11-page paper.

Insane, huh?
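The three ideas above (self-attention, multi-head attention, sinusoidal positional encodings) fit in a toy NumPy sketch. The shapes and variable names here are mine, and this single-sequence version skips the masking, residuals, and layer norms of the real Transformer:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sine/cosine waves so the model can tell word 1 from word 47.
    Follows the paper's formula: even dims get sin, odd dims get cos."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Every word looks at every other word: the score matrix is all-pairs."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (seq_len, seq_len) similarity table
    return softmax(scores) @ v        # weighted mix of every other word

def multi_head(x, heads):
    """Run attention once per head, each with its own projections,
    then concatenate the results (the paper uses 8 heads)."""
    return np.concatenate([self_attention(x, *h) for h in heads], axis=-1)

# Usage: 5 "words" of dimension 8, positions injected by simple addition.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8)) + positional_encoding(5, 8)
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
out = multi_head(x, heads)            # (5, 8): one updated vector per word
```

Note that nothing in `self_attention` loops over the sequence: the whole-sentence lookup is one matrix multiply, which is exactly why this parallelizes where RNNs could not.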
11.0k 60
7 days ago
I’m somewhat of a coder myself
837 43
8 days ago
reduced my token baseline by 15% after this one-time setup 5 easy set-and-forget fixes for your Claude Code * this is the ā€œweird things I do to stay ahead of AIā€ series Btw, I ran a before-and-after test and confirmed the 15% reduction in tokens
165 10
14 days ago
Ep.1: 3 things that actually eat your tokens in Claude Code * Welcome to my series ā€œStrange things I do to stay ahead of AIā€ -> Follow my journey of becoming someone AI can’t replace
201 7
16 days ago
weird things I do to stay ahead of AI
138 9
17 days ago
you don’t need Netflix with this tech drama
164 6
22 days ago
AI this or that? Spill your picks šŸ‘‡ Now you know @tech.unicorn and @nataindata favs ā¤ļø
769 26
25 days ago