Launching an 8-Part Series: AskSense from Conception to Deployment
Trigger Moment: Uncovering the Real Need for Semantic Search
The spark for this project actually came from our desire to help the customer support team tackle scam messages more effectively. With hundreds of support tickets coming in every day, many contained cleverly disguised phishing attempts that looked perfectly legitimate. You know the type: “Please verify your account security immediately,” or “You’ve been issued a refund; click this link.” Simply checking for keywords like “security” or “refund” didn’t cut it—this approach not only missed real threats but also flooded us with a ton of false positives.
I realized that to really get ahead of these scams, we needed the system to truly “understand” the full meaning of each message, taking into account context, intent, and nuance. By doing this, we could quickly spot suspicious messages as they came in and warn users in real time, which would help us effectively prevent fraud.
So, I decided to revamp the original FAQ-search system of AskSense into a full scam detection pipeline! When messages come in, we first convert them into semantic vectors using a multilingual SBERT model. These vectors are then processed by a trained classifier or compared against a similarity threshold to find anything that falls into a “high-risk” area. To keep up with new scam tactics, I added an unsupervised sentence-clustering step. This means that when new suspicious messages come in, we can group and manually label them to improve our detection.
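To make that first stage concrete, here’s a minimal sketch, assuming the multilingual MiniLM model I settle on later in the series and a purely illustrative similarity threshold of 0.75 (the real pipeline also runs a trained classifier and the clustering step described above):

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual SBERT encoder (the model family I ended up with; see post 4)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# A few known scam messages; in practice these come from labeled support tickets
known_scams = [
    "Please verify your account security immediately",
    "You've been issued a refund; click this link",
]
scam_embeddings = model.encode(known_scams, convert_to_tensor=True)

def flag_if_suspicious(message: str, threshold: float = 0.75) -> bool:
    """Flag a message whose best match against known scams crosses the threshold."""
    embedding = model.encode(message, convert_to_tensor=True)
    best_score = util.cos_sim(embedding, scam_embeddings).max().item()
    return best_score >= threshold

print(flag_if_suspicious("Confirm your account details now to avoid suspension"))
```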
Throughout the process, I've been continuously tweaking the similarity thresholds, optimizing the number of clusters, and carefully keeping track of every false positive and false negative. This helps ensure that our model can learn and improve over time, adapting to real-world conditions. It’s an ongoing journey, but I’m excited about the progress we’re making!
Series Outline: Eight Key Chapters
This series will unfold across eight posts, each peeling back a layer of AskSense’s inner workings—from the initial spark to cloud deployment—documenting every technical choice and personal insight:
| Post | Title | Core Focus |
|---|---|---|
| 1 | Problem & Motivation | The real pain points and user needs that inspired the project |
| 2 | First Steps in Vectorization | Experimenting with Bag-of-Words, TF-IDF, and hitting their limits |
| 3 | Initial Embedding Experiments | Wrestling with the massive Word2Vec model and first trials with lightweight SBERT |
| 4 | Model Selection Decisions | Balancing performance, accuracy, memory, and multilingual support—why I settled on the MiniLM family |
| 5 | Multilingual Challenges | Supporting Traditional Chinese FAQs with jieba tokenization, OpenCC conversion, and multilingual models |
| 6 | Optimization & Refactoring | From “20-second cold starts” to “dependency hell”—introducing caching, virtual environments, and incremental loading |
| 7 | From CLI to Cloud: Deployment | Real-world practice with Streamlit Cloud, CI/CD pipelines, environment-variable protection, and logging |
| 8 | Future & Expansion | FAISS vector indexing, user-feedback loops, hybrid search architectures, and strategic roadmaps |
Tech Stack & Tool Choices: From spaCy to SBERT
In the beginning, I used spaCy’s en_core_web_sm for POS tagging and simple vector averaging, but it barely scratched the surface of meaning. Next, I tried Gensim’s Word2Vec–Google News 300: a 1.5 GB beast that would gobble up all my memory and constantly trigger OOM errors. Those painful lessons taught me that choosing the right model—balancing accuracy and execution speed—is the key to real-world success. I then switched to “lightweight” Sentence-BERT: first the English all-MiniLM-L6-v2, then the multilingual paraphrase-multilingual-MiniLM-L12-v2. Each provides powerful semantic embeddings at a fraction of Word2Vec’s footprint.
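For a feel of the difference, here’s a quick sketch contrasting the averaged-token vectors I started with against a dedicated sentence encoder (assuming spaCy with en_core_web_sm and sentence-transformers are installed):

```python
import spacy
from sentence_transformers import SentenceTransformer

# Early approach: en_core_web_sm has no true word vectors, so doc.vector is just an
# average of context tensors -- cheap, but it captures very little sentence meaning
nlp = spacy.load("en_core_web_sm")
doc_vector = nlp("How do I reset my password?").vector

# Later approach: a compact sentence encoder trained specifically for semantic similarity
sbert = SentenceTransformer("all-MiniLM-L6-v2")
sentence_vector = sbert.encode("How do I reset my password?")

print(doc_vector.shape, sentence_vector.shape)  # e.g. (96,) vs. (384,)
```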
Data Processing & Multilingual Challenges: Supporting Traditional Chinese
While working on extending AskSense to support Traditional Chinese, I encountered my first big challenge: Chinese word segmentation and script normalization. Unlike English, which naturally uses spaces between words, Chinese text needs a tool like jieba for tokenization. I found that jieba’s default dictionary didn’t include many terms specific to Taiwan, and as a result, common place names like “臺灣” were mistakenly split into “臺” and “灣.” To solve this, I added specific terms to jieba’s custom dictionary, including names like “臺灣,” “臺北市,” “高雄,” and “詐騙,” so that these would all be recognized as single tokens.
Next, I set up a preprocessing pipeline that used OpenCC to convert Simplified Chinese to Traditional, ensuring all texts were standardized before segmentation. This step was crucial in preventing the same concept from appearing as different tokens just because of script variations.
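Put together, the preprocessing step looks roughly like this (a sketch assuming the opencc-python-reimplemented package, where OpenCC("s2t") handles the Simplified-to-Traditional conversion; the custom terms are the ones mentioned above):

```python
import jieba
from opencc import OpenCC

# Normalize script first: Simplified Chinese -> Traditional Chinese
s2t = OpenCC("s2t")

# Taiwan-specific terms that jieba's default dictionary would otherwise split apart
for term in ["臺灣", "臺北市", "高雄", "詐騙"]:
    jieba.add_word(term)

def preprocess(text: str):
    """Convert script, then segment into tokens."""
    return jieba.lcut(s2t.convert(text))

print(preprocess("臺灣的用戶最近常收到詐騙訊息"))
# '臺灣' and '詐騙' now come out as single tokens instead of being split
```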
With segmentation and script normalization sorted out, I moved on to mapping those tokens into semantic vectors. I found that the paraphrase-multilingual-MiniLM-L12-v2 model did a great job at turning full sentences into dense embeddings, but its performance really relied on quality tokenization. To enhance this, I introduced some filtering rules to the pipeline to get rid of overly short tokens (like “的” or “了”) and excessively long ones (over five characters). This way, I could focus on nouns, verbs, and adjectives—the words that carry the main meaning. I also tried out various stop-word lists to filter out common function words.
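The filtering rules themselves are simple; here’s an illustrative version (the stop-word set and length cut-offs below are placeholders for the lists I actually tuned):

```python
# Tiny sample stop-word set; the real list went through several iterations
STOPWORDS = {"的", "了", "嗎", "呢", "是", "在"}

def filter_tokens(tokens):
    """Keep content-bearing tokens: drop stop words, single characters, and strings over five characters."""
    return [
        token for token in tokens
        if token not in STOPWORDS
        and 1 < len(token) <= 5
        and not token.isspace()
    ]

print(filter_tokens(["如何", "重設", "我", "的", "密碼", "?"]))  # -> ['如何', '重設', '密碼']
```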
After making these adjustments, I tested the system with the same pairs of questions (like “如何重設我的密碼?” vs. “我要怎麼修改登入密碼?”) to measure similarity scores in both Chinese-only and mixed Chinese-English settings. Through many little tweaks, AskSense achieved semantic matching accuracy in Traditional Chinese that is as impressive as its English performance, allowing it to effectively identify paraphrases and distinguish between semantically similar yet intent-different queries across languages.
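The measurement itself is a one-liner once the sentences are embedded; a sketch with that same question pair, again assuming the multilingual MiniLM model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

q1 = "如何重設我的密碼?"
q2 = "我要怎麼修改登入密碼?"

emb1, emb2 = model.encode([q1, q2], convert_to_tensor=True)
score = util.cos_sim(emb1, emb2).item()
print(f"cosine similarity: {score:.3f}")  # paraphrases should score well above unrelated pairs
```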
Performance Bottlenecks & Solutions: Memory, Vectors, Fast Similarity
At first, every startup had to re-encode thousands of FAQ sentences, which made the cold-start time around 20 seconds—a bit long! To make things smoother, I introduced incremental vector caching. In the first run, we serialize the embeddings, and then in the following launches, we can read the entire array all at once, which is super efficient. I also took advantage of parallel serialization, which helped cut the cold-start time down to under 2 seconds—much better! For performing similarity searches on a larger scale, I integrated FAISS (I'll dive deeper into that in post 8), allowing us to retrieve information in just milliseconds, even with millions of vectors. Pretty cool, right?
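A stripped-down version of the caching idea, assuming a hypothetical faq_embeddings.npy cache file (the real version also parallelizes the serialization step):

```python
import os
import numpy as np
from sentence_transformers import SentenceTransformer

CACHE_PATH = "faq_embeddings.npy"   # placeholder cache location
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def load_or_build_embeddings(faq_sentences):
    """First run: encode everything and serialize. Later runs: read the whole array at once."""
    if os.path.exists(CACHE_PATH):
        return np.load(CACHE_PATH)               # warm start: back in well under a second
    embeddings = model.encode(faq_sentences, batch_size=64, show_progress_bar=True)
    np.save(CACHE_PATH, embeddings)
    return embeddings
```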
Deployment & Security: Env Vars, CI/CD, and Cloud Deployments
Streamlit Cloud makes deploying your projects super easy with its “one-click” feature! Behind the scenes, though, there's a solid CI/CD pipeline and safeguards for environment variables to keep everything running smoothly. I love using GitHub Actions for automated testing and linting before I push any updates. Plus, I have a nifty checker in place to make sure no API keys slip through the cracks. To keep things running seamlessly, I even put together a Health-Check script that monitors the service and automatically restarts any stuck processes. It really helps reduce deployment headaches!
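The health-check script is nothing fancy; here’s a minimal sketch of the kind of watchdog loop I mean, with a placeholder URL and restart command (how you actually restart a stuck process depends on where the service runs):

```python
import subprocess
import time
import requests

APP_URL = "https://asksense.example.com/"          # placeholder: the deployed app's URL
RESTART_CMD = ["bash", "restart_asksense.sh"]      # placeholder: whatever restarts your process

def watchdog(interval_seconds: int = 60) -> None:
    """Ping the service on a schedule and restart it if it stops answering."""
    while True:
        try:
            healthy = requests.get(APP_URL, timeout=10).status_code == 200
        except requests.RequestException:
            healthy = False
        if not healthy:
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(interval_seconds)

if __name__ == "__main__":
    watchdog()
```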
Personal Journey: Late-Night Debugging, Failures, and Problem-Solving
When I noticed our similarity scores hovering between 0.85 and 0.87, I felt like I was stuck in a never-ending dilemma between “parameter tuning and changing up the architecture.” So, I decided to give a TF-IDF threshold filter a try before calculating cosine similarity. I thought it might help minimize the influence of common, low-value words and improve our final scores. Unfortunately, it didn’t really do the trick; sometimes, we ended up filtering out important terms, which actually made the matches worse!
I rolled up my sleeves and started testing thousands of parameter combinations. I played around with things like splitting the corpus, adjusting stop-word lists, and filtering by parts of speech. Let’s just say I found myself working late into the night more often than I’d like to admit! The big takeaway from this experience was that while parameter tuning can be helpful, it can also be a bit of a maze. Without clear success metrics and taking small steps, it’s easy to feel like you’re making progress when you might still be in the dark about what’s really working.
This all led me to embrace A/B testing and balanced precision-recall monitoring. Now, every change we make is based on solid data instead of guesswork, and that feels so much better!
Another time, I experimented with a hybrid retrieval strategy that used Bag of Words (BoW) for quick keyword recall and SBERT for fine-tuning the results. The idea was to take advantage of BoW’s speed to narrow down our options, then let SBERT do its magic for precise ordering. However, the results turned out to be nearly the same as using SBERT on its own, and in some cases, they were actually a bit worse. I learned that just stacking features doesn’t guarantee better performance—it can also add complexity and noise.
This prompted me to shift my focus. I started to figure out when it was best to simplify by removing unnecessary steps and when to add new components only after we had a solid core pipeline. By validating each change with measurable metrics, I ultimately found a great balance between model quality and engineering efficiency.
What’s Next: FAISS, Feedback Loops, and Hybrid Search
In my eighth post, I’m excited to share how I integrated FAISS vector indexing into the AskSense search pipeline, turning what used to be a lengthy scan through thousands of vectors into lightning-fast lookups that happen in just milliseconds! I started by using FAISS for inner-product indexing and then layered on an Inverted File (IVF) structure whose coarse quantizer groups all sentence embeddings into hundreds of centroids. This means that when you make a query, AskSense only has to search in the most relevant clusters, which really speeds things up!
To avoid those annoying cold-start delays, I saved both the index and its mapping tables in a binary format and stored them in the cloud. This way, the app can load everything directly when it starts up. This change has boosted search speeds by over a hundredfold and keeps AskSense responsive, even when there’s a surge in user queries.
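Here’s a condensed sketch of that indexing flow, assuming 384-dimensional MiniLM embeddings, a few hundred centroids, and placeholder file names for the cached embeddings and the persisted index; normalizing the vectors makes the inner-product metric behave like cosine similarity:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

d, nlist = 384, 256                                 # MiniLM embedding size, number of IVF centroids
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Offline build step: train the IVF index on the cached FAQ embeddings (it needs at least a few
# thousand vectors for the centroids to be meaningful), then persist it as a binary file.
embeddings = np.load("faq_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)
index.add(embeddings)
faiss.write_index(index, "asksense.index")          # placeholder file name

# App startup: load the prebuilt index and search only the most relevant clusters.
index = faiss.read_index("asksense.index")
index.nprobe = 8                                    # how many of the nlist clusters to probe per query
query = model.encode(["如何重設我的密碼?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                # top-5 matches in milliseconds
```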
On top of that, I set up a “dynamic retraining” workflow that taps into real user feedback. Whenever someone selects an answer and rates it as correct or incorrect, the system records that feedback along with the original sentence in a database. At regular intervals, we use these labeled examples to fine-tune the SBERT model through batch training, which creates new embeddings and updates the FAISS index.
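The feedback-capture side can be as simple as a small table; here’s an illustrative sketch with a made-up SQLite schema (a periodic job then reads these rows, fine-tunes SBERT on them, re-encodes the corpus, and rebuilds the FAISS index):

```python
import sqlite3

conn = sqlite3.connect("feedback.db")               # placeholder database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS feedback (
        query       TEXT,
        matched_faq TEXT,
        is_correct  INTEGER,
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def record_feedback(query: str, matched_faq: str, is_correct: bool) -> None:
    """Store each user rating as a labeled example for the next fine-tuning batch."""
    conn.execute(
        "INSERT INTO feedback (query, matched_faq, is_correct) VALUES (?, ?, ?)",
        (query, matched_faq, int(is_correct)),
    )
    conn.commit()
```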
I also had fun experimenting with a hybrid retrieval setup. I used TF-IDF for quick initial retrieval to weed out any obviously unrelated candidates, then turned to SBERT for more precise re-ranking. Through A/B testing, I fine-tuned the balance between recall and precision to nail down the best approach. The end result? AskSense has evolved from a basic Q&A tool into a smart assistant that learns from your behavior and gets better with every deployment!
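A toy version of that hybrid setup, with a three-sentence corpus and character n-gram TF-IDF standing in for the real first-pass retriever (the recall depth and the models are exactly the knobs the A/B tests tuned):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util

faq = ["如何重設我的密碼?", "如何申請退款?", "怎麼聯絡客服?"]   # tiny illustrative corpus

# Stage 1 index: character n-gram TF-IDF, cheap to query
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
faq_tfidf = vectorizer.fit_transform(faq)

# Stage 2 encoder: SBERT embeddings for precise re-ranking
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
faq_emb = sbert.encode(faq, convert_to_tensor=True)

def hybrid_search(query: str, recall_k: int = 2) -> str:
    # Stage 1: shortlist the recall_k most lexically similar FAQs
    q_tfidf = vectorizer.transform([query])
    candidates = np.argsort(-cosine_similarity(q_tfidf, faq_tfidf)[0])[:recall_k].tolist()
    # Stage 2: re-rank only the shortlist with SBERT
    q_emb = sbert.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, faq_emb[candidates])[0]
    return faq[candidates[int(scores.argmax())]]

print(hybrid_search("我要怎麼修改登入密碼?"))
```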
In this exciting eight-post series, I’m thrilled to share real-life development stories, some of the tough debugging moments, and detailed technical insights that went into creating AskSense. I can’t wait to connect with all of you! Please feel free to join the conversation and share your own experiences with NLP—whether it's breakthroughs or challenges—in the comments. Let’s learn and grow together!