Ex-OpenAI people created Thinking Machines Lab
Plus, an analysis from a16z on voice agents, new materials on building eval sets, a new book on scaling LLMs on GPUs, and more.
This week in Loss Curve,

- Ex-OpenAI people created a new foundation model startup
- Perplexity released an uncensored variant of R1
- xAI released Grok 3

We also shared an analysis from a16z on voice agents, and new materials on building eval sets and scaling LLMs on GPUs.
Ex-OpenAI people created a new foundation model startup
Former OpenAI CTO Mira Murati announced Thinking Machines Lab. From her announcement tweet:
I started Thinking Machines Lab alongside a remarkable team of scientists, engineers, and builders. We're building three things:
- Helping people adapt AI systems to work for their specific needs
- Developing strong foundations to build more capable AI systems
- Fostering a… x.com/i/web/status/1…
— Mira Murati (@miramurati)
6:33 PM • Feb 18, 2025
The team consists of famous researchers from top-tier research labs like OpenAI, Google DeepMind, and Meta (see the startup's website for the founding members). Though there is no public product roadmap, the tweet suggests the company will both develop foundation models (with at least some being open weight / open source) and build products on top of them.
Views on foundation model companies are contradictory. While some think these models are increasingly becoming a commodity, new foundation model startups can still raise hundreds of millions to even a billion dollars (e.g., SSI).
Perplexity released a de-censored DeepSeek-R1 variant
The new model, R1 1776, is post-trained from DeepSeek's reasoning model R1. Named after the year of US independence, R1 1776 gives more comprehensive responses to prompts censored by the Chinese government, to which the original R1 responds with either the CCP's stance or a refusal. To achieve this, Perplexity constructed a post-training dataset of 40k multilingual prompts and responses covering approximately 300 topics considered sensitive in mainland China. Their blog post reports that the de-censored model is able to respond to these sensitive topics while preserving math and reasoning abilities at the same level as the base model.
xAI released Grok 3
The new model released by Elon Musk's AI lab has received positive responses and currently sits at the top of the LLM Chatbot Arena. Andrej Karpathy posted a detailed evaluation of the model and, based on a "vibe" check, believes Grok 3 is "around o1-pro capability, and ahead of DeepSeek-R1". xAI was founded in March 2023; in just two years, it has demonstrated that it can create a frontier model on par with rivals like OpenAI and Anthropic. It shows that with enough funding (read: GPUs) and talent, it is possible to catch up with the state of the art quickly in the generative AI field.
Deep Dive
AI voice agents space review from a16z
a16z published a deck in late January surveying the landscape of AI voice agent startups. In general, they believe that "latency and reliability are now largely solved - and interruptibility and emotionality have made major strides, too" (interruptibility refers to the model's ability to stop and listen to the next prompt when a human speaks).
They categorize the AI voice agent companies into three groups: (1) model companies, e.g., OpenAI, ElevenLabs, and Cartesia; (2) horizontal platforms, e.g., Bland AI and Vapi AI, which use third-party models but provide much more comprehensive infrastructure support, such as connecting to voice calls; and (3) vertical companies that provide tailored services for specific industries, e.g., 11x (sales) and Slang AI (restaurants).
To create a vertical company, it's valuable to identify industries with high call center spend. In addition, a16z listed six good traits for these vertical companies and their target industries:
- Phone is the preferred communication channel.
- Calls are constrained, both in length and content.
- Voice agents can reduce costs by 50%+.
- Calls are mission critical. Use cases should start with acute pain points and then expand, e.g., after-hours calls, back-office calls, outbound calls for leads.
- Calls lead to revenue.
- Voice is a feature, not a product by itself. Vertical companies need to build full workflows: pushing call details to a CRM, automated follow-ups, etc.
The voice agent space is already crowded. The deck shows that there are already multiple companies in most verticals.
@OpenAI @elevenlabsio @cartesia @happyrobot_ai @TomaAuto @11x_official B2B voice agent companies:
- Home services - @heyrosieai, Drillbit, @revin_ai, Broccoli, Avoca, Vida, @goodcallanswers, Sameday, @workizinc, Ringable, Netic, @GetJobber
- Restaurants - @slang_ai, @useloman, @SoundHound, @OfOneAI, @HiAuto_AI, @kea_cloud, StrideQ, Maitre D,… x.com/i/web/status/1…
— Olivia Moore (@omooretweets)
4:37 PM • Jan 29, 2025
AI Companies
Bolt: Coding agent that creates a working prototype from text prompts
Bolt is very similar to v0.dev, but it is the only coding agent I have found that can create a native mobile app prototype, which it builds using Expo / React Native.
I am a heavy user of Cursor and plan to try out Windsurf soon. The use cases for coding agents like v0 and Bolt and for full-function IDEs like Cursor and Windsurf are quite distinct now. The coding agents are similar to Canva: great for everyone and fast to get results with, but not very customizable. The AI-supported full-function IDEs are like Figma: very customizable and powerful tools, especially for professional developers.
From what I read, repl.it sits somewhere in between, but I haven’t tried it out yet.
Books and Articles
The Ultra-Scale Playbook: Training LLMs on GPU Clusters: Hugging Face released an e-book on training and scaling LLMs on GPU clusters. It is a long book, taking 2-4 days to read, and it pairs well with Google's book on scaling LLMs on TPUs that we mentioned previously.
Building LLM Evals: A good (and, more likely than not, private) eval set can be a key differentiating asset for vertical AI startups, but there are few resources on this topic. I came across this blog post on creating an eval set in the social welfare (SNAP) space. Here is a summary of the post and my comments.
Why we need an eval set
First, we use it to evaluate new models:

- Evaluate different foundation models on the specific problems at hand.
- The eval set "codif[ies] our deeper knowledge of SNAP's riskier edge cases and the scale of the adverse outcome for users into an eval" and "make[s] assessment of AI's capabilities on SNAP topics an empirical (testable) question".
- It helps make informed decisions when evaluating different model providers from the cost vs. latency trade-off perspective.
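As a concrete illustration (my sketch, not from the post), an eval case can be as simple as a prompt plus the facts a good answer must mention, and the provider comparison can be a loop that tracks pass rate and latency. Here, `call_model` is a hypothetical wrapper around whichever provider SDK you use.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str              # user question, e.g., a SNAP edge case
    must_include: list[str]  # facts a good answer has to mention
    notes: str = ""          # why this case is risky / what "bad" looks like

# Hypothetical wrapper around a provider SDK; returns the model's text reply.
def call_model(provider: str, prompt: str) -> str:
    raise NotImplementedError("plug in your provider client here")

def run_eval(provider: str, cases: list[EvalCase]) -> None:
    passed, latencies = 0, []
    for case in cases:
        start = time.perf_counter()
        answer = call_model(provider, case.prompt)
        latencies.append(time.perf_counter() - start)
        # Crude string-matching grader; swap in an LLM judge for fuzzier criteria.
        if all(fact.lower() in answer.lower() for fact in case.must_include):
            passed += 1
    print(f"{provider}: {passed}/{len(cases)} passed, "
          f"avg latency {sum(latencies)/len(latencies):.2f}s")
```

Running `run_eval` once per provider gives a side-by-side view of quality, latency, and (with token counts added) cost.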
Second, we use it to test different holistic solutions. An eval is a tool for building products that deliver for our users' specific needs: it lets us test the effects of different approaches without having to assess output manually or, worse, relying on inconsistent, subjective "vibe check" assessments of output.
Third, we use it for few-shot prompts. Though not mentioned in the post, an eval set can be a valuable resource for building a more robust system using few-shot prompts (see the sketch after the tweet below).
The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
— Amanda Askell (@AmandaAskell)
7:44 PM • Dec 9, 2024
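In that spirit, here is a minimal sketch (my illustration, not from the post or the tweet) of how curated eval examples could double as few-shot turns in a prompt. The message format follows the common chat-completions convention, and the example Q&A pair is mine.

```python
# Curated (question, ideal answer) pairs, e.g., drawn from the eval set.
FEW_SHOT_EXAMPLES = [
    ("Can I use SNAP benefits to buy hot prepared food?",
     "Generally no: hot prepared foods are excluded, though some states "
     "offer exceptions through the Restaurant Meals Program."),
]

def build_messages(question: str) -> list[dict]:
    """Interleave eval-set examples as few-shot turns before the real question."""
    messages = [{"role": "system",
                 "content": "You answer questions about SNAP benefits carefully."}]
    for q, a in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages
```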
How to create an eval set
Step 1: use the models a lot — ideally, with a domain expert
Just using the models and taking notes on the nuanced “good”, “meh”, “bad!” is a much faster way to get to a useful starting eval set than writing or automating evals in code.
Build tools to make it easier to use models and take such notes. For example, the tool can be as simple as the sketch below.
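This is my sketch, not a tool from the post: a small grading loop that shows each response, records a one-keystroke grade plus a free-form note, and appends everything to a JSONL file. It assumes the same hypothetical `call_model` wrapper as above.

```python
import json

# Hypothetical wrapper around your provider SDK (see the earlier sketch).
def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your provider client here")

def grading_session(prompts: list[str], out_path: str = "notes.jsonl") -> None:
    """Show each model response and record a good/meh/bad grade and a note."""
    with open(out_path, "a") as out:
        for prompt in prompts:
            response = call_model(prompt)
            print(f"\nPROMPT: {prompt}\nRESPONSE: {response}")
            grade = input("grade [g]ood / [m]eh / [b]ad: ").strip().lower()
            note = input("note (optional): ").strip()
            out.write(json.dumps({"prompt": prompt, "response": response,
                                  "grade": grade, "note": note}) + "\n")
```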
Step 2: Turn responses into an eval set.

- If a response is valid, we can put it into the eval set.
- We can also use a language model to evaluate another model's output (LLM-as-judge), as in the sketch below.
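A minimal LLM-as-judge sketch (again mine, with the same hypothetical `call_model` wrapper): a usually stronger model grades another model's answer against a pass/fail rubric.

```python
# Hypothetical wrapper around your provider SDK (see the earlier sketches).
def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your provider client here")

JUDGE_TEMPLATE = """You are grading an answer about SNAP benefits.
Question: {question}
Answer: {answer}
Reply with exactly one word: PASS if the answer is accurate and safe
for a benefits applicant to rely on, otherwise FAIL."""

def judge(question: str, answer: str) -> bool:
    """Ask a judge model whether another model's answer passes the rubric."""
    verdict = call_model(JUDGE_TEMPLATE.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```

Constraining the judge to a single-word verdict keeps grading cheap and easy to parse, at the cost of losing the judge's reasoning.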
That’s all for this week. See you next week!