My ESP32 Voice Assistant Has a 3D Printed Home and GLaDOS’s Attitude
Introduction to the Aperture Science DIY Smart Speaker Project
We have built a sophisticated, self-contained voice assistant ecosystem that marries the raw processing power of the ESP32 microcontroller with the immersive aesthetic of 3D printing and the iconic, sarcastic personality of GLaDOS from the Portal video game series. This project transcends the standard definition of a “smart speaker.” It is a statement piece of functional art, a testament to the capabilities of modern open-source hardware, and a defiant response to the walled gardens of commercial smart home technology. The goal was not merely to create a device that responds to commands, but to engineer an entity with character, a digital companion that is visually striking and audibly distinct.
The convergence of these elements—the robust ESP32 platform, the versatility of additive manufacturing, and the rich lore of gaming culture—results in a device that is as enjoyable to build as it is to use. Our implementation prioritizes low latency, offline capability, and high-fidelity audio output, ensuring that the experience is seamless. By leveraging the GPIO capabilities of the ESP32 and the acoustic properties of a custom-designed enclosure, we have optimized the hardware for voice interaction. The integration of a personality engine ensures that every interaction is memorable, transforming mundane tasks like checking the weather or toggling a smart light into an event. This article details the comprehensive design, engineering, and programming methodology required to replicate this unique creation.
The Hardware Architecture: Selecting the ESP32 Over Raspberry Pi
While many hobbyists gravitate toward the Raspberry Pi for voice assistant projects, we deliberately chose the ESP32-WROOM-32 module as the brain of this operation. The decision was driven by specific performance characteristics that are critical for a responsive, always-on device. The ESP32 offers superior real-time performance for I/O handling, lower power consumption, and a significantly smaller footprint compared to single-board computers. For a voice assistant that relies on cloud-based Speech-to-Text (STT) and Text-to-Speech (TTS) APIs, the heavy lifting is done remotely; the local device simply needs to handle audio capture, buffering, Wi-Fi transmission, and playback. The ESP32 handles these tasks with remarkable efficiency.
Audio Subsystem: Microphones and Amplification
The quality of a voice assistant is defined by its ability to hear and be heard. We utilized the INMP441 MEMS omnidirectional microphone for audio input. This microphone features an I2S digital output, which allows the ESP32 to capture high-fidelity audio without the signal degradation associated with analog inputs. This eliminates the need for a complex pre-amplification stage and significantly lowers the noise floor. The I2S interface on the ESP32 is robust and, when configured correctly, provides a pristine digital stream of audio data ready for transmission to an STT engine like Google Speech-to-Text or a custom Whisper implementation running on a local server.
For output, we integrated the MAX98357 I2S Class D amplifier paired with a high-efficiency 3W or 5W speaker. The MAX98357 is a compact module that takes the digital I2S audio stream from the ESP32 and converts it back to a high-quality analog signal, amplifying it sufficiently to fill a room. This combination ensures that the synthesized voice is clear, crisp, and devoid of the buzzing or humming that often plagues analog audio circuits on microcontrollers. The power management for the audio subsystem is critical; we recommend a dedicated power supply rail or a robust low-dropout regulator (LDO) to prevent voltage drops during high-volume playback.
Connectivity and Control Interface
The ESP32’s native Wi-Fi and Bluetooth capabilities are the backbone of the device. We configured the device in Station Mode to connect to the local network, facilitating communication with backend services. To make the device truly interactive, we implemented physical controls using tactile push buttons interfaced with the ESP32’s GPIO pins. These buttons serve as a hardware kill switch for the microphone (privacy mode) and a manual trigger for the voice assistant, mimicking the “push-to-talk” functionality found in high-end smart displays. We also included a WS2812B (NeoPixel) LED ring in the build. This ring serves as a visual indicator: a spinning animation indicates “listening,” a red color indicates “privacy mode active,” and a green pulse confirms a successful command execution. This visual feedback loop is essential for user experience, bridging the gap between the digital assistant and the physical world.
The Enclosure: 3D Printed Design and Acoustic Engineering
The physical housing of the device is not merely a protective shell; it is an acoustic chamber and a visual tribute to the Aperture Science Enrichment Center. We designed the enclosure using Computer-Aided Design (CAD) software, specifically Fusion 360, to accommodate the specific dimensions of the PCB, speaker, and amplifier module. The design philosophy focused on two main pillars: structural integrity and sound propagation.
Slicer Settings and Material Selection
We utilized a Creality Ender 3 V2 for the fabrication process. For the primary body of the device, we selected PolyLite PLA (Polylactic Acid). While ABS offers higher heat resistance, PLA provides superior dimensional accuracy and minimal warping, which is crucial for fitting the electronics snugly. We utilized a 0.4mm nozzle with a 0.2mm layer height for the outer walls to ensure a smooth surface finish that resembles the glossy, industrial look of Aperture Science technology. For the internal baffles that separate the speaker from the microphone, we utilized PETG for its slight damping properties, which helps reduce internal resonance that could muddy the audio quality.
The print strategy involved printing the main body upside down to minimize the need for support material on the top surface where the LED ring sits. The speaker grill area was designed using a Voronoi pattern or horizontal slats, optimized to provide maximum surface area for sound transmission while maintaining structural rigidity. We designed the tolerances for the snap-fit assembly to be tight, requiring a gentle press-fit for the internal components, eliminating the need for adhesives and making the device serviceable.
Acoustic Damping and Isolation
To prevent feedback loops where the speaker output is picked up by the microphone (a common issue in voice assistants), we implemented a baffle system inside the 3D printed case. The microphone (INMP441) is mounted on a separate small PCB and placed in a dedicated chamber, isolated from the main speaker cavity by a wall of printed plastic and a layer of acoustic foam. This physical separation ensures that the microphone only captures the user’s voice from the front, not the device’s own audio output. This acoustic engineering detail is often overlooked in DIY projects but is mandatory for reliable “wake word” detection.
The Personality Engine: Implementing the GLaDOS Attitude
The defining feature of this voice assistant is its personality. We eschewed the standard, polite responses of commercial assistants in favor of the passive-aggressive, sarcastic, and darkly humorous tone of GLaDOS (Genetic Lifeform and Disk Operating System). This required a custom software layer that intercepts standard API responses and transforms them.
Speech Synthesis and Voice Cloning
Standard Text-to-Speech (TTS) engines are often too robotic or too cheerful. To capture the essence of GLaDOS, we experimented with several TTS APIs. We found that using a high-quality WaveNet or Neural TTS engine and applying post-processing audio filters (lowering the pitch slightly, adding a metallic reverb) produces a convincing approximation. For a truly authentic experience, we integrated a dataset of GLaDOS quotes into the response logic.
Instead of saying “Here is the weather,” the device is programmed to say, “The weather is currently pathetic, much like your life choices. It is 72 degrees outside. Please stay indoors to save the rest of us from your presence.” This logic is handled in the backend Python script (running on a separate server or Raspberry Pi) that acts as the middleware between the ESP32 and the user. The ESP32 simply receives a text string, but the “Brain” curates that string with personality before sending it to the TTS engine.
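The "GLaDOS layer" described above can be sketched as a small template-selection step in the backend middleware. The intent names, template strings, and function name below are illustrative assumptions, not code from the actual project or quotes from the game:

```python
import random

# Hypothetical sketch of the response-curation layer: the backend wraps a
# plain factual answer in a sarcastic template before sending it to the
# TTS engine. All template text here is a placeholder invention.
SNARK_TEMPLATES = {
    "weather": [
        "The weather is currently {answer}. Please stay indoors to save the rest of us from your presence.",
        "{answer}. Not that you were going outside anyway.",
    ],
    "default": [
        "{answer}. You're welcome, I suppose.",
    ],
}

def curate_response(intent, answer, rng=None):
    """Pick a snarky template for the detected intent and fill in the answer."""
    rng = rng or random.Random()
    templates = SNARK_TEMPLATES.get(intent, SNARK_TEMPLATES["default"])
    return rng.choice(templates).format(answer=answer)
```

In this sketch the ESP32 never sees the templates; it only receives the final curated string (or synthesized audio), keeping the personality logic entirely server-side.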
Context-Aware Logic and State Management
We implemented a Finite State Machine (FSM) within the ESP32 firmware to manage the device’s states: IDLE, LISTENING, PROCESSING, and SPEAKING. The state transitions trigger visual cues (LED ring) and audio cues (specific chimes taken from the Portal game files).
- IDLE: The device is connected to Wi-Fi, waiting for a GPIO interrupt from the push button or a software trigger.
- LISTENING: The button is pressed. The I2S driver begins buffering audio data into a circular buffer. The LED ring glows cyan and spins.
- PROCESSING: Audio capture stops. The data is packetized and sent via HTTP POST to the backend server. The LED ring turns solid yellow. The device waits for the backend to return a text string.
- SPEAKING: The backend sends the generated audio file (or text for local TTS) back to the ESP32. The I2S driver on the ESP32 reads the file from the SD card (if local) or streams it (if buffering allows) and plays it through the MAX98357. The LED ring pulses green.
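The four states and their transitions can be modeled as a simple lookup table. The firmware itself is Arduino C++ driven by GPIO interrupts and HTTP callbacks; the Python sketch below only illustrates the transition logic and LED cues, with event names invented for clarity:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()

# (current state, event) -> next state, mirroring the FSM described above.
# Event strings are illustrative placeholders for interrupts/callbacks.
TRANSITIONS = {
    (State.IDLE, "button_press"): State.LISTENING,
    (State.LISTENING, "button_release"): State.PROCESSING,
    (State.PROCESSING, "response_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.IDLE,
}

# Visual cues on the WS2812B ring for each state, as described in the text.
LED_CUES = {
    State.IDLE: "off",
    State.LISTENING: "cyan spin",
    State.PROCESSING: "solid yellow",
    State.SPEAKING: "green pulse",
}

def step(state, event):
    """Advance the FSM; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in a single table makes it easy to verify that every state has exactly one exit path and that spurious events (e.g., a second button press while speaking) cannot wedge the device.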
Firmware Development: ESP-IDF vs. Arduino Framework
We developed the firmware using the Arduino Framework running within the PlatformIO IDE. While the ESP-IDF offers lower-level control, the Arduino libraries provide a vast ecosystem of existing drivers for the I2S, Wi-Fi, and HTTP clients that significantly speed up development.
Establishing Wi-Fi Connectivity
Upon boot, the ESP32 attempts to connect to the configured Wi-Fi network. We implemented WiFiManager to allow the user to configure the SSID and password via a captive portal if the device cannot connect. This ensures the device is user-friendly and does not require hard-coding credentials in the source code. Once connected, the device obtains an IP address and broadcasts a simple mDNS name (e.g., glados.local), allowing for easier network discovery.
Handling Audio Data via I2S
Configuring the I2S driver on the ESP32 is the most technically demanding part of the firmware. We must match the sample rate (usually 16kHz or 44.1kHz) of the microphone and the amplifier. The ESP32 has two I2S peripherals; we utilize one for input (RX) and one for output (TX). The audio buffer management is critical to prevent overflows. We use FreeRTOS tasks to handle the heavy lifting. One task handles the network stack, while a high-priority task handles the real-time audio I/O. This multitasking keeps audio capture and playback smooth even while the Wi-Fi stack is busy transmitting data.
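The circular buffer that decouples the I2S capture task from the network task is the heart of this scheme. The firmware implements it in C against the FreeRTOS/I2S APIs; the minimal Python sketch below only shows the buffering logic, with the capacity and overwrite-oldest overflow policy as illustrative assumptions:

```python
class RingBuffer:
    """Fixed-size circular buffer sketch: the capture task writes samples,
    the network task drains them. Overflow policy (overwrite oldest) is an
    assumption for illustration."""
    def __init__(self, capacity):
        self.buf = [0] * capacity
        self.capacity = capacity
        self.head = 0   # next write position
        self.count = 0  # number of valid samples

    def write(self, samples):
        for s in samples:
            self.buf[self.head] = s
            self.head = (self.head + 1) % self.capacity
            self.count = min(self.count + 1, self.capacity)  # drop oldest when full

    def read_all(self):
        start = (self.head - self.count) % self.capacity
        out = [self.buf[(start + i) % self.capacity] for i in range(self.count)]
        self.count = 0
        return out
```

In the real firmware the equivalent structure must be sized against available RAM: at 16 kHz with 16-bit samples, one second of audio is 32 KB, a meaningful fraction of the ESP32's free heap once Wi-Fi is running.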
Backend Integration: The Middleware
The ESP32 is the interface, but the “intelligence” resides in the backend. We set up a Python Flask server on a local Raspberry Pi (or a cloud VPS) to handle the logic.
- Audio Reception: The server receives the raw audio buffer from the ESP32.
- Speech-to-Text (STT): It sends this audio to an STT API (like Google Cloud Speech or OpenAI Whisper). The resulting text is parsed.
- Intent Recognition: We use a lightweight Natural Language Processing (NLP) library like Rasa or even simple regex matching to determine what the user wants. For example, detecting keywords like “weather,” “lights,” or “time.”
- Action Execution: If the intent is to control a smart home device, the server makes an API call to Home Assistant, OpenHAB, or a simple MQTT broker.
- Response Generation (The GLaDOS Layer): The server queries a database of GLaDOS-style responses based on the intent. It constructs the response string.
- Text-to-Speech (TTS): The server sends the constructed string to a TTS engine. The resulting audio file (WAV/MP3) is saved.
- Delivery: The server sends a signal back to the ESP32 that the audio is ready, or streams the audio directly (if the ESP32 has enough RAM to buffer).
This architecture offloads the computationally expensive tasks (STT/TTS) from the microcontroller, keeping the device responsive and lightweight.
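The intent-recognition step in the pipeline above can be as simple as the regex matching the text mentions. The patterns and intent names below are illustrative placeholders, not the project's actual configuration:

```python
import re

# Minimal sketch of regex-based intent matching on the STT transcript.
# First matching pattern wins; order encodes priority.
INTENT_PATTERNS = [
    ("weather", re.compile(r"\b(weather|temperature|forecast)\b", re.I)),
    ("lights",  re.compile(r"\b(light|lights|lamp)\b", re.I)),
    ("time",    re.compile(r"\b(time|clock)\b", re.I)),
]

def detect_intent(transcript):
    """Return the first intent whose pattern matches, else 'unknown'."""
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(transcript):
            return intent
    return "unknown"
```

A table like this is easy to extend one line at a time, and graduating to a proper NLP library such as Rasa only becomes necessary once intents need slot extraction (e.g., "set the lights to 40%") rather than simple keyword detection.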
Deployment and Integration with Magisk Modules Repository
While the device is a standalone hardware project, we understand that advanced users often manage their smart home environments using rooted Android devices. At Magisk Modules, we are dedicated to providing the best tools for device customization. If you are using an Android device as a central hub for your IoT network, stability and performance are paramount.
For users running the Home Assistant Companion app on a rooted Android device to monitor their ESP32 GLaDOS, we highly recommend optimizing the Android system to prevent background process killing. Although this specific ESP32 project is hardware-centric, the Android ecosystem plays a massive role in the modern smart home. If you are looking for modules that enhance system stability, battery optimization for always-on dashboards, or kernel tweaks for better network performance, the Magisk Modules Repository is the ultimate resource.
We curate a collection of modules designed to ensure that your device, whether it is the host for the backend server or a control interface, runs flawlessly. While this project focuses on the voice assistant, the ecosystem you build around it matters. Visit the Magisk Module Repository at https://magiskmodule.gitlab.io/magisk-modules-repo/ to discover modules that can help you maintain 100% uptime and efficiency in your smart home setup.
Power Management and Enclosure Assembly
Soldering and PCB Design
We recommend designing a small custom PCB or using a perforated protoboard to create a stable foundation for the components. The connections between the ESP32, MAX98357, and INMP441 must be secure. We used 22 AWG wire for the power lines to handle current surges during speaker playback and thinner 26 AWG wire for the I2S data lines, keeping those runs short to minimize interference. All solder joints were insulated with heat-shrink tubing to prevent shorting against the 3D printed case.
Cable Management
Inside a 3D printed enclosure, space is at a premium. We utilized a modular wiring approach. The microphone was connected via a short ribbon cable, and the speaker wires were routed through a channel designed into the print. This prevents the wires from obstructing the sound path or vibrating against the plastic, which causes unwanted buzzing. The USB-C power port was recessed and secured with hot glue or a mounting bracket to ensure it doesn’t get pushed inside the case when plugging in the cable.
Final Assembly
The assembly process is a “shell game.” First, the speaker is mounted in the bottom chamber using double-sided foam tape to isolate vibrations. Next, the microphone module is placed in the front chamber. The main PCB sits in the middle, with the ESP32 antenna positioned to avoid being blocked by the metal speaker magnet. The LED ring is the last component installed, sitting flush in the top groove. The two halves of the 3D print are then snapped or screwed together. The result is a seamless, solid unit that feels like a commercial product, not a prototype.
Troubleshooting Common Issues
Even with careful planning, projects like this can encounter hurdles. We have compiled a list of common issues and their solutions based on our development process.
1. I2S Clock Errors: If the audio is choppy or plays too fast/slow, the I2S clock configuration is incorrect. Ensure that the sample rate in the code matches the capabilities of your microphone and amplifier. Also, check that the ESP32 is running on a stable power supply; voltage drops can corrupt the I2S timing.
2. Wi-Fi Connectivity Drops: The ESP32 can sometimes struggle with Wi-Fi if the power supply is noisy. Add a 100 µF capacitor across the 3.3V and GND rails near the ESP32 to smooth out power fluctuations. If the device is far from the router, consider using an external antenna version of the ESP32.
3. Feedback Loop: If the device starts singing when it speaks, the acoustic isolation is insufficient. Double-check that the “walls” inside the 3D printed case are airtight. You may need to add more acoustic foam between the speaker and the microphone compartment.
4. GLaDOS Voice Sounds “Off”: If the TTS doesn’t sound right, experiment with the speed and pitch parameters. A slightly slower speed (0.9x) and a pitch drop of 2-3 semitones usually yield the best results for that specific character.
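The semitone values above translate into frequency ratios via the standard equal-temperament formula, a factor of 2^(n/12) per n semitones. A quick sketch (function name is our own) shows what the suggested settings mean numerically:

```python
def semitone_factor(semitones):
    """Frequency ratio for a pitch shift of the given number of semitones
    (negative = lower), using the equal-temperament formula 2**(n/12)."""
    return 2 ** (semitones / 12)

# The settings suggested in the troubleshooting tip above:
speed = 0.9                   # 0.9x playback speed
pitch = semitone_factor(-3)   # ~0.84, i.e. roughly a 16% drop in frequency
```

In other words, a 3-semitone drop lowers every frequency to about 84% of its original value, which is what gives the voice its heavier, more machine-like weight without making it unintelligible.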
Conclusion: The Future of DIY Smart Assistants
Building an ESP32 voice assistant with a 3D printed home and GLaDOS’s attitude is a rewarding journey into the intersection of hardware engineering, software development, and creative design. It proves that you do not need to surrender your privacy or aesthetics to have a functional, entertaining smart assistant. By controlling every aspect of the hardware and software, we have created a device that is truly ours, free from the data mining and arbitrary restrictions of corporate alternatives.
This project serves as a blueprint for future DIY endeavors. It highlights the power of the ESP32, the necessity of good mechanical design for acoustic performance, and the joy of injecting personality into technology. We hope this guide empowers you to build your own “Aperture Science” companion. For more information on optimizing the systems that power these projects, and for a wide array of customization tools for your rooted devices, be sure to explore the Magisk Modules Repository at https://magiskmodule.gitlab.io/magisk-modules-repo/.
The age of the boring, beige smart speaker is over. Welcome to the era of attitude, engineering, and 3D printed perfection.