You can definitely get OpenAI's new open-source GPT models (gpt-oss) running with vLLM, but it is pricey. vLLM is built for production workloads and is very fast, with FlashAttention and other heavy optimizations, but for these models it currently only runs on NVIDIA H100s. You can rent H100s on Amazon EC2 for about $55 an hour, but you have to reserve a full day, which puts you well over $1,000 just to test things out.
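For reference, if you did have an H100 on hand, serving with vLLM is roughly a two-command affair. This is a minimal sketch based on vLLM's standard CLI; the exact flags and the Hugging Face model identifier (openai/gpt-oss-20b) are worth verifying against the current vLLM docs:

```bash
# Install vLLM, then serve the 20B gpt-oss model behind an
# OpenAI-compatible HTTP API (port 8000 by default).
pip install vllm
vllm serve openai/gpt-oss-20b
```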
That is too expensive for most people, especially when cheaper GPUs are available. The more practical approach for most is Ollama, which runs well on both consumer and server GPUs and handles OpenAI's open-source models without a problem. For this setup, your best bet on AWS is to search for the PyTorch Amazon Machine Image (AMI) and pick the Ubuntu-based one, since it ships with all the required NVIDIA drivers.
For the 20-billion-parameter model, a g5.xlarge instance is the right pick: it carries a single NVIDIA A10G with 24 GB of VRAM, which is enough GPU memory at a good price-to-performance balance. Once your instance is live, connect over SSH and install Ollama with a single command. Tools like nvtop (GPU status) and htop (system resources) are handy for keeping an eye on the server.
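Concretely, the install is the one-liner from Ollama's site, plus the optional monitoring tools. This sketch assumes the Ubuntu AMI mentioned above, with apt available:

```bash
# Official Ollama install script (sets up the ollama service)
curl -fsSL https://ollama.com/install.sh | sh

# Optional: GPU and system monitors
sudo apt update && sudo apt install -y nvtop htop
```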
Then, to fetch the OpenAI gpt-oss 20B model, run ollama pull gpt-oss:20b and wait for the download. After that, start it with ollama run gpt-oss:20b, setting keep-alive to -1 so the model stays loaded in memory. You will see GPU memory fill up as the model loads, and once it is ready you are dropped into an interactive prompt, so you know it works.
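In terminal terms, that looks like the following. The --keepalive flag and the -1 value for "keep loaded indefinitely" are based on Ollama's CLI help and FAQ; verify against your installed version:

```bash
# Download the 20B model weights (several GB, one-time)
ollama pull gpt-oss:20b

# Start an interactive session; -1 keeps the model resident in GPU memory
ollama run gpt-oss:20b --keepalive -1
```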
Now the model is live and you can use it as your own self-hosted AI, sending requests through Python or plain curl; responses come back fast since the weights stay loaded. The best part is that you can force it to emit results in a specific format, like JSON, by using tool calling. For example, you can have it extract the name of a city from a document, classify chats as suspicious or fine using enums, or label data for training smaller models.
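As a quick smoke test, Ollama exposes an OpenAI-compatible endpoint on port 11434, so a plain curl request against the server looks like this (endpoint path per Ollama's OpenAI compatibility docs; the prompt is just an example):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Which city is the Eiffel Tower in?"}]
  }'
```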
This process is called distillation, where big models teach smaller ones. Deterministic JSON output matters because it guarantees you can parse the results and use them in programs or simple tools, like classifying violations on parking tickets or flagging chat sessions for review. Using OpenAI's Python module and pointing its base URL at your Ollama server on AWS, you now have a full AI toolbox at your command.
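Here is a minimal sketch of that flow: the official OpenAI Python client pointed at the Ollama server, with a tool schema that constrains the label to an enum. The hostname, API key, and the classify_chat schema are placeholders of my own; forcing the tool via tool_choice is the OpenAI-style parameter, and support can vary by Ollama version:

```python
import json
from openai import OpenAI

# Point the official OpenAI client at the Ollama server on your EC2 box.
# "your-ec2-host" is a placeholder; Ollama accepts any non-empty API key.
client = OpenAI(base_url="http://your-ec2-host:11434/v1", api_key="ollama")

# A function schema the model must fill in; the enum constrains the label.
tools = [{
    "type": "function",
    "function": {
        "name": "classify_chat",
        "description": "Classify a chat transcript as suspicious or fine.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": ["suspicious", "fine"]},
                "reason": {"type": "string"},
            },
            "required": ["label", "reason"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "User A: hey, can you send me your password?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "classify_chat"}},
)

# The arguments come back as a JSON string matching the schema above.
args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
print(args["label"], "-", args["reason"])
```

Because the label is constrained to the enum, downstream code can branch on it directly instead of parsing free-form text.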
The system can score chats, explain its reasoning, and even point to which part of the conversation looked suspicious. For example, asking for a password will be flagged, but just chatting about the weather is fine.