Bing Chat can summarize web pages: it uses your computer's resources to load the data so it can process it. While this is a great deal for most personal use cases, my use case required a much bigger scale and a high level of automation: I am building a data pipeline which summarizes hundreds of articles every hour and compiles them into a beautiful digest. So Bing Chat, great for personal use, was not a viable approach.

I am building a newsletter platform that analyzes many text pages, computes authority, virality, and popularity scores for these pieces of content, and then compiles a great, easy-to-consume digest from the best content. This is a massive idea which needs a good foundation of underlying tech to extract and process large amounts of text data.

To achieve this, I have built an API which retrieves the real content of any URL, parses the HTML to extract the body of the article, cleans it up, and then feeds this body of text into GPT. My friends who hate reading long articles tested it & loved it!

- /extract?url= - extracts the article body from a URL. This endpoint does not use GPT; it just uses ScrapeNinja to extract the data and condition it into a useful HTML and Markdown format. This might be useful if you don't need a summary and just need clean article text for processing.
- /summarize?url= - extracts the article body from a URL and summarizes it using GPT. You can specify the length of the summary and whether you want HTML output or not.

I would appreciate your thoughts and feedback.

This would have been a pretty complicated project if I hadn't already had the awesome ScrapeNinja web scraping API running. ScrapeNinja does all the heavy lifting related to content retrieval: the extractor API basically calls the ScrapeNinja API, which rotates proxies and applies a smart retry strategy to reliably get the raw HTML of web articles. Then I apply a set of extractors to the HTML to isolate the body of the article and get rid of all the noise like website navigation, ads, and teasers of unrelated content. Finally, the body of the article is sent to the OpenAI API. The extractor API daemon is powered by Node.js and runs on my cloud server; a simplified sketch of this pipeline is shown below.

The tricky part of the article summarization process was when I started getting 400 errors from the OpenAI API. It turned out that I was exceeding the token limit for huge articles. To mitigate this, I now split the article into chunks and feed them into the OpenAI API sequentially.

Another obstacle was the difference in token cost between languages: GPT treats unicode symbols as more expensive than latin symbols.

*[Image: non-English words in a GPT token calculator]*
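To make the steps above concrete, here is a minimal sketch of the whole pipeline in Node.js (18+, for global `fetch`). The scraper request shape, the `data.body` response field, and the naive tag-stripping extractor are simplifying assumptions for illustration, not the production code:

```js
// summarize.js - minimal sketch of the extract-and-summarize pipeline.
// Assumptions: a ScrapeNinja-style scraping API behind SCRAPER_URL and the
// OpenAI chat completions endpoint; real request/response shapes may differ.

const SCRAPER_URL = process.env.SCRAPER_URL;   // e.g. your scraping API endpoint
const OPENAI_KEY = process.env.OPENAI_API_KEY;

// Step 1: get raw HTML through the scraping API. Proxy rotation and smart
// retries happen on the scraper's side, not here.
async function fetchRawHtml(url) {
  const res = await fetch(SCRAPER_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url }),
  });
  if (!res.ok) throw new Error(`Scraper failed: ${res.status}`);
  const data = await res.json();
  return data.body; // assumed field holding the raw HTML
}

// Step 2: strip navigation, ads and other noise. The real project runs a set
// of extractors; this naive placeholder just drops obvious non-content tags.
function extractArticleBody(html) {
  return html
    .replace(/<(script|style|nav|header|footer|aside)[\s\S]*?<\/\1>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Step 3: send the article body to the OpenAI API for summarization.
async function summarize(text, maxWords = 150) {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${OPENAI_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-3.5-turbo',
      messages: [
        { role: 'user', content: `Summarize in at most ${maxWords} words:\n\n${text}` },
      ],
    }),
  });
  if (!res.ok) throw new Error(`OpenAI failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Usage: node summarize.js https://example.com/article
fetchRawHtml(process.argv[2])
  .then(extractArticleBody)
  .then(summarize)
  .then(console.log)
  .catch(console.error);
```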
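The chunked summarization mentioned above could look like this, reusing the `summarize()` helper from the previous sketch. The 4-characters-per-token ratio is a rough rule of thumb that only holds for latin-script text, and the exact token budget depends on the model:

```js
// Sketch: split an oversized article into chunks that stay under the token
// limit, summarize the chunks sequentially, then condense the partials.

const MAX_TOKENS_PER_CHUNK = 3000;     // leaves headroom for prompt + reply
const APPROX_CHARS_PER_TOKEN = 4;      // much lower for non-latin scripts!
const MAX_CHARS = MAX_TOKENS_PER_CHUNK * APPROX_CHARS_PER_TOKEN;

// Split on sentence boundaries so each chunk stays coherent.
function splitIntoChunks(text) {
  const sentences = text.match(/[^.!?]+[.!?]+\s*/g) || [text];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > MAX_CHARS) {
      chunks.push(current);
      current = '';
    }
    current += sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Feed chunks into the OpenAI API one by one, then merge the partial
// summaries with a final summarization pass.
async function summarizeLongArticle(text) {
  const partials = [];
  for (const chunk of splitIntoChunks(text)) {
    partials.push(await summarize(chunk)); // sequential, stays under the limit
  }
  return partials.length === 1 ? partials[0] : summarize(partials.join('\n'));
}
```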
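The language cost difference is easy to verify with a tokenizer. Here is a small sketch, assuming the `js-tiktoken` package and its `cl100k_base` encoding (exact counts will vary, but the pattern is what matters):

```js
// Sketch: compare the token cost of latin vs non-latin text.
// Assumes: npm install js-tiktoken
import { getEncoding } from 'js-tiktoken';

const enc = getEncoding('cl100k_base');

const samples = [
  'The quick brown fox jumps over the lazy dog',     // latin script
  'Быстрая рыжая лиса прыгает через ленивую собаку', // cyrillic
  '敏捷的棕色狐狸跳过懒狗',                            // CJK
];

for (const text of samples) {
  const tokens = enc.encode(text).length;
  console.log(`${tokens} tokens for ${text.length} chars: ${text}`);
}
// Non-latin samples typically cost several times more tokens per character,
// which inflates both the token-limit pressure and the API bill.
```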