DOCETL | ETL for unstructured data


In this recording, I explore DOCETL, an open-source package for declarative data processing powered by LLMs. It reminds me of the Hadoop days, when I used to write complex Java programs with custom input and output formats to find the schema in unstructured data. The approach looks similar, but it is more powerful with generative AI.

I have modified the code a little to add the YouTube parser to the pipeline as well. The revised code is in this repo:

https://github.com/rajib76/docetl_exa...

Code used in the video:
_________________________

Extracting the transcript from the YouTube video:

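import json

from youtube_transcript_api import YouTubeTranscriptApi

# Fetch the transcript segments; each segment is a dict with a "text" key.
segments = YouTubeTranscriptApi.get_transcript("dG9zjKpRmdY")

# Concatenate the segment texts into a single string.
transcript = ""
for segment in segments:
    transcript = transcript + " " + segment["text"]

print(transcript)

# Strip apostrophes before storing the transcript.
json_content = {"transcript": transcript.replace("'", "")}

# Write the file that pipeline_2.yaml reads as its input dataset.
with open("transcript.json", "w") as f:
    f.write(json.dumps(json_content))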
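
The concatenation loop above can also be written with str.join. The sketch below is a variant, not the code from the video: the helper names segments_to_record and write_dataset are hypothetical, and the sample segments are made up to stand in for the API response. It also writes the records as a JSON array, on the assumption (worth verifying against the DocETL documentation) that DocETL expects a JSON dataset to be a list of objects.

```python
import json


def segments_to_record(segments):
    """Join transcript segments (dicts with a "text" key) into one record."""
    text = " ".join(segment["text"].strip() for segment in segments)
    return {"transcript": text}


def write_dataset(records, path):
    """Write the records as a JSON array, one object per document.

    Assumption: the pipeline's dataset loader wants a list of objects.
    """
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)


# Hypothetical sample segments standing in for the API response.
sample_segments = [
    {"text": "hello and welcome", "start": 0.0, "duration": 1.5},
    {"text": "to the demo", "start": 1.5, "duration": 1.0},
]
record = segments_to_record(sample_segments)
```

Calling write_dataset([record], "transcript.json") would then produce a one-element dataset for the pipeline.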


And here is the pipeline_2.yaml for the data processing:

datasets:
  audio_transcripts:
    path: transcript.json
    type: file

default_model: gpt-4o-mini

operations:
  - name: extract_topics
    type: map
    output:
      schema:
        topics: list[str]
    prompt: |
      Analyze the following transcript :
      {{ input.transcript }}
      Extract and list all key topics mentioned in the transcript.
      If no topics are mentioned, return an empty list.

pipeline:
  steps:
    - name: analyze_video
      input: audio_transcripts
      operations:
        - extract_topics
  output:
    type: file
    path: audio_topics.json
    intermediate_dir: intermediate_results
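
Once the pipeline has run, the extracted topics end up in audio_topics.json. The sketch below shows one way to read them back. It assumes the output file is a JSON list of records, each carrying a topics list as declared in the output schema. The helper names collect_topics and load_topics are hypothetical and not part of DocETL.

```python
import json


def collect_topics(records):
    """Return the topics from all records, de-duplicated, in first-seen order."""
    seen = []
    for record in records:
        # Records without a "topics" field are skipped rather than failing.
        for topic in record.get("topics", []):
            if topic not in seen:
                seen.append(topic)
    return seen


def load_topics(path):
    """Load a pipeline output file and collect its topics.

    Assumption: the file holds a JSON list of objects.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return collect_topics(records)
```

For example, load_topics("audio_topics.json") would return a flat list of topic strings that can be printed or passed to a later step.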

Reference: https://ucbepic.github.io/docetl/
